A digital human modeling method and system

By extracting 3D reconstruction data from monocular videos and combining it with a Gaussian physical model and a multimodal driving mechanism, natural and delicate facial expressions are generated, solving the problems of stiff expressions and semantic disconnect in existing digital human modeling technologies, and realizing a high-quality digital human solution.

CN122265488A · Pending · Publication Date: 2026-06-23 · HUBEI POST TELECOMM PLANNING DESIGN

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HUBEI POST TELECOMM PLANNING DESIGN
Filing Date: 2026-03-18
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing digital human modeling technology struggles to achieve natural and nuanced facial expressions, especially when dealing with complex materials, semi-transparent objects, or fine hair. Furthermore, expression-driven features are disconnected from audio and text semantics, failing to meet the needs of scenarios such as AR/VR social networking, virtual broadcasting, and film and television production.

Method used

By extracting 3D reconstruction data from monocular videos and combining Gaussian physical models and multimodal driving mechanisms, a digital human model with facial expressions is generated. This includes optimizing SMPL parameters, generating Gaussian point sets, rendering facial animation sequences, and achieving synchronization between lip movements and emotions through audio-text-emotion hybrid driving.

Benefits of technology

It achieves natural changes in digital human facial expressions and synchronization with semantic context, meeting personalized needs in diverse scenarios and enhancing the realism and immersion in the interaction process.

✦ Generated by Eureka AI based on patent content.

Figure CN122265488A_ABST

Abstract

The application discloses a digital human modeling method, comprising the following steps: S1, three-dimensional reconstruction data for each frame of image is extracted from a monocular video, the three-dimensional reconstruction data comprising the original data, a high-precision human body mask, the original audio signal, optimized SMPL parameters and camera parameters corresponding to each frame of image; S2, a three-dimensional human body mesh in a "T" pose is generated based on the optimized SMPL parameters and, after the mesh is discretized into a 3D Gaussian point set, the initialization of the 3D Gaussian point set is completed in combination with the original data, the high-precision human body mask and the camera parameters corresponding to each frame of image; S3, a facial animation sequence capable of representing mouth shape and emotional dynamics is obtained based on the original audio signal, a text description of an expected facial morphology and a reference image; S4, a Gaussian physical model is added in the 3DGS process to render the initialized 3D Gaussian point set, and the 3DGS rendering result is mixed and rendered with the facial animation sequence to generate a digital human model with facial expressions.

Description

Technical Field

[0001] This invention relates to the field of digital human modeling technology, and in particular to a digital human modeling method and system.

Background Technology

[0002] With the continuous advancement of technology and the deepening of social informatization, emerging concepts such as digital humans and metaverses are constantly appearing. Interactive digital humans, as a key medium bridging the physical world and digital space, are increasingly becoming a focus of attention for both academia and industry. High-quality and flexibly editable digital avatars have shown broad application prospects in various fields such as augmented reality / virtual reality (AR / VR), virtual try-on, film and television production, and telepresence. In recent years, with the cross-integration of deep learning, computer graphics, and high-precision sensor technologies, the research focus of interactive digital humans has gradually shifted from offline static modeling to real-time driving and dynamic interaction.

[0003] Early research primarily relied on parametric human body models, such as SCAPE and SMPL. These models control human shape and posture changes through low-dimensional parameters, offering high computational efficiency, but they have significant limitations in reproducing personalized appearance details. On the other hand, while 3D scanning technology based on multi-view camera arrays or depth sensors can reconstruct high-precision geometric models, its equipment is expensive, the process is complex, and it still faces challenges in capturing dynamic details.

[0004] Currently, digital human modeling technology mainly revolves around the competition and integration of two paradigms: implicit representation and explicit representation. Implicit neural representation methods, represented by Neural Radiance Fields (NeRF), map 3D spatial coordinates and viewing angle to color and density parameters using a multilayer perceptron (MLP). By leveraging neural networks to learn continuous volumetric representations of the scene, they achieve high-quality image synthesis from new perspectives, demonstrating excellent performance in terms of light and shadow continuity and detail reproduction. However, this method relies on a computational model of dense sampling along light rays and global MLP queries, resulting in low training and rendering speeds, making it difficult to meet the demands of real-time interactive scenarios such as AR / VR and virtual social interaction.

[0005] To overcome these limitations, the academic community has gradually formed two main technical paths. The first is the fusion method, which aims to combine the continuous modeling capabilities of implicit representations with the high efficiency of explicit data structures. By introducing explicit auxiliary structures such as hash coding, sparse tensors, or local feature grids, the learning burden of neural networks is shifted from global optimization to local feature extraction, thereby significantly improving training and inference speed while maintaining visual quality. The second is the pure explicit rasterization method, which directly leverages traditional graphics primitives such as point clouds and meshes, fully utilizing the highly optimized rasterization pipeline of GPUs to achieve extreme rendering efficiency. However, this method often affects visual realism when dealing with complex materials, semi-transparent objects, or fine hair due to geometric discontinuities and interpolation imperfections.

[0006] In recent years, 3D Gaussian Splatting (3DGS) has emerged as a groundbreaking technology, offering a new solution for portable digital human modeling. This method represents a scene as a set of learnable, anisotropic 3D Gaussian primitives whose attributes include position, color, opacity, and the scale and orientation information defined by the covariance matrix, which allows complex, irregular geometric surfaces to be fitted more accurately. During the rendering stage, 3DGS abandons the volumetric sampling and neural query mechanisms of NeRF. Instead, it sorts the 3D Gaussian primitives by depth, projects them onto the image plane, and alpha-blends them to composite the final image directly. This enables high-fidelity real-time rendering at hundreds of frames per second on consumer-grade GPUs. However, existing 3DGS solutions still have core flaws. They focus on static geometric reconstruction and basic pose-driven functions, without specifically optimizing for the facial expressions of digital humans. Gaussian points exist only as pure visual primitives, lacking expression-related feature decoupling and dynamic driving mechanisms, which makes it difficult for the generated digital humans to present natural and delicate facial expressions. Furthermore, the expression control of parametric human models (such as SMPL) relies on fixed low-dimensional parameters, which can only achieve simple movements such as basic lip opening and closing and eye closing. They cannot simulate the precise synchronization of mouth shape and phonemes when real humans speak, still less the micro-expression changes in the eyebrows, cheeks, and other regions caused by emotional fluctuations. The resulting expressions are stiff and monotonous and cannot vary dynamically with the intensity of the emotion. Moreover, the expressions lack biomechanical details such as facial muscle traction and skin wrinkling, making it difficult for digital humans to convey real emotions.
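The depth-sorted alpha blending that 3DGS uses in place of ray sampling can be illustrated for a single pixel. This is a minimal sketch, not the 3DGS renderer itself: the function name `composite_pixel` and the tuple layout are hypothetical, and each `alpha` is assumed to already combine the primitive's opacity with its projected 2D Gaussian falloff at that pixel.

```python
def composite_pixel(splats):
    """Front-to-back alpha blending of the splats that cover one pixel.

    Each splat is (depth, (r, g, b), alpha). The layout is an assumption
    made for this sketch, not a format defined by the source.
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0  # fraction of light not yet absorbed by nearer splats
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):  # nearest first
        weight = alpha * transmittance
        for i in range(3):
            color[i] += weight * rgb[i]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:  # nothing behind this splat is visible
            break
    return tuple(color), transmittance
```

Because the loop only sorts and accumulates, no neural network is queried per pixel, which is the property the paragraph above credits for the rendering speed.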

[0007] Furthermore, existing technologies generally fail to address the coupling issue between emotion and facial expression. They either generate average-level expressions using only a single emotion tag, failing to capture the natural fluctuations in intensity during speech; or the expression-driven mechanism is disconnected from audio and text semantics, resulting in mismatched lip movements and expressions that do not fit the context. These deficiencies in facial expression make it difficult for current digital humans to meet the needs of AR / VR social networking, virtual broadcasting, and film production scenarios.

Summary of the Invention

[0008] In view of the above-mentioned prior art, the present invention provides a digital human modeling method, which mainly solves the technical problems existing in the background art.

[0009] To achieve the above objectives, the technical solution of this invention is implemented as follows: The first aspect of this invention discloses a digital human modeling method, comprising the following steps:
S1. Extract three-dimensional reconstruction data for each frame of image from a monocular video. The three-dimensional reconstruction data includes the original data corresponding to each frame of image, a high-precision human body mask, the original audio signal, optimized SMPL parameters and camera parameters.
S2. Generate a 3D human body mesh in the "T" pose based on the optimized SMPL parameters. After the mesh is discretized into a 3D Gaussian point set, the initialization of the 3D Gaussian point set is completed by combining the original data corresponding to each frame of image, the high-precision human body mask and the camera parameters.
S3. Based on the original audio signal, the text description of the desired facial shape, and the reference image, obtain a facial animation sequence that can represent lip movements and emotional dynamics.
S4. In the 3DGS process, add a Gaussian physical model to render the initialized 3D Gaussian point set, and mix the 3DGS rendering result with the facial animation sequence to generate a digital human model with facial expressions.

[0010] Optionally, extracting 3D reconstruction data for each frame of the monocular video specifically includes:
S101. Input the monocular video with the target person into the ROMP model, and determine the camera parameters and initial SMPL human body model parameters corresponding to each frame of the monocular video.
S102. Detect two-dimensional human joints in each frame of image using the AlphaPose algorithm. Using the two-dimensional human joints as the constraint target, use the gradient descent algorithm to iteratively optimize the initial SMPL human model parameters, minimize the error between the three-dimensional model joint projection and the two-dimensional detection results, and obtain the optimized SMPL parameters.
S103. Generate human body bounding boxes based on the optimized SMPL parameters. Input each frame of the original image and the human body bounding boxes into the SAM model and generate a high-precision human body mask through pixel-level semantic segmentation.
S104. Simultaneously extract the original pixel data corresponding to each frame of the image and the original audio signal associated with the monocular video, and summarize them to form the three-dimensional reconstruction data.

[0011] Optionally, for the generated 3D human body mesh in the "T" pose, surface sampling is performed within the sampling area defined by the high-precision human body mask to discretize the mesh into a 3D Gaussian point set. During the discretization process, the camera parameters are used to back-project the 2D projection coordinates of the human body mesh into 3D space to calibrate the initial positions of the Gaussian points. At the same time, appearance attributes are assigned to each Gaussian point based on the color and transparency features of the corresponding pixels in the original data, finally generating the initialized 3D Gaussian point set.
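The surface-sampling part of this step can be sketched as area-weighted sampling of points on the mesh triangles. This is only an illustration under stated assumptions: the function names are hypothetical, the mesh is given as plain vertex and face lists, and the mask-based restriction of the sampling area, the camera-based position calibration, and the assignment of appearance attributes are all omitted.

```python
import math
import random


def triangle_area(a, b, c):
    """Area of the 3D triangle (a, b, c) from the cross product of two edges."""
    ab = [b[i] - a[i] for i in range(3)]
    ac = [c[i] - a[i] for i in range(3)]
    cross = (
        ab[1] * ac[2] - ab[2] * ac[1],
        ab[2] * ac[0] - ab[0] * ac[2],
        ab[0] * ac[1] - ab[1] * ac[0],
    )
    return 0.5 * math.sqrt(sum(x * x for x in cross))


def sample_surface(vertices, faces, n_points, seed=0):
    """Draw n_points uniformly over the mesh surface.

    Faces are chosen with probability proportional to their area, then a
    point is drawn uniformly inside the chosen triangle using barycentric
    coordinates. Each returned point could seed one Gaussian centre.
    """
    rng = random.Random(seed)
    areas = [triangle_area(*(vertices[i] for i in face)) for face in faces]
    points = []
    for face in rng.choices(faces, weights=areas, k=n_points):
        a, b, c = (vertices[i] for i in face)
        s = math.sqrt(rng.random())
        r = rng.random()
        u, v, w = 1.0 - s, s * (1.0 - r), s * r  # barycentric weights, sum to 1
        points.append(tuple(u * a[i] + v * b[i] + w * c[i] for i in range(3)))
    return points
```

Area weighting keeps the point density even across large and small triangles, so finely tessellated regions of the mesh do not receive disproportionately many Gaussian points.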

[0012] Optionally, a facial animation sequence characterizing lip movements and emotional dynamics is obtained based on the original audio signal, a text description of the desired facial shape, and a reference image, specifically including: Construct and train a frozen decoupling network, which includes a content encoder, an emotion encoder, and a decoder D; The Emotion2Vec model extracts frame-level emotion features from the original audio signal, the Whisper model transcribes the original audio signal into text, and the RoBERTa pre-trained language model then extracts semantic features from the text. A cross-attention mechanism allows the audio and text features to query and complement each other, and finally outputs a predicted emotion intensity curve that is aligned frame by frame with the input audio. The predicted emotion intensity of each frame is used to scale, frame by frame, the emotion embeddings generated by the frozen emotion encoder, producing a basic emotion feature sequence, which is then projected onto the frozen emotion encoder to generate the final emotion feature sequence; The acoustic features of the original audio signal are extracted using the Wav2Vec2.0 model and projected onto the frozen content encoder to generate lip-sync parameters; The reference image is input into the CLIP image encoder to extract visual features, which are then linearly mapped to obtain the visual guide code. The text description of the desired facial shape is input into the CLIP text encoder to extract semantic features, which are then linearly mapped to obtain the text guide code. The two are fused to obtain a multimodal guidance signal; The lip-sync parameters, the final emotion feature sequence, and the multimodal guidance signal are combined and input into the motion decoder, which decodes them to obtain a facial animation sequence that represents lip-sync and dynamic changes in emotion.
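Two of the operations named in this paragraph, cross-attention between audio and text features and frame-by-frame scaling of an emotion embedding by the predicted intensity, can be sketched in a few lines. The sketch makes several assumptions that are not in the source: features are plain lists of floats, the attention is single-head with no learned projections, and the function names are hypothetical. The pretrained models (Emotion2Vec, Whisper, RoBERTa) are not called here.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: every query row attends over all keys.

    With audio frames as queries and text tokens as keys/values, each audio
    frame receives a text-informed feature vector. Swapping the roles gives
    the other direction of the mutual query.
    """
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        out.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return out


def scale_per_frame(embedding, intensities):
    """Return one copy of `embedding` per frame, scaled by that frame's intensity."""
    return [[s * e for e in embedding] for s in intensities]
```

For example, `scale_per_frame(embedding, curve)` with a curve that rises during an emphasized word yields emotion features that strengthen over exactly those frames, rather than a constant expression for the whole utterance.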

[0013] Optionally, training the frozen decoupling network specifically includes: Input various types of speech segments and seven emotion tags into the EmoFace model to generate a facial animation dataset in which content and emotion can be arbitrarily combined; The facial animation dataset is input into the decoupling network consisting of the content encoder, the emotion encoder, and the decoder D, which is trained using a triple self-supervised learning strategy. After training, the weights of the content encoder, the emotion encoder, and the decoder D are frozen, forming the frozen content encoder and the frozen emotion encoder.

[0014] Optionally, training with the triple self-supervised learning strategy specifically includes: For an input sequence, content features related to pronunciation are extracted by the content encoder and emotion-related features are extracted by the emotion encoder; the content features and emotion features are input into decoder D for reconstruction to obtain a facial animation sequence consistent with the original input sequence; Two sets of facial animation sequences with the same content but different emotions are selected as a sequence pair. The content encoder extracts the content features of each of the two sets of sequences, and the emotion encoder extracts the emotion features of each of the two sets of sequences; the content features of the first set of sequences and its own emotion features are input into decoder D to reconstruct a facial animation sequence consistent with the first set of sequences; the content features of the second set of sequences and its own emotion features are input into decoder D to reconstruct a facial animation sequence consistent with the second set of sequences; The content features of the first set of sequences and the emotion features of the second set of sequences are input into decoder D to generate the first intermediate animation sequence; the content features of the second set of sequences and the emotion features of the first set of sequences are input into decoder D to generate the second intermediate animation sequence. The above feature extraction steps are repeated for the first and second intermediate animation sequences, and the original first set of sequences and the original second set of sequences are reconstructed respectively. This cycle-consistency constraint avoids the loss of key information during feature exchange.
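The three training objectives can be written compactly as follows. This is a hedged formalization rather than the applicant's own notation: the symbols $E_c$ (content encoder) and $E_e$ (emotion encoder) are introduced here for illustration because the original symbols did not survive extraction, $D$ is the decoder named in the text, $x$ is a single input sequence, $x_1, x_2$ are the paired sequences, and the choice of the $L_1$ norm is an assumption.

```latex
\begin{aligned}
\mathcal{L}_{\text{self}} &= \bigl\lVert D\bigl(E_c(x),\, E_e(x)\bigr) - x \bigr\rVert_1 \\
\mathcal{L}_{\text{pair}} &= \bigl\lVert D\bigl(E_c(x_1),\, E_e(x_1)\bigr) - x_1 \bigr\rVert_1
                           + \bigl\lVert D\bigl(E_c(x_2),\, E_e(x_2)\bigr) - x_2 \bigr\rVert_1 \\
x_{12} &= D\bigl(E_c(x_1),\, E_e(x_2)\bigr), \qquad
x_{21} = D\bigl(E_c(x_2),\, E_e(x_1)\bigr) \\
\mathcal{L}_{\text{cycle}} &= \bigl\lVert D\bigl(E_c(x_{12}),\, E_e(x_{21})\bigr) - x_1 \bigr\rVert_1
                            + \bigl\lVert D\bigl(E_c(x_{21}),\, E_e(x_{12})\bigr) - x_2 \bigr\rVert_1
\end{aligned}
```

Here $x_{12}$ carries the content of $x_1$ and the emotion of $x_2$, so recovering $x_1$ requires taking content from $x_{12}$ and emotion from $x_{21}$. If either encoder leaked information belonging to the other factor, the swap would corrupt the reconstruction and $\mathcal{L}_{\text{cycle}}$ would grow.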

[0015] Optionally, adding a Gaussian physical model during the 3DGS process to render the initialized 3D Gaussian point set specifically includes: Material labels are assigned to each 3D Gaussian point based on a semantic segmentation model. The material labels are mapped to differentiated physical parameters through a predefined material lookup table, and each 3D Gaussian point is assigned values according to these physical parameters; Environmental external forces are applied to the 3D Gaussian points that meet the conditions after assignment, and the collision constraint forces generated by the interaction between the 3D Gaussian points and external colliders are calculated. At the same time, a spring-damping network between adjacent 3D Gaussian points is dynamically constructed using a k-d tree, and the spring constraint forces are calculated according to Hooke's law to maintain the model topology; The resultant of the external forces and constraint forces acting on each 3D Gaussian point is calculated, and the velocity and spatial position of each 3D Gaussian point are updated by numerical integration; The updated positions and shapes of the 3D Gaussian points are passed to the standard 3D Gaussian splatting renderer to generate the final image of the current frame. The above process is repeated frame by frame to obtain the 3DGS rendering result.
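One simulation step of such a physical model might look like the sketch below. Everything concrete in it is an assumption made for illustration: the material table values, the function names, gravity as the only external force, a horizontal floor as the only collider, and semi-implicit Euler as the integrator. A brute-force neighbour search stands in for the k-d tree named in the text; it produces the same spring pairs, only more slowly.

```python
import math

# Hypothetical lookup table: material label -> (mass, spring stiffness, damping).
MATERIALS = {"skin": (1.0, 50.0, 2.0), "hair": (0.2, 10.0, 0.5)}


def build_springs(positions, radius):
    """Connect every pair of points closer than `radius`; store the rest length."""
    springs = []
    for i in range(len(positions)):
        for j in range(i + 1, len(positions)):
            rest = math.dist(positions[i], positions[j])
            if rest < radius:
                springs.append((i, j, rest))
    return springs


def step(positions, velocities, labels, springs, dt,
         gravity=(0.0, -9.8, 0.0), floor_y=0.0):
    """Advance all points by one time step and return (positions, velocities)."""
    # External force: gravity scaled by each point's mass.
    forces = [[MATERIALS[lab][0] * g for g in gravity] for lab in labels]
    for i, j, rest in springs:
        delta = [positions[j][k] - positions[i][k] for k in range(3)]
        length = math.sqrt(sum(d * d for d in delta)) or 1e-9
        unit = [d / length for d in delta]
        stiffness = 0.5 * (MATERIALS[labels[i]][1] + MATERIALS[labels[j]][1])
        damping = 0.5 * (MATERIALS[labels[i]][2] + MATERIALS[labels[j]][2])
        closing = sum((velocities[j][k] - velocities[i][k]) * unit[k] for k in range(3))
        # Hooke's law on the extension, plus damping on the relative speed.
        magnitude = stiffness * (length - rest) + damping * closing
        for k in range(3):
            forces[i][k] += magnitude * unit[k]
            forces[j][k] -= magnitude * unit[k]
    new_positions, new_velocities = [], []
    for pos, vel, force, lab in zip(positions, velocities, forces, labels):
        mass = MATERIALS[lab][0]
        v = [vel[k] + dt * force[k] / mass for k in range(3)]
        p = [pos[k] + dt * v[k] for k in range(3)]
        if p[1] < floor_y:  # collision constraint against the floor collider
            p[1] = floor_y
            v[1] = 0.0
        new_positions.append(p)
        new_velocities.append(v)
    return new_positions, new_velocities
```

The positions returned by `step` would then be handed to the splatting renderer for the current frame, and the loop repeats for the next frame.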

[0016] The second aspect of the present invention discloses a digital human modeling system, the modeling system including a server and a PC front-end, the server being used to implement the digital human modeling method as described in any of the preceding claims, and the PC front-end being used to display the modeling results.

[0017] A third aspect of the present invention discloses a non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a server, implements the digital human modeling method as described in any of the preceding claims.

[0018] The fourth aspect of the present invention discloses a computer program product, comprising a computer program, characterized in that, when the computer program is executed by a server, it implements the digital human modeling method as described in any of the preceding claims.

[0019] The beneficial effects of this invention are as follows: This invention uses SMPL human body parameters as a reliable geometric prior, and through surface sampling of the "T" pose mesh in the normalized space, it discretizes the continuous geometry into a high-fidelity 3D Gaussian point set. This not only lays a differentiable and structured foundation for neural representation, but also accurately restores the personalized body features of the target person. It ensures the accuracy and delicacy of the geometric form of the digital human from the source, and effectively reduces the ambiguity of 3D reconstruction under monocular video input. Through the constructed audio-text-emotion multimodal hybrid driving mechanism, precise synchronization between lip movements and audio phonemes is achieved. It can also capture the emotional fluctuations in speech rhythm and combine them with multimodal guidance signals. This allows digital human expressions to change naturally with semantic context, getting rid of the rigidity and monotony caused by single emotional labels. It also enables fine control of facial morphology through text descriptions or reference images, meeting personalized needs in diverse scenarios. This invention upgrades the Gaussian point, which is traditionally used only as a visual rendering primitive, into an entity with physical properties such as mass and elasticity. It achieves differentiated biomechanical parameter configuration through material property mapping, maintains the stability of the model's topological structure by relying on spring network constraints, and dynamically adjusts the shape of the Gaussian point by combining a stress-driven shape deformation solver. This makes the digital human's facial movements not only visually realistic but also strictly follow the laws of biomechanics. 
It can realistically simulate natural physical phenomena such as soft tissue traction and collision indentation, greatly improving the realism and immersion in the interaction process. It provides a high-quality and highly flexible digital human solution for real-time interactive scenarios such as AR / VR social networking, virtual anchors, and film and television production.

Attached Figure Description

[0020] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only preferred embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0021] Figure 1 is a flowchart of the digital human modeling method in the embodiments of this application;
Figure 2 is a process simulation diagram of the digital human modeling method in the embodiments of this application;
Figure 3 is a schematic diagram of the process for obtaining three-dimensional reconstruction data in an embodiment of this application;
Figure 4 is a schematic diagram of the process for generating facial animation sequences in an embodiment of this application;
Figure 5 is a schematic diagram of the rendering process based on the Gaussian physical model in an embodiment of this application;
Figure 6 is a schematic diagram of the architecture of the modeling system in the embodiments of this application.

Detailed Implementation

[0022] The technical solution of the present invention will be further described in detail below with reference to the accompanying drawings and specific embodiments. Unless otherwise defined, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. The terminology used in this specification is for the purpose of describing particular embodiments only and is not intended to limit the invention. In the following description, the expression "some embodiments" refers to a subset of all possible embodiments; however, it should be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with each other without conflict.

[0023] In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the invention. However, it will be apparent to those skilled in the art that the invention can be practiced without one or more of these details. In other instances, certain technical features well-known in the art have not been described in order to avoid obscuring the invention.

[0024] It should be understood that the present invention can be embodied in various forms and should not be construed as being limited to the embodiments set forth herein. Rather, providing these embodiments will make the disclosure thorough and complete, and will fully convey the scope of the invention to those skilled in the art. Furthermore, the terminology used herein is intended only to describe particular embodiments and is not intended to limit the invention. When used herein, the singular forms “a,” “an,” and “the” are also intended to include the plural forms unless the context clearly indicates otherwise. It should also be understood that the terms “comprises” and/or “comprising,” when used in this specification, identify the presence of the stated features, integers, steps, operations, elements, and/or components, but do not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups. When used herein, the term “and/or” includes any and all combinations of the associated listed items.

[0025] It should also be noted that when an element is referred to as being "fixed to" another element, it can be directly attached to the other element or there may be an intervening element. When an element is referred to as being "connected to" another element, it can be directly connected to the other element or there may be an intervening element. The terms "vertical," "horizontal," "inner," "outer," "left," "right," and similar expressions used herein are for illustrative purposes only and do not represent the only possible implementation.

[0026] To fully understand this invention, a detailed structure will be presented in the following description to illustrate the technical solution proposed by this invention. Optional embodiments of the invention are described in detail below; however, in addition to these detailed descriptions, the invention may have other embodiments.

[0027] Referring to Figures 1 and 2, the first aspect of this invention discloses a digital human modeling method, comprising the following steps: S1. Extract three-dimensional reconstruction data for each frame of image from a monocular video. The three-dimensional reconstruction data includes the original data corresponding to each frame of image, a high-precision human body mask, the original audio signal, optimized SMPL parameters and camera parameters. Preferably, in this embodiment, a monocular video of the target person is acquired using a mobile phone or a camera device. During the acquisition process, it is ensured that the target person is fully presented in the frame and that the video includes natural body movements and facial expressions. This monocular video serves as the raw input for the extraction of 3D reconstruction data. Extracting 3D reconstruction data for each frame of the monocular video specifically includes:
S101. Input the monocular video with the target person into the ROMP model, and determine the camera parameters and initial SMPL human body model parameters corresponding to each frame of the monocular video.
S102. Detect two-dimensional human joints in each frame of image using the AlphaPose algorithm. Using the two-dimensional human joints as the constraint target, iteratively optimize the initial SMPL human model parameters using the gradient descent algorithm to minimize the error between the three-dimensional model joint projection and the two-dimensional detection results, and obtain the optimized SMPL parameters.
S103. Generate human body bounding boxes based on the optimized SMPL parameters. Input each frame of the original image and the human body bounding boxes into the SAM model and generate a high-precision human body mask through pixel-level semantic segmentation.
S104. Simultaneously extract the original pixel data corresponding to each frame of the image and the original audio signal associated with the monocular video, and summarize them to form the three-dimensional reconstruction data.

[0028] Referring to Figure 3, the acquired monocular video is input into the ROMP (Monocular, One-stage, Regression of Multiple 3D People) model. This model automatically decomposes the continuous video sequence frame by frame, and simultaneously estimates the camera parameters and initial SMPL human body model parameters for each frame. The camera parameters include intrinsic and extrinsic parameters. Intrinsic parameters describe the camera's own optical characteristics, such as the lens focal length and the image principal point coordinates. Extrinsic parameters describe the spatial position of the camera relative to the target person, such as the rotation matrix and translation vector. The SMPL human body model parameters are the human body parameters of the SMPL model, a classic parametric human geometric model whose core function is to describe the body shape and posture changes of the human body through low-dimensional parameters. It is a basic geometric tool widely used in digital human modeling, 3D human body reconstruction and other fields. In this embodiment, the initial SMPL human body model parameters include 10-dimensional shape parameters and 23 joint angle posture parameters. The shape parameters characterize the inherent physical characteristics of the target person, such as height, shoulder width, and body shape. The posture parameters describe the bending, rotation and other posture states of the target person's limb joints, providing core basic parameter support for subsequent data processing.
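For reference, the way intrinsic and extrinsic parameters act together can be written with the standard pinhole camera model. The notation below is the conventional textbook form, not notation taken from the application, and it assumes an ideal camera without lens distortion: $X$ is a 3D point in world coordinates, $R$ and $t$ are the extrinsic rotation matrix and translation vector, $f_x, f_y$ are the focal lengths in pixels, $(c_x, c_y)$ is the principal point, and $s$ is the depth of the point in the camera frame.

```latex
s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix}
  = K \,(R\,X + t),
\qquad
K = \begin{bmatrix} f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 & 1 \end{bmatrix}
```

Dividing the first two rows by $s$ gives the pixel coordinates $(u, v)$ at which a 3D joint of the SMPL model appears in the frame.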

[0029] Due to the inherent depth ambiguity limitations of monocular vision, the initial SMPL human model parameters directly estimated by the ROMP model inevitably contain noise and bias. When the corresponding 3D human model is reprojected into image space, it is easy to find discrepancies with the real human contour and joint positions, thus affecting the accuracy of subsequent 3D reconstruction. Therefore, it is necessary to introduce the AlphaPose algorithm to process each frame of the image. This algorithm accurately detects the coordinates of 26 2D human joints in the image, which cover key parts such as the head, torso, and limbs. Using these 26 2D human joints as explicit constraints, the gradient descent algorithm is used to iteratively optimize the initial SMPL human model parameters. During the iteration process, the projection error between the 2D coordinates of the 3D model joints after projection through the camera parameters and the 2D joint coordinates detected by AlphaPose is continuously calculated. By continuously adjusting the shape and pose parameters of SMPL, the projection error is gradually minimized until the error converges to a preset reasonable range, finally obtaining optimized SMPL parameters with higher accuracy and a better fit to the real human form.
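The optimization loop described above can be illustrated with a deliberately reduced problem. In this sketch a 3D translation of the joints stands in for the SMPL shape and pose parameters, since evaluating the real SMPL model requires its learned template and blend weights. The function names, the learning rate, and the finite-difference gradient are all assumptions; a practical implementation would more likely differentiate through the SMPL forward function with an automatic differentiation framework.

```python
def project(point, focal, cx, cy):
    """Pinhole projection of a camera-space 3D point to pixel coordinates."""
    x, y, z = point
    return (focal * x / z + cx, focal * y / z + cy)


def reprojection_error(params, joints3d, joints2d, focal, cx, cy):
    """Mean squared pixel distance between projected and detected joints.

    `params` is a 3D shift applied to every joint (the stand-in for the
    SMPL parameters being optimized).
    """
    total = 0.0
    for j3, j2 in zip(joints3d, joints2d):
        moved = tuple(j3[k] + params[k] for k in range(3))
        u, v = project(moved, focal, cx, cy)
        total += (u - j2[0]) ** 2 + (v - j2[1]) ** 2
    return total / len(joints3d)


def fit(params, joints3d, joints2d, focal, cx, cy,
        lr=1e-6, steps=200, eps=1e-5):
    """Plain gradient descent using a forward finite-difference gradient."""
    params = list(params)
    for _ in range(steps):
        base = reprojection_error(params, joints3d, joints2d, focal, cx, cy)
        grad = []
        for k in range(len(params)):
            bumped = params[:]
            bumped[k] += eps
            bumped_err = reprojection_error(bumped, joints3d, joints2d, focal, cx, cy)
            grad.append((bumped_err - base) / eps)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params
```

The small learning rate reflects the units involved: the error is measured in squared pixels while the parameters are in metres, so the gradient magnitudes are large. In practice the loop would stop once the error falls inside the preset range mentioned above rather than after a fixed number of steps.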

[0030] Based on the optimized SMPL parameters, the forward kinematics of the SMPL model is used to substitute the posture parameters (joint angles) and shape parameters into the preset human skeleton kinematic model and solve for the three-dimensional coordinates of each basic three-dimensional joint in the world coordinate system. This process uses the skeletal chain relationship built into the SMPL model to convert the joint angles into relative position offsets between joints. Combined with the correction of the human body shape by the shape parameters, it ensures that the three-dimensional joint coordinates accurately reflect the actual body shape and posture of the human body in the current frame. The coordinates of all 3D joints are then projected onto the 2D image space using the camera parameters to obtain the corresponding 2D joint coordinates. After all 2D joint coordinates are obtained, the coordinates are traversed in the horizontal direction of the image, and the minimum value is taken as the left boundary and the maximum value as the right boundary. Similarly, the coordinates are traversed in the vertical direction of the image, and the minimum value is taken as the upper boundary and the maximum value as the lower boundary, thus forming the initial 2D human bounding box.

[0031] Considering that the human body edges may extend beyond the area defined by the joints, the initial bounding box needs to be optimized. A bounding box dilation strategy is adopted, expanding the left boundary of the initial bounding box by 5-10 pixels to the left, the right boundary by 5-10 pixels to the right, the top boundary by 5-10 pixels upwards, and the bottom boundary by 5-10 pixels downwards. The number of dilated pixels can be adaptively adjusted according to the image resolution; for example, 8 pixels are expanded at a 1920×1080 resolution. Simultaneously, the coordinates of the dilated bounding box are calibrated to ensure that the coordinates of the top-left and bottom-right corners are within the image resolution range, avoiding invalid coordinates where the boundary exceeds the image. The resulting 2D human body bounding box can accurately select the complete human body region in each frame of the image, effectively eliminating interference from the background region for subsequent segmentation while ensuring that the details of the human body edges are not cropped. Each frame of the original image, along with the bounding box of the human body, is input into the SAM (Segment Anything Model). The SAM model uses its built-in Transformer architecture to extract features and perform pixel-level semantic analysis on the input image, accurately distinguishing human body pixels from background pixels, and then generating a high-precision binary human body mask. In this mask, the human body area is marked with white pixels and the background area is marked with black pixels. This not only completely eliminates the interference of complex backgrounds (such as environmental scenery, clothing textures, etc.) on subsequent processing, but also completely preserves key information such as the human body contours and clothing details.
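The bounding-box construction of [0030] and the dilation and calibration of [0031] can be sketched as follows. The function names and the sample joint coordinates are hypothetical; the 8-pixel padding at 1920×1080 is the example value given in the text.

```python
def joints_bbox(joints_2d):
    """Initial box: min/max of the projected 2D joints in x and y."""
    xs = [x for x, _ in joints_2d]
    ys = [y for _, y in joints_2d]
    return min(xs), min(ys), max(xs), max(ys)  # left, top, right, bottom

def dilate_and_clamp(bbox, pad, width, height):
    """Expand each side by `pad` pixels, then keep the corners inside the image."""
    left, top, right, bottom = bbox
    left = max(0, left - pad)
    top = max(0, top - pad)
    right = min(width - 1, right + pad)
    bottom = min(height - 1, bottom + pad)
    return left, top, right, bottom

# Hypothetical joints of a person whose feet are close to the lower image edge.
joints = [(960, 120), (700, 400), (1220, 400), (900, 1076), (1020, 1076)]
box = dilate_and_clamp(joints_bbox(joints), pad=8, width=1920, height=1080)
```

Here the bottom edge would reach row 1084 after dilation, so it is clamped to the last valid row, 1079. The resulting box is what would be passed to the segmentation model as a box prompt.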

[0032] After completing the above SMPL parameter optimization and high-precision human body mask generation, the original pixel data corresponding to each frame image is extracted synchronously. This original pixel data is RGB three-channel image data, with each channel having a pixel depth of 8 bits. The resolution is consistent with the acquired monocular video, which can completely preserve the appearance details of the human body in each frame image. At the same time, the original audio signal aligned with the time sequence of the monocular video is extracted. The time stamp synchronization mechanism ensures that the audio signal corresponds one-to-one with the time stamp of each frame image, avoiding the problem of lip movements and audio being out of sync during subsequent animation generation.
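One simple way to realize the one-to-one correspondence between frames and audio described in [0032] is to map each frame index to a half-open range of audio samples. This is a minimal sketch assuming a constant integer frame rate and sample rate; the function name is hypothetical and the application does not specify its synchronization mechanism at this level of detail.

```python
def audio_span_for_frame(frame_idx, fps, sample_rate):
    """Return the [start, end) audio sample range covered by one video frame.

    Integer arithmetic keeps consecutive spans contiguous, with no gaps
    or overlaps, even when sample_rate is not a multiple of fps.
    """
    start = frame_idx * sample_rate // fps
    end = (frame_idx + 1) * sample_rate // fps
    return start, end

spans_25fps = [audio_span_for_frame(i, 25, 16000) for i in range(3)]
spans_30fps = [audio_span_for_frame(i, 30, 16000) for i in range(3)]
```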

[0033] S2. Generate a 3D human body mesh in the "T" pose based on the optimized SMPL parameters. After discretizing it into a 3D Gaussian point set, the initialization of the 3D Gaussian point set is completed by combining the original data corresponding to each frame image, the high-precision human body mask and camera parameters. Specifically, the optimized SMPL parameters are substituted into the SMPL standard template mesh, and the template mesh is deformed to match the inherent body features of the target person, such as height, shoulder width, and weight. Then, the posture parameters are fixed to the standard joint angles corresponding to the "T" posture, i.e., a standard posture with the body upright, arms outstretched horizontally, and palms facing forward. Through forward kinematic calculations of the SMPL model, a well-structured and accurately shaped "T" posture 3D human body mesh is generated. This mesh retains the target person's personalized body features while possessing a unified and standardized posture structure. The generated "T" posture 3D human body mesh is discretized by surface sampling within the human body area defined by the high-precision human body mask. During this process, using the previously acquired camera intrinsic and extrinsic parameters, the two-dimensional projection coordinates corresponding to the sample points on the surface of the human body mesh are back-projected into three-dimensional space through the inverse operation of perspective projection. Combined with the original three-dimensional coordinates of the SMPL mesh, double calibration is performed to ensure that the initial three-dimensional coordinates of each 3D Gaussian point are accurately aligned with the corresponding human body area in the original image. This avoids Gaussian point position shifts due to depth estimation deviations and ensures visual consistency during subsequent rendering. At the same time, through the established spatial mapping relationship between 3D Gaussian points and pixels in the original image, the RGB color value at the position corresponding to each Gaussian point in the original image is extracted as the basic color attribute of that Gaussian point. Combined with the pixel value distribution of the high-precision human body mask, high transparency values are assigned to Gaussian points inside the human body area. The transparency gradient of Gaussian points in the transition area between the human body edge and the background is adjusted so that the transparency attribute of the Gaussian points restores the natural transition of the human body contour and avoids harsh edge truncation during subsequent rendering.

[0034] Finally, the "T" pose 3D human body mesh is discretized into a 3D Gaussian point set that combines accurate 3D position and realistic appearance attributes. Each Gaussian point contains the initial position after camera parameter calibration, covariance information adapted to human body shape, color features of the corresponding original pixel, and transparency attribute based on mask optimization.

[0035] S3. Based on the original audio signal, the text description of the desired facial shape, and the reference image, obtain a facial animation sequence that can represent lip movements and emotional dynamics. Referring to Figure 4, preferably, in this embodiment, the inherent "content" and "emotion" factors in facial animation are first completely separated. Then, using the decoupled independent feature spaces, driving signals from the original audio signal, the text description of the desired facial shape, and the reference image are injected respectively, and the dynamic intensity of the emotion is finely controlled, ultimately synthesizing a highly natural and rich facial animation sequence that can represent lip movements and emotional dynamics. This specifically includes:
S301. Construct, train, and freeze a decoupling network, wherein the decoupling network includes a content encoder, an emotion encoder, and a decoder D.
S302. Extract the frame-level emotion features of the original audio signal with the Emotion2Vec model, transcribe the original audio signal into text with the Whisper model, and extract the semantic features of the text with the RoBERTa pre-trained language model. A cross-attention mechanism allows the audio and text features to query and complement each other, and finally outputs a predicted emotion intensity curve that is aligned frame by frame with the input audio.
S303. Use the predicted emotion intensity of each frame to scale, frame by frame, the emotion embedding generated by the frozen emotion encoder, producing a basic emotion feature sequence, which is then projected onto the frozen emotion latent space to generate the final dynamic emotion feature sequence.
S304. Extract the acoustic features of the original audio signal using the Wav2Vec2.0 model and project them onto the frozen content latent space to generate lip-sync parameters.
S305. Input the reference image into the CLIP image encoder to extract visual features and obtain the visual guidance code through linear mapping. Input the text description of the desired facial shape into the CLIP text encoder to extract semantic features, which are then linearly mapped to obtain the text guidance code. The two codes are fused to obtain a multimodal guidance signal.
S306. Input the lip-sync parameters, the final dynamic emotion feature sequence, and the multimodal guidance signal together into the motion decoder, which decodes them into a facial animation sequence that can represent lip-sync and dynamic changes in emotion.

[0036] Specifically, the above implementation process can be broken down into two major stages: feature decoupling pre-training and dynamic animation generation and inference. In the feature decoupling pre-training stage, the core objective is to train a neural network that can understand and separate the intrinsic components of facial animation. This is regarded as a self-supervised representation learning problem. By designing a special training mechanism, the network is forced to encode content (i.e., lip and jaw movements related to pronunciation) and emotion (i.e. eyebrow, eye, and cheek movements related to emotion) into two non-interfering latent spaces.

[0037] During the feature decoupling pre-training phase, a frozen decoupling network is obtained, which includes a content encoder, an emotion encoder, and a decoder D. The content encoder ignores facial expressions and focuses solely on the lip shapes related to pronunciation. The emotion encoder ignores lip shape changes and extracts only the facial muscle movement patterns related to emotional state. Decoder D recombines the content features and emotional features to accurately reconstruct the original facial animation sequence. After pre-training, the frozen content latent space and the frozen emotional latent space are obtained.

[0038] In the dynamic animation generation and inference stage, lip-sync parameters are generated first, using an Audio Content Mapping (ACM) module. Relying on the pre-trained frozen content latent space, the module maps acoustic features in the audio to lip-sync parameters synchronized with phonemes, providing the underlying driving signal for lip-sync in facial animation. The input to this module is the output of the Wav2Vec2.0 model applied to the original audio signal. Through its front-end feature extraction, the Wav2Vec2.0 model captures low-level acoustic features that are highly correlated with lip shape. These acoustic features are then input into a lightweight mapping network (composed of multiple fully connected layers) in the ACM module. The network projects these features onto the frozen content latent space, generating a sequence of lip movement parameters that are precisely synchronized with the audio phonemes. To ensure the accuracy and consistency of the lip-sync parameters, the ACM module's training process employs a dual loss function for optimization. This training loss includes a reconstruction loss (L_recon) and a similarity loss (L_sim). The reconstruction loss ensures that the generated lip-sync parameters can reproduce realistic lip movement details, guaranteeing the accuracy of the final animation. The similarity loss constrains the content features extracted from the audio to be consistent, in terms of cosine distance, with the content features extracted from the real animation in the frozen content latent space. The calculation formula for the ACM module's training process is shown below:
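A minimal sketch of such a dual loss, assuming a mean-squared-error reconstruction term, a cosine-distance similarity term, and a simple weighted sum. These specific forms, and the names `acm_loss` and `sim_weight`, are illustrative assumptions and not the formula of this application.

```python
import math

def mse(pred, target):
    """Reconstruction term: mean squared error (an assumed form of L_recon)."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def cosine_distance(a, b):
    """Similarity term: 1 - cosine similarity (an assumed form of L_sim)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def acm_loss(pred_params, true_params, audio_content, anim_content, sim_weight=1.0):
    """Weighted sum of the two terms; sim_weight is a hypothetical hyperparameter."""
    return (mse(pred_params, true_params)
            + sim_weight * cosine_distance(audio_content, anim_content))

perfect = acm_loss([0.2, 0.4], [0.2, 0.4], [1.0, 2.0, 3.0], [1.0, 2.0, 3.0])
orthogonal = cosine_distance([1.0, 0.0], [0.0, 1.0])
```

The loss is zero only when the predicted lip-sync parameters match the real ones and the audio-derived content features point in the same direction as the content features of the real animation.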

[0039] At the same time, a dynamic emotional feature sequence is generated synchronously. First, the Emotion2Vec model extracts frame-level emotional features related to prosody and tone from the original audio signal. The original audio signal is transcribed into text using the Whisper model, and the semantic embeddings of the text are then obtained using a pre-trained language model such as RoBERTa, uncovering the hidden emotional tendencies behind the text. A cross-attention mechanism then lets the semantic embeddings and the emotional features query and supplement each other, and the final output is a predicted emotion intensity curve that is aligned frame by frame with the input audio.

[0040] To ensure the accuracy of the emotion intensity, a ground truth value needs to be calculated. Since a single emotion label can only guide the generation of an average-level expression and cannot reflect the natural fluctuations in facial expressions when a person is speaking, a set of facial controllers representing the upper face (eyebrows, eyes) and the lips is selected. The L1 norm of the displacement of each controller in each frame t is calculated, and the norms are summed over the selected controllers to give the ground truth value of the emotion intensity for that frame. The calculation formula is as follows:

[0041] where the index set of the summation is the set of selected controllers.
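The per-frame ground truth of [0040] can be sketched as below. The sketch assumes that "displacement" means the offset of each controller from a neutral (rest) pose, which the text does not state explicitly; `intensity_ground_truth` and the sample values are hypothetical.

```python
def intensity_ground_truth(frames, neutral):
    """Sum of per-controller L1 displacement norms for every frame.

    frames:  list of frames, each a list of controller vectors
    neutral: rest-pose vector for each controller (assumed reference)
    """
    curve = []
    for frame in frames:
        total = 0.0
        for ctrl, rest in zip(frame, neutral):
            total += sum(abs(c - r) for c, r in zip(ctrl, rest))  # L1 norm
        curve.append(total)
    return curve

neutral = [(0.0, 0.0), (0.0, 0.0)]            # two controllers, e.g. brow and lip
frames = [
    [(0.0, 0.0), (0.0, 0.0)],                 # neutral frame
    [(0.5, -0.25), (0.0, 1.0)],               # expressive frame
]
curve = intensity_ground_truth(frames, neutral)
```

A frame with no controller movement yields an intensity of zero, and larger or more numerous displacements yield a proportionally larger value, which gives the intensity predictor a frame-level target instead of a single label.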

[0042] Based on the predicted emotion intensity value for each frame, the basic emotion embedding produced by the frozen emotion encoder is scaled frame by frame to generate a dynamically changing basic emotion feature sequence. The calculation formula is as follows:
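Read literally, the frame-by-frame scaling in [0042] multiplies one basic emotion embedding by the predicted intensity of each frame. A minimal sketch under that assumption (plain multiplication; the function name is hypothetical):

```python
def scale_emotion_embedding(base_embedding, intensity_curve):
    """Return one scaled copy of the base embedding per frame."""
    return [[alpha * v for v in base_embedding] for alpha in intensity_curve]

base = [0.5, -1.0]                 # toy 2-D emotion embedding
intensities = [0.5, 1.0, 2.0]      # predicted intensity for three frames
sequence = scale_emotion_embedding(base, intensities)
```

The direction of the embedding (the emotion category) stays fixed while its magnitude follows the intensity curve.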

[0043] The basic emotion feature sequence is projected by a fusion encoder onto the frozen emotional latent space. The training loss in this process includes a reconstruction loss L_recon, a similarity loss L_sim, and an intensity prediction loss L_int. The similarity loss ensures that the predicted dynamic emotion feature sequence remains consistent in overall distribution with the emotional feature sequence extracted from the real animation, and the intensity prediction loss ensures that the predicted intensity curve is as close as possible to the pseudo-labels. The calculation formula is as follows:

[0044] Furthermore, to ensure that the visual representation of the generated facial animation conforms to specific morphological requirements, a multimodal guidance signal needs to be constructed. A reference image is input into the CLIP image encoder to extract high-level visual features of the face, which are then converted into a standardized visual guidance code via a linear mapping layer. The text description of the desired facial shape is input into the CLIP text encoder, which extracts semantic features from the text and converts them into a standardized text guidance code via a linear mapping layer. The visual guidance code and the text guidance code are kept consistent in dimension, and a dynamic weighted fusion strategy fuses the two into a unified multimodal guidance signal. Finally, the generated lip-sync parameter sequence, the dynamic emotional feature sequence, and the multimodal guidance signal are fed together into a motion decoder, which employs a Transformer architecture. Its self-attention mechanism effectively models inter-frame temporal dependencies, ensuring the continuity of lip movements and a smooth transition in emotional intensity. The decoder integrates the core information from the three types of signals and maps it to control parameters for facial animation. These include lip movement control parameters that are precisely synchronized with audio phonemes, facial expression parameters that reflect dynamic emotional fluctuations, and morphological constraint parameters that conform to the reference image and text description. Through the decoding operation, the final facial animation sequence is generated. This sequence achieves strict synchronization between lip movements and the original audio, presents natural and subtle emotional changes through dynamic emotional intensity adjustment, and faithfully meets the facial morphology requirements of the reference image and text description.

[0045] In some embodiments of this application, training the frozen decoupling network requires generating a dedicated dataset for network training. This dataset must satisfy the characteristic that content and emotion can be arbitrarily combined to support the effectiveness of decoupling learning. Specifically, an existing high-performance model, EmoFace, is utilized. By inputting various speech segments and seven basic emotion labels (including happy, angry, sad, calm, surprised, fearful, and disgusted) into EmoFace, a large-scale facial animation dataset with arbitrarily combinable content and emotion is generated in batches through the efficient generation capabilities of the EmoFace model. The core advantage of this dataset is that the "content" and "emotion" attributes of each animation sequence are independently controllable, that is, the same speech content can be equipped with different emotion labels, and the same emotion label can also be matched with different speech content.

[0046] The facial animation dataset is input into the decoupling network consisting of the content encoder, the emotion encoder, and the decoder D. The content encoder's core function is to extract content features related to pronunciation, specifically the facial movement information directly related to phoneme pronunciation, such as lip movements and jaw opening and closing, while ignoring irrelevant interference from emotional expression. The emotion encoder focuses on extracting emotional features related to emotional state, including facial movement patterns such as raised or furrowed eyebrows, opening and closing of the eyes, and contraction of the cheek muscles, without being affected by the content of the speech. The decoder D receives the two types of features and recombines them to accurately restore the original facial animation sequence consistent with the input. Its role is to verify the completeness of the encoders' feature extraction and the recombinability of the two types of features.

[0047] To ensure the network can effectively separate content and emotion features, a triple self-supervised learning strategy is employed for training. Once the triple self-supervised learning training process converges, all network weights of the content encoder, the emotion encoder, and the decoder D are frozen, forming a stable frozen content encoder and frozen emotion encoder together with the corresponding frozen content latent space and frozen emotion latent space. These serve as the core benchmark for the subsequent dynamic animation generation and inference stage, providing reliable feature space support for the mapping of audio features to lip-sync parameters and for the generation of dynamic emotion feature sequences through emotion intensity adjustment.

[0048] In some embodiments of this application, a triple self-supervised learning strategy is employed for training, specifically including the following. First, for an input sequence, the content encoder extracts the content features related to pronunciation and the emotion encoder extracts the emotion-related features; the content features and emotion features are input into decoder D for reconstruction to obtain a facial animation sequence consistent with the original input sequence. Second, two facial animation sequences with the same content but different emotions are selected as a sequence pair; the content encoder extracts the content features of each sequence and the emotion encoder extracts the emotional features of each sequence. The content features of the first sequence and its own emotional features are input into decoder D to reconstruct a facial animation sequence consistent with the first sequence, and the content features of the second sequence and its own emotional features are input into decoder D to reconstruct a facial animation sequence consistent with the second sequence. Third, the content features of the first sequence and the emotional features of the second sequence are input into decoder D to generate the first intermediate animation sequence, and the content features of the second sequence and the emotional features of the first sequence are input into decoder D to generate the second intermediate animation sequence. The above feature extraction steps are repeated for the first and second intermediate animation sequences, and the original first sequence and the original second sequence are reconstructed respectively. This cyclic consistency constraint avoids the loss of key information during feature exchange.

[0049] Specifically, for each input sequence in the dataset, the content encoder extracts the content features and the emotion encoder extracts the emotional features, and both types of features are input into the decoder D for reconstruction. The goal is for the decoder to output an animation sequence identical to the original input sequence. This ensures that the encoders extract the target features completely and that the decoder can reconstruct from them. The calculation formula is shown below:

[0050] An overlap-swapping strategy is employed to achieve initial feature decoupling. This strategy forces similar features to cluster in the latent space, eliminating irrelevant information. A pair of sequences with identical content but different emotions is selected, referred to here as the first sequence and the second sequence. The content encoder extracts the content features of each sequence and the emotion encoder extracts the emotional features of each sequence. The emotional features of the first sequence are combined with the content features of the second sequence to reconstruct the animation of the first sequence (the content is unchanged because both sequences share it, and the emotion comes from the first sequence itself), which drives the content encoder to extract lip movement features free of emotional information. Similarly, the content features of the first sequence and the emotional features of the second sequence are combined and input into decoder D to reconstruct the second sequence. This constraint forces the content features to be free of any emotional information and the emotional features to contain no content-related interference, thus achieving preliminary separation of the two types of features. The calculation formula is as follows:

[0051] To enhance the robustness of decoupling and prevent the loss of feature information, a cyclic swapping strategy is employed. Based on the same sequence pair mentioned above, the emotional features of one sequence and the content features of the other are input into decoder D to reconstruct the first intermediate animation sequence, and the remaining emotional features and content features are input into decoder D to reconstruct the second intermediate animation sequence. The feature extraction and swapping process is then repeated on the two new sequences generated in the first round, with the goal of cyclically reconstructing the two original sequences. The calculation formula is as follows:
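The three constraints of [0049] to [0051] (self-reconstruction, overlap swapping, and cyclic swapping) can be illustrated on a toy example. The sketch assumes an L1 reconstruction error and uses hand-written "encoders" that slice a 4-value frame into a content half and an emotion half; all names and the leaky encoder are hypothetical illustrations, not the networks of this application.

```python
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def triple_losses(x1, x2, enc_c, enc_e, dec):
    """x1 and x2 share content but differ in emotion."""
    c1, e1 = enc_c(x1), enc_e(x1)
    c2, e2 = enc_c(x2), enc_e(x2)
    # 1) self-reconstruction: own content + own emotion -> same sequence
    self_loss = l1(dec(c1, e1), x1) + l1(dec(c2, e2), x2)
    # 2) overlap swap: content from the partner + own emotion -> same sequence
    cross_loss = l1(dec(c2, e1), x1) + l1(dec(c1, e2), x2)
    # 3) cycle: swap emotions, re-encode the intermediates, swap back
    y1, y2 = dec(c1, e2), dec(c2, e1)
    cycle_loss = (l1(dec(enc_c(y1), enc_e(y2)), x1)
                  + l1(dec(enc_c(y2), enc_e(y1)), x2))
    return self_loss, cross_loss, cycle_loss

def ideal_content(x):   # first two values play the role of lip/jaw content
    return x[:2]

def ideal_emotion(x):   # last two values play the role of emotion
    return x[2:]

def concat_decoder(c, e):
    return list(c) + list(e)

def leaky_content(x):   # a badly decoupled encoder: emotion leaks into content
    return [x[0] + 0.1 * x[2], x[1]]

x1 = [0.3, 0.7, 1.0, 0.0]   # same content, emotion A
x2 = [0.3, 0.7, 0.0, 1.0]   # same content, emotion B
ideal = triple_losses(x1, x2, ideal_content, ideal_emotion, concat_decoder)
leaky = triple_losses(x1, x2, leaky_content, ideal_emotion, concat_decoder)
```

With perfectly decoupled encoders all three terms vanish, while an encoder whose content code depends on the emotion channel is penalized by the swap terms. That penalty is the signal that pushes the two latent spaces apart during training.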

[0052] Once the above triple self-supervised learning training process converges (that is, when the error between the decoded animation sequence and the original input sequence reaches a preset threshold and the degree of separation of the two types of features meets the requirements), all network weights of the content encoder, the emotion encoder, and the decoder D are frozen, forming a stable frozen content encoder and frozen emotion encoder together with the frozen content latent space and the frozen emotion latent space. The cyclic consistency constraint ensures that even after multiple feature exchanges, the features extracted by the encoders still contain all the information needed to reconstruct the original sequence, effectively preventing information loss or degradation. Through the above training, the network is forced to learn how to decompose content and emotion factors into two independent subspaces.

[0053] S4. In the 3DGS process, add a Gaussian physical model to render the initialized 3D Gaussian point set, and mix the 3DGS rendering result with the facial animation sequence to generate a digital human model with facial expressions.

[0054] The core of this step is to upgrade 3D Gaussian points from purely visual primitives into physical entities with biomechanical properties through the tight coupling of physical simulation and neural rendering. At the same time, it integrates the emotions and lip movements of the facial animation to ultimately generate a digital human model that combines physical realism with delicate expressions. Referring to Figure 5, preferably, in this embodiment, a Gaussian physical model is added during the 3DGS process to render the initialized 3D Gaussian point set, specifically including:
S401. Assign a material label to each 3D Gaussian point based on a semantic segmentation model, map the material label to differentiated physical parameters through a predefined material lookup table, and assign these physical parameters to each 3D Gaussian point.
S402. Apply environmental external forces to the 3D Gaussian points that meet the conditions after assignment, and calculate the collision constraint forces generated between the 3D Gaussian points and external colliders during interaction. At the same time, dynamically construct a spring-damping network between adjacent 3D Gaussian points using a KD-tree, and calculate the spring constraint forces according to Hooke's law to maintain the model topology.
S403. Calculate the resultant of the external forces and constraint forces on each 3D Gaussian point, and update the velocity and spatial position of the 3D Gaussian point by numerical integration based on the result.
S404. Pass the updated position and shape of each 3D Gaussian point to the standard 3D Gaussian splatting renderer to generate the final image of the current frame. Repeat the above process frame by frame to obtain the 3DGS rendering result.

[0055] Specifically, the initial 3D Gaussian point set is first assigned physical attributes. A pre-trained semantic segmentation model performs semantic recognition on each 3D Gaussian point, assigning a corresponding material label based on its spatial position in the "T"-pose human body mesh and its corresponding tissue region. These material labels specifically cover key human tissue types such as skin, fat, muscle, and bone, ensuring that each Gaussian point possesses clear semantic attributes. Then, a predefined material lookup table is invoked to map each material label to differentiated physical parameters. These physical parameters include core attributes such as mass, elastic coefficient, and damping coefficient, and are specifically set according to tissue type. 3D Gaussian points in the skin region are assigned a lighter mass and a moderate elastic coefficient, Gaussian points in the muscle region are assigned a medium mass and a higher elastic coefficient, and 3D Gaussian points in the bone region are assigned a larger mass and a high-strength rigidity parameter. This semantically driven differentiated assignment ensures that the physical response of the Gaussian points strictly follows the biomechanical laws of human tissue, laying the foundation for the realism of subsequent physical simulations.
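The material lookup of [0055] amounts to a table from material label to physical parameters. In the sketch below only the ordering of the values follows the text (skin lighter and moderately elastic, muscle medium mass and stiffer, bone heavy and rigid); the numbers themselves, the value chosen for fat, and all names are hypothetical placeholders.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Material:
    mass: float        # per-point mass (arbitrary units)
    stiffness: float   # elastic coefficient used by the spring network
    damping: float     # damping coefficient
    rigid: bool = False

# Hypothetical lookup table; the application does not disclose concrete values.
MATERIALS = {
    "skin":   Material(mass=0.5, stiffness=50.0, damping=0.8),
    "fat":    Material(mass=0.7, stiffness=30.0, damping=0.9),
    "muscle": Material(mass=1.0, stiffness=120.0, damping=0.5),
    "bone":   Material(mass=3.0, stiffness=1000.0, damping=0.1, rigid=True),
}

def assign_physics(labels):
    """Map the material label of every Gaussian point to its parameters."""
    return [MATERIALS[label] for label in labels]

point_materials = assign_physics(["skin", "muscle", "bone"])
```

Keeping the parameters in a single table makes it straightforward to retune one tissue type without touching the simulation code.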

[0056] Environmental forces, including gravity and wind, are applied to qualified 3D Gaussian points after assignment. These forces only affect Gaussian points on the human body surface and soft tissues (skin, fat, and muscle). Gaussian points in rigid areas such as bones are unaffected by environmental forces due to physical parameter settings. Simultaneously, the interaction state between 3D Gaussian points and external colliders (such as objects in the virtual scene or different parts of the human body) is monitored in real time. The collision constraint force generated during contact is calculated to prevent penetration or unreasonable overlap of Gaussian points, ensuring the physical correctness of the interaction process. Based on this, a spring-damping network is dynamically constructed between adjacent 3D Gaussian points using the KD-tree algorithm, establishing implicit connections between Gaussian points and simulating the internal structural strength of human tissue. The spring constraint force is then calculated according to Hooke's Law to resist tension and compression, effectively maintaining the topological structure of the digital human model and preventing unnatural tearing or excessive deformation. The calculation expression is as follows:
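A minimal sketch of the spring-damping network of [0056]. It assumes the standard form of Hooke's law with a damping term along the spring axis, and it finds neighbors by brute force instead of with the KD-tree named in the text, which is only practical for small point sets. Function names and parameters are hypothetical.

```python
import math

def build_springs(points, radius):
    """Connect every pair closer than `radius`; rest length = initial distance.

    Brute-force O(n^2) search, used here as a stand-in for a KD-tree query.
    """
    springs = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            rest = math.dist(points[i], points[j])
            if rest <= radius:
                springs.append((i, j, rest))
    return springs

def spring_forces(points, velocities, springs, k, c):
    """Hooke's law plus axial damping, accumulated per point."""
    forces = [[0.0, 0.0, 0.0] for _ in points]
    for i, j, rest in springs:
        d = [b - a for a, b in zip(points[i], points[j])]
        length = math.sqrt(sum(x * x for x in d))
        if length == 0.0:
            continue  # coincident points: direction undefined, skip
        u = [x / length for x in d]  # unit vector from i to j
        rel_v = sum((vb - va) * ux
                    for va, vb, ux in zip(velocities[i], velocities[j], u))
        mag = k * (length - rest) + c * rel_v
        for a in range(3):
            forces[i][a] += mag * u[a]   # stretched spring pulls i toward j
            forces[j][a] -= mag * u[a]   # and j toward i
    return forces

rest_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
springs = build_springs(rest_points, radius=1.5)
stretched = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
still = [(0.0, 0.0, 0.0), (0.0, 0.0, 0.0)]
forces = spring_forces(stretched, still, springs, k=10.0, c=0.0)
```

In the example the spring is stretched by 0.5 beyond its rest length, so each endpoint is pulled toward the other with magnitude k × 0.5 = 5, which is the restoring behavior that keeps the point set from tearing.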

[0057] The resultant force on each 3D Gaussian point is calculated using a physics state solver. This resultant force is the vector sum of the environmental forces, collision constraints, and spring constraints. The calculation formula is as follows:

[0058] Numerical integration is performed based on this resultant force and Newton's second law to update the velocity and spatial position of each 3D Gaussian point, ensuring that the motion of the Gaussian point conforms to the laws of mechanics. The calculation expression for the update process is as follows:
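The force summation of [0057] and the update of [0058] can be sketched as one time step. The text only says "numerical integration", so the semi-implicit (symplectic) Euler scheme used here is an assumption, as are the function name and the sample values.

```python
def step(position, velocity, mass, forces, dt):
    """One semi-implicit Euler step for a single Gaussian point.

    forces: list of 3-vectors (environmental, collision, spring, ...)
    """
    total = [sum(axis) for axis in zip(*forces)]          # resultant force
    new_velocity = [v + f / mass * dt                     # a = F / m
                    for v, f in zip(velocity, total)]
    new_position = [x + v * dt                            # uses the NEW velocity
                    for x, v in zip(position, new_velocity)]
    return new_position, new_velocity

gravity = [0.0, -19.6, 0.0]   # weight of a 2.0-mass point under g = 9.8
spring = [4.0, 0.0, 0.0]      # some spring constraint force
pos, vel = step([0.0, 0.0, 0.0], [0.0, 0.0, 0.0], mass=2.0,
                forces=[gravity, spring], dt=0.5)
```

Updating the velocity first and then advancing the position with the new velocity tends to be more stable for spring systems than the fully explicit variant, which is why it is a common default in real-time simulation.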

[0059] Simultaneously, the shape deformation solver is activated to calculate the local stress tensor based on the local stress state (such as compression or shear) of each Gaussian point. This stress tensor is approximately derived from the resultant force borne by the spring network around the point. Then, a lightweight MLP maps the local stress tensor to a dynamic adjustment of the covariance matrix of the Gaussian point, so that the shape change of the Gaussian point accurately matches the physical stress. For example, a Gaussian point in a compressed area is flattened through covariance matrix adjustment, while a Gaussian point in a tensioned area is stretched, realizing physics-driven fine shape adjustment and giving soft tissue movement more natural secondary-motion effects.

[0060] After the physics simulation loop ends, the updated 3D Gaussian point positions, the stress-adjusted covariance matrices (shape information), and the original appearance attributes (color, transparency) are passed to the standard 3D Gaussian splatting renderer. Through rendering processes such as depth sorting, fast projection, and alpha blending, the 3DGS rendering result of the current frame is generated. The above operation is repeated frame by frame to obtain a continuous 3DGS dynamic rendering sequence. This sequence has the high-fidelity visual quality of 3DGS technology and also incorporates dynamic response characteristics that conform to physical laws, so it can realistically simulate natural physical phenomena such as soft tissue deformation and collision feedback. Based on this, the 3DGS rendering sequence and the facial animation sequence are mixed and rendered, with a timing alignment mechanism ensuring that the timestamps of the two are completely synchronized. During the fusion process, on the one hand, the physical realism of the 3DGS rendering is preserved, such as the natural stretching of facial muscles with facial expressions and the soft tissue deformation when the lips open and close. On the other hand, the lip-sync characteristics and emotional dynamic expression of the facial animation sequence are superimposed. Finally, a digital human model with both physical accuracy and delicate expressions is generated. This model can present limb and facial movements that conform to biomechanical laws and can also accurately reproduce lip movements synchronized with audio and natural emotional expressions.
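The depth sorting and alpha blending steps named in [0060] follow the usual front-to-back compositing rule of Gaussian splatting: each splat contributes its color weighted by its own alpha and by the transmittance left over from the splats in front of it. The sketch below applies that rule to a single pixel and ignores projection and the 2D Gaussian falloff; the function name and sample splats are hypothetical.

```python
def composite(splats):
    """Front-to-back alpha blending for one pixel.

    splats: list of (depth, (r, g, b), alpha) covering the pixel
    """
    color = [0.0, 0.0, 0.0]
    transmittance = 1.0                       # light not yet absorbed
    for _, rgb, alpha in sorted(splats, key=lambda s: s[0]):  # nearest first
        for ch in range(3):
            color[ch] += transmittance * alpha * rgb[ch]
        transmittance *= 1.0 - alpha
    return color, transmittance

pixel, remaining = composite([
    (2.0, (0.0, 0.0, 1.0), 1.0),   # opaque blue splat, farther away
    (1.0, (1.0, 0.0, 0.0), 0.5),   # half-transparent red splat in front
])
```

The red splat in front contributes half of its color and lets half of the light through, and the opaque blue splat behind it fills the rest, so the pixel ends up an even mix of red and blue with no transmittance remaining.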

[0061] Referring to Figure 6, the second aspect of the present invention discloses a digital human modeling system. The modeling system includes a server and a PC front-end, the server being used to implement the digital human modeling method as described in any of the preceding claims, and the PC front-end being used to display the modeling results.

[0062] Specifically, the modeling system employs a client-server (C/S) architecture, with its core consisting of a PC front-end running the Windows operating system and a Linux server. The PC front-end acts as the interactive and display client, primarily handling data acquisition and real-time interaction. The computationally intensive 3D modeling and rendering tasks are deployed on the Linux server, while real-time parameter estimation and visualization functions are handled by the Windows PC. This distributed hardware and software architecture achieves the goal of building a portable, high-quality, and interactive digital human avatar system.

[0063] A third aspect of the present invention discloses a non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that the computer program, when executed by a server, implements the digital human modeling method as described in any of the preceding claims.

[0064] The fourth aspect of the present invention discloses a computer program product, comprising a computer program, characterized in that, when the computer program is executed by a server, it implements the digital human modeling method as described in any of the preceding claims.

[0065] The above are merely specific embodiments of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. The scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A digital human modeling method, characterized in that it comprises the following steps: S1. Extract three-dimensional reconstruction data for each frame of image from a monocular video, the three-dimensional reconstruction data including the original data corresponding to each frame of image, a high-precision human body mask, the original audio signal, optimized SMPL parameters, and camera parameters; S2. Generate a 3D human body mesh in the "T" pose based on the optimized SMPL parameters and, after discretizing it into a 3D Gaussian point set, complete the initialization of the 3D Gaussian point set by combining the original data corresponding to each frame of image, the high-precision human body mask, and the camera parameters; S3. Based on the original audio signal, a text description of the desired facial shape, and a reference image, obtain a facial animation sequence that can represent lip movements and emotional dynamics; S4. In the 3DGS process, add a Gaussian physical model to render the initialized 3D Gaussian point set, and mix the 3DGS rendering result with the facial animation sequence to generate a digital human model with facial expressions.

2. The digital human modeling method according to claim 1, characterized in that extracting three-dimensional reconstruction data for each frame of image from the monocular video specifically includes: S101. Input the monocular video containing the target person into the ROMP model, and determine the camera parameters and initial SMPL human body model parameters corresponding to each frame of the monocular video; S102. Detect two-dimensional human joints in each frame of image using the AlphaPose algorithm and, with the two-dimensional human joints as the constraint target, iteratively optimize the initial SMPL human body model parameters by gradient descent to minimize the error between the projected joints of the three-dimensional model and the two-dimensional detection results, obtaining the optimized SMPL parameters; S103. Generate human body bounding boxes based on the optimized SMPL parameters, input each frame of the original image together with its human body bounding box into the SAM model, and generate a high-precision human body mask through pixel-level semantic segmentation; S104. Simultaneously extract the original pixel data corresponding to each frame of image and the original audio signal associated with the monocular video, and aggregate them to form the three-dimensional reconstruction data.
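By way of non-limiting illustration of step S102, the following sketch runs gradient descent on a 2D reprojection error. It is a deliberately reduced stand-in: instead of full SMPL pose and shape parameters, only a global translation of a fixed set of 3D joints is optimized, and a pinhole camera with normalized focal length is assumed. All names (`project`, `loss_and_grad`, `fit_translation`), the joint coordinates, the learning rate, and the step count are illustrative assumptions, not values from the application.

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def project(joints: Sequence[Vec3], t: Sequence[float], f: float = 1.0) -> List[Vec2]:
    """Pinhole projection of translated joints (normalized image coordinates)."""
    return [(f * (x + t[0]) / (z + t[2]), f * (y + t[1]) / (z + t[2]))
            for x, y, z in joints]


def loss_and_grad(joints, targets, t, f: float = 1.0):
    """Mean squared reprojection error and its analytic gradient w.r.t. t."""
    n = len(joints)
    loss, g = 0.0, [0.0, 0.0, 0.0]
    for (x, y, z), (ut, vt) in zip(joints, targets):
        X, Y, Z = x + t[0], y + t[1], z + t[2]
        ru, rv = f * X / Z - ut, f * Y / Z - vt  # residuals against 2D detections
        loss += (ru * ru + rv * rv) / n
        g[0] += 2.0 * ru * f / Z / n
        g[1] += 2.0 * rv * f / Z / n
        g[2] += -2.0 * f * (ru * X + rv * Y) / (Z * Z) / n
    return loss, g


def fit_translation(joints, targets, t0, lr: float = 1.0, steps: int = 200):
    """Plain gradient descent; returns the final translation and loss history."""
    t, history = list(t0), []
    for _ in range(steps):
        loss, g = loss_and_grad(joints, targets, t)
        history.append(loss)
        t = [ti - lr * gi for ti, gi in zip(t, g)]
    history.append(loss_and_grad(joints, targets, t)[0])
    return t, history


# Hypothetical joints and synthetic "detections" generated from a known translation.
joints = [(-0.3, 0.5, 0.0), (0.3, 0.5, 0.0), (0.0, 0.0, 0.0),
          (-0.2, -0.6, 0.0), (0.2, -0.6, 0.0)]
targets = project(joints, (0.1, -0.05, 3.0))
t_fit, history = fit_translation(joints, targets, (0.0, 0.0, 2.5))
```

In a fuller implementation the same loop would update the SMPL pose, shape, and camera parameters, typically through automatic differentiation rather than hand-written gradients.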

3. The digital human modeling method according to claim 2, characterized in that, for the generated 3D human body mesh in the "T" pose, surface sampling is performed within the sampling area defined by the high-precision human body mask to discretize the mesh into a 3D Gaussian point set; during discretization, the camera parameters are used to back-project the 2D projection coordinates of the human body mesh into 3D space to calibrate the initial positions of the Gaussian points; at the same time, appearance attributes are assigned to each Gaussian point based on the color and transparency features of the corresponding pixels in the original data, finally generating the initialized 3D Gaussian point set.
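By way of non-limiting illustration, the back-projection and appearance assignment described in this claim can be sketched as follows. The sketch assumes a pinhole camera with intrinsics `fx`, `fy`, `cx`, `cy` and a per-pixel depth value (for example, rasterized from the posed mesh); the per-pixel depth map, the dictionary layout, and the function names are illustrative assumptions, since the application does not specify them.

```python
from typing import Dict, List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def unproject(u: float, v: float, depth: float,
              fx: float, fy: float, cx: float, cy: float) -> Vec3:
    """Back-project pixel (u, v) with known depth into camera-space 3D."""
    return ((u - cx) * depth / fx, (v - cy) * depth / fy, depth)


def project(p: Vec3, fx: float, fy: float, cx: float, cy: float) -> Tuple[float, float]:
    """Inverse of unproject for a point in front of the camera."""
    x, y, z = p
    return (fx * x / z + cx, fy * y / z + cy)


def init_gaussians(mask: Sequence[Sequence[int]],
                   depth: Sequence[Sequence[float]],
                   image: Sequence[Sequence[Tuple[int, int, int]]],
                   fx: float, fy: float, cx: float, cy: float,
                   opacity: float = 1.0) -> List[Dict]:
    """Create one Gaussian per masked pixel: 3D position plus appearance."""
    points = []
    for v, row in enumerate(mask):
        for u, inside in enumerate(row):
            if inside:  # sample only inside the human body mask
                points.append({
                    "position": unproject(u, v, depth[v][u], fx, fy, cx, cy),
                    "color": image[v][u],   # taken from the original frame
                    "opacity": opacity,     # assumed initial transparency value
                })
    return points


# Tiny 2x2 example with a single masked pixel.
gaussians = init_gaussians(
    mask=[[0, 1], [0, 0]],
    depth=[[2.0, 2.0], [2.0, 2.0]],
    image=[[(0, 0, 0), (255, 0, 0)], [(0, 0, 0), (0, 0, 0)]],
    fx=1.0, fy=1.0, cx=0.0, cy=0.0,
)
```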

4. The digital human modeling method according to claim 3, characterized in that obtaining, based on the original audio signal, a text description of the desired facial shape, and a reference image, a facial animation sequence that can represent lip movements and emotional dynamics specifically includes: constructing and training a frozen decoupling network, which includes a content encoder, an emotion encoder, and a decoder D; extracting frame-level emotion features from the original audio signal with the Emotion2Vec model, transcribing the original audio signal into text with the Whisper model, and extracting semantic features from the text with the RoBERTa pre-trained language model; using a cross-attention mechanism to let the audio and text features query and complement each other, and outputting a predicted emotion intensity curve aligned frame by frame with the input audio; scaling, frame by frame, the emotion embedding generated by the frozen emotion encoder with the predicted emotion intensity of each frame to produce a basic emotion feature sequence, which is then projected onto the frozen emotion encoder to generate the final emotion feature sequence; extracting the acoustic features of the original audio signal with the Wav2Vec2.0 model and projecting them onto the frozen content encoder to generate lip-sync parameters; inputting the reference image into the CLIP image encoder to extract visual features, which are linearly mapped to obtain a visual guide code, inputting the text description of the desired facial shape into the CLIP text encoder to extract semantic features, which are linearly mapped to obtain a text guide code, and fusing the two to obtain a multimodal guidance signal; and inputting the lip-sync parameters, the final emotion feature sequence, and the multimodal guidance signal together into the motion decoder, which decodes them into a facial animation sequence that can represent lip-sync and dynamic changes in emotion.
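By way of non-limiting illustration, two operations named in this claim can be sketched in isolation: a cross-attention step in which audio features query text features, and the frame-by-frame scaling of an emotion embedding by a predicted intensity curve. The sketch uses single-head scaled dot-product attention without learned projections, which is a simplifying assumption; the feature vectors and the names `cross_attention` and `scale_by_intensity` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(xs: List[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Each query row attends over all key rows and mixes the value rows."""
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, values))
                    for c in range(len(values[0]))])
    return out


def scale_by_intensity(embedding: List[float], intensity: List[float]) -> Matrix:
    """One scaled copy of the emotion embedding per frame of the intensity curve."""
    return [[a * e for e in embedding] for a in intensity]


# Hypothetical features: 3 audio frames query 2 text tokens.
audio_feats = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_feats = [[1.0, 0.0], [0.0, 1.0]]
fused = cross_attention(audio_feats, text_feats, text_feats)

# Hypothetical 2-D emotion embedding scaled by a 3-frame intensity curve.
emotion_sequence = scale_by_intensity([0.8, -0.2], [0.0, 0.5, 1.0])
```

Running the attention in the other direction (text features as queries, audio features as keys and values) would give the complementary half of the mutual querying that the claim describes.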

5. The digital human modeling method according to claim 4, characterized in that training the frozen decoupling network specifically includes: inputting various types of speech segments and seven emotion labels into the EmoFace model to generate a facial animation dataset in which content and emotion can be arbitrarily combined; inputting the facial animation dataset into the decoupling network consisting of the content encoder, the emotion encoder, and decoder D, and training it with a triple self-supervised learning strategy; and, after training, freezing the weights of the content encoder, the emotion encoder, and decoder D to form the frozen content encoder and the frozen emotion encoder.

6. The digital human modeling method according to claim 5, characterized in that training with the triple self-supervised learning strategy specifically includes: for an input sequence, extracting pronunciation-related content features through the content encoder and emotion-related features through the emotion encoder, and inputting the content features and emotion features into decoder D for reconstruction to obtain a facial animation sequence consistent with the original input sequence; selecting two facial animation sequences with the same content but different emotions as a sequence pair, extracting the content features of each of the two sequences with the content encoder and the emotion features of each with the emotion encoder; inputting the content features of the first sequence and its own emotion features into decoder D to reconstruct a facial animation sequence consistent with the first sequence, and inputting the content features of the second sequence and its own emotion features into decoder D to reconstruct a facial animation sequence consistent with the second sequence; inputting the content features of the first sequence and the emotion features of the second sequence into decoder D to generate a first intermediate animation sequence, and inputting the content features of the second sequence and the emotion features of the first sequence into decoder D to generate a second intermediate animation sequence; and repeating the above feature extraction steps for the first and second intermediate animation sequences and reconstructing the original first sequence and the original second sequence respectively, this cycle-consistency constraint avoiding the loss of key information during feature exchange.
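By way of non-limiting illustration, the three loss terms of this claim can be sketched with the encoders and decoder passed in as plain callables. The claim does not state how the re-extracted features of the two intermediate sequences are recombined; the sketch assumes a second swap (content from one intermediate, emotion from the other), and uses mean squared error as the reconstruction measure. Both choices, the toy encoders, and all names are illustrative assumptions.

```python
from typing import Callable, Dict, List

Seq = List[float]
Encoder = Callable[[Seq], Seq]
Decoder = Callable[[Seq, Seq], Seq]


def mse(a: Seq, b: Seq) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def triple_loss(x: Seq, seq_a: Seq, seq_b: Seq,
                enc_c: Encoder, enc_e: Encoder, dec: Decoder) -> Dict[str, float]:
    """Self-reconstruction, pair reconstruction, and cycle-consistency terms."""
    # 1) Self-reconstruction of a single input sequence.
    l_self = mse(dec(enc_c(x), enc_e(x)), x)

    # 2) Pair with the same content but different emotions: each sequence
    #    is reconstructed from its own content and emotion features.
    c_a, e_a = enc_c(seq_a), enc_e(seq_a)
    c_b, e_b = enc_c(seq_b), enc_e(seq_b)
    l_pair = mse(dec(c_a, e_a), seq_a) + mse(dec(c_b, e_b), seq_b)

    # 3) Swap features to build intermediates, re-encode them, and swap back
    #    (assumed recombination) so the originals must be recovered.
    mid_1 = dec(c_a, e_b)
    mid_2 = dec(c_b, e_a)
    rec_a = dec(enc_c(mid_1), enc_e(mid_2))
    rec_b = dec(enc_c(mid_2), enc_e(mid_1))
    l_cycle = mse(rec_a, seq_a) + mse(rec_b, seq_b)

    return {"self": l_self, "pair": l_pair, "cycle": l_cycle,
            "total": l_self + l_pair + l_cycle}


# Toy, perfectly disentangled stand-ins: the first half of a sequence is
# "content" and the second half is "emotion" (purely hypothetical).
toy_enc_c: Encoder = lambda s: s[: len(s) // 2]
toy_enc_e: Encoder = lambda s: s[len(s) // 2:]
toy_dec: Decoder = lambda c, e: c + e

seq_a = [1.0, 2.0, 0.5, 0.5]
seq_b = [1.0, 2.0, -0.5, 0.0]  # same content half, different emotion half
losses = triple_loss(seq_a, seq_a, seq_b, toy_enc_c, toy_enc_e, toy_dec)
```

With trainable networks in place of the toy callables, the `total` value would be the quantity minimized during training.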

7. The digital human modeling method according to claim 6, characterized in that adding a Gaussian physical model in the 3DGS process to render the initialized 3D Gaussian point set specifically includes: assigning a material label to each 3D Gaussian point based on a semantic segmentation model, mapping the material labels to differentiated physical parameters through a predefined material lookup table, and assigning the physical parameters to each 3D Gaussian point; after assignment, applying environmental external forces to the 3D Gaussian points that meet the conditions, and calculating the collision constraint forces generated by the interaction between the 3D Gaussian points and external colliders; at the same time, dynamically constructing a spring-damper network between adjacent 3D Gaussian points through a KD-tree, and calculating the spring constraint forces according to Hooke's law to maintain the model topology; calculating the resultant of the external forces and constraint forces acting on each 3D Gaussian point, and updating the velocity and spatial position of each 3D Gaussian point by numerical integration based on the calculation results; and passing the updated positions and shapes of the 3D Gaussian points to the standard 3D Gaussian splatting renderer to generate the final image of the current frame, the above process being repeated frame by frame to obtain the 3DGS rendering result.
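By way of non-limiting illustration, one simulation step of the spring-damper network described in this claim can be sketched as follows. Several simplifications are assumed and should be read as such: neighbor search uses a brute-force pair scan in place of the KD-tree, a single stiffness `k` and damping `c` replace the per-material lookup table, the collider is a horizontal ground plane handled by position clamping rather than an explicit constraint force, and integration is semi-implicit Euler. All names and numeric values are hypothetical.

```python
import math
from typing import List, Optional, Sequence, Tuple

Vec = List[float]
Spring = Tuple[int, int, float]  # (index i, index j, rest length)


def build_springs(pos: Sequence[Vec], radius: float) -> List[Spring]:
    """Connect every pair of points closer than radius (brute-force stand-in for a KD-tree)."""
    springs = []
    for i in range(len(pos)):
        for j in range(i + 1, len(pos)):
            d = math.dist(pos[i], pos[j])
            if d <= radius:
                springs.append((i, j, d))
    return springs


def step(pos: Sequence[Vec], vel: Sequence[Vec], mass: Sequence[float],
         springs: Sequence[Spring], k: float, c: float, dt: float,
         gravity: Sequence[float] = (0.0, -9.8, 0.0),
         ground_y: Optional[float] = None) -> Tuple[List[Vec], List[Vec]]:
    """Advance all points by one time step and return new positions and velocities."""
    n = len(pos)
    force = [[mass[i] * g for g in gravity] for i in range(n)]  # external force
    for i, j, rest in springs:
        delta = [pj - pi for pi, pj in zip(pos[i], pos[j])]
        dist = math.sqrt(sum(d * d for d in delta)) or 1e-9
        direction = [d / dist for d in delta]
        rel_speed = sum((vj - vi) * u for vi, vj, u in zip(vel[i], vel[j], direction))
        magnitude = k * (dist - rest) + c * rel_speed  # Hooke's law plus damping
        for a in range(3):
            force[i][a] += magnitude * direction[a]
            force[j][a] -= magnitude * direction[a]
    new_pos, new_vel = [], []
    for i in range(n):
        v = [vel[i][a] + dt * force[i][a] / mass[i] for a in range(3)]
        p = [pos[i][a] + dt * v[a] for a in range(3)]
        if ground_y is not None and p[1] < ground_y:  # simple ground-plane collider
            p[1] = ground_y
            v[1] = 0.0
        new_pos.append(p)
        new_vel.append(v)
    return new_pos, new_vel


# Two points joined by a stretched spring, no gravity: they pull together.
positions = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]]
velocities = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
positions, velocities = step(positions, velocities, [1.0, 1.0], [(0, 1, 0.5)],
                             k=10.0, c=0.0, dt=0.01, gravity=(0.0, 0.0, 0.0))
```

The updated positions returned by `step` correspond to the values that the claim passes on to the splatting renderer for the current frame.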

8. A digital human modeling system, characterized in that the modeling system includes a server and a PC front-end, the server being used to implement the digital human modeling method as described in any one of claims 1 to 7, and the PC front-end being used to display the modeling results.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a server, it implements the digital human modeling method as described in any one of claims 1 to 7.

10. A computer program product, comprising a computer program, characterized in that, when the computer program is executed by a server, it implements the digital human modeling method as described in any one of claims 1 to 7.