Three-dimensional human body reconstruction method and device

By enhancing feature extraction through the StyleUNet architecture and multi-scale sinusoidal activation modules, the spectral bias problem caused by the ReLU activation function is solved, achieving high-fidelity 3D human reconstruction and improving the reconstruction quality of high-frequency details.

CN122199761A · Pending · Publication Date: 2026-06-12 · INST OF SEMICONDUCTORS - CHINESE ACAD OF SCI

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
INST OF SEMICONDUCTORS - CHINESE ACAD OF SCI
Filing Date
2026-03-23
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

In existing 3D human body reconstruction methods, the spectral bias associated with the ReLU activation function leads to insufficient high-frequency detail representation. The reconstruction results are overly smooth and fail to accurately recover high-frequency information such as clothing wrinkles, human body edges, and local texture changes.

Method used

The StyleUNet architecture is used to extract intermediate features, and high-frequency features are enhanced by a multi-scale sinusoidal activation module. Combined with a Gaussian attribute prediction head network, high-fidelity 3D human video frames are generated.

Benefits of technology

It significantly improves the prediction accuracy of 3D Gaussian properties and the reconstruction quality of high-frequency details, and enhances the high-frequency detail recovery capability of the reconstruction results.

✦ Generated by Eureka AI based on patent content.

Figure CN122199761A_ABST

Abstract

The embodiment of the application relates to the technical field of three-dimensional human body reconstruction, and provides a three-dimensional human body reconstruction method and device, which comprises the following steps: acquiring an original video, and extracting original video data from the original video, wherein the original video data comprises posture parameters, image frames, camera parameters and a human body mask; inputting the posture parameters into a feature extraction network, and extracting intermediate features containing space semantics and posture correlation through the feature extraction network; performing high-frequency feature enhancement on the intermediate features to obtain enhanced features; predicting a three-dimensional Gaussian attribute parameter set based on the enhanced features; and reconstructing a three-dimensional human body video frame corresponding to the original video based on the attribute parameter set, the image frames, the camera parameters and the human body mask. The high-fidelity three-dimensional human body reconstruction is realized, the prediction accuracy of the three-dimensional Gaussian attribute is significantly improved, and the reconstruction quality of high-frequency details is improved.

Description

Technical Field

[0001] This invention relates to the technical field of animatable 3D human body reconstruction, and in particular to a 3D human body reconstruction method and apparatus.

Background Technology

[0002] In the field of animatable 3D human reconstruction, methods based on 3D Gaussian Splatting typically employ convolutional neural networks to predict Gaussian attributes from pose-conditioned inputs. These attributes include the Gaussian center position, covariance, color, opacity, and spherical harmonic (SH) coefficients.

[0003] Existing methods commonly use the ReLU or Leaky ReLU activation function for nonlinear modeling in the feature mapping. Although these activation functions train stably, they are inherently biased toward low frequencies: they fit smooth variations well but represent high-frequency structures poorly.

[0004] In human body modeling scenarios, much of the important visual information is high-frequency signal, such as clothing wrinkles, human body edges, local texture changes, and joint-area details. With traditional activation functions, however, the ability to represent high-frequency details is insufficient, Gaussian attribute predictions tend to be over-smoothed, and details blur when the pose changes. The root cause is that convolutional neural networks (CNNs) exhibit spectral bias, so the network learns low-frequency information first.

[0005] Therefore, how to enhance the high-frequency representation ability of pose features and improve the prediction accuracy of three-dimensional Gaussian attributes has become an urgent problem to be solved in human body modeling scenarios.

Summary of the Invention

[0006] This invention provides a three-dimensional human body reconstruction method and apparatus to address the shortcomings of existing three-dimensional human body reconstruction methods based on the ReLU activation function, such as spectral bias, insufficient high-frequency detail representation, and overly smooth reconstruction results. It achieves high-fidelity three-dimensional human body reconstruction, significantly improves the prediction accuracy of three-dimensional Gaussian properties, and enhances the reconstruction quality of high-frequency details.

[0007] This invention provides a three-dimensional human body reconstruction method, comprising:
acquiring an original video and extracting original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters and a human body mask;
inputting the pose parameters into a feature extraction network, and extracting intermediate features containing spatial semantics and pose-related information through the feature extraction network;
performing high-frequency feature enhancement on the intermediate features to obtain enhanced features;
predicting a set of attribute parameters of a three-dimensional Gaussian based on the enhanced features;
reconstructing the three-dimensional human body video frames corresponding to the original video based on the set of attribute parameters, the image frames, the camera parameters, and the human body mask.

[0008] In one possible implementation, the method further includes:
splitting the original video into a continuous sequence of image frames;
performing pose estimation on the image frame sequence and extracting the pose parameters, wherein the pose parameters include at least one of human joint positions, bone orientations and SMPL model parameters;
extracting the camera parameters from the image frame sequence using a camera calibration or structure-from-motion algorithm, wherein the camera parameters include an intrinsic parameter matrix and an extrinsic parameter matrix, the intrinsic parameter matrix includes the focal length and principal point coordinates, and the extrinsic parameter matrix includes a rotation matrix and a translation vector;
extracting the human body mask from the image frame sequence using a human body segmentation model or a threshold segmentation algorithm, wherein the human body mask is a binary mask used to identify the human body region and the background region in the image frame.

[0009] In one possible implementation, the feature extraction network adopts the StyleUNet architecture, which includes multiple convolutional layers, and the method further includes:
encoding the pose parameters into a pose feature map, wherein the pose feature map includes human joint heatmaps and/or skeletal orientation vectors;
inputting the pose feature map into the downsampling path of the StyleUNet architecture, and extracting multi-scale spatial features layer by layer through the multiple convolutional layers, wherein the ReLU activation function is used for nonlinear transformation in the convolutional layers;
during downsampling, passing shallow features in the multi-scale spatial features to the corresponding upsampling layers through skip connections, preserving detail information;
inputting the multi-scale spatial features into the upsampling path of the StyleUNet architecture, restoring the resolution of the pose feature map through deconvolution or interpolation operations, and fusing the shallow features from the skip connections to obtain upsampled features;
fusing the pose feature map with the upsampled features to generate intermediate features containing spatial semantics and pose-related information.

[0010] In one possible implementation, the method further includes: The intermediate features are input to a multi-scale sinusoidal activation module. The multi-scale sinusoidal activation module uses a multi-scale sinusoidal activation function to enhance the high-frequency features of the intermediate features to obtain enhanced features. The multi-scale sinusoidal activation module includes two cascaded multi-scale sinusoidal activation layers. The output of the first multi-scale sinusoidal activation layer is used as the input of the second multi-scale sinusoidal activation layer to recover high-frequency detail information at different scales in a hierarchical manner.

[0011] In one possible implementation, the method further includes: The enhanced features are input into a Gaussian attribute prediction head network, which predicts a set of attribute parameters for a 3D Gaussian. The set of attribute parameters includes the Gaussian center position, covariance matrix, color, opacity, and spherical harmonic coefficients.

[0012] In one possible implementation, the method further includes:
projecting, based on the camera parameters, the Gaussian center position in the set of attribute parameters of the three-dimensional Gaussian onto the two-dimensional image plane;
determining the rendering area based on the human body mask, and performing Gaussian splatting rendering only within the human body area identified by the human body mask;
using 3D Gaussian Splatting rendering technology, calculating the color value and opacity of each pixel from the attribute parameter set of the 3D Gaussian, and generating the reconstructed 3D human video frames corresponding to the original video.

[0013] In one possible implementation, the method further includes:
calculating the reconstruction loss between the 3D human body video frame and the image frame;
optimizing, based on the reconstruction loss, the parameters of the feature extraction network, the multi-scale sinusoidal activation module, and the Gaussian attribute prediction head network through backpropagation.

[0014] The present invention also provides a three-dimensional human body reconstruction device, comprising the following modules:
a data extraction module, used to acquire the original video and extract the original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters and a human body mask;
a feature extraction module, used to input the pose parameters into the feature extraction network and extract intermediate features containing spatial semantics and pose-related information through the feature extraction network;
a feature enhancement module, used to perform high-frequency feature enhancement on the intermediate features to obtain enhanced features;
an attribute prediction module, used to predict a set of attribute parameters of a three-dimensional Gaussian based on the enhanced features;
a reconstruction module, used to reconstruct the three-dimensional human video frames corresponding to the original video based on the set of attribute parameters, the image frames, the camera parameters, and the human body mask.

[0015] The present invention also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the computer program to implement the three-dimensional human body reconstruction method as described above.

[0016] The present invention also provides a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the three-dimensional human body reconstruction method as described above.

[0017] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the three-dimensional human body reconstruction method as described above.

[0018] The present invention provides a 3D human body reconstruction method and apparatus, which acquires an original video and extracts original video data from it. The original video data includes pose parameters, image frames, camera parameters, and a human body mask. The pose parameters are input into a feature extraction network, which extracts intermediate features containing spatial semantics and pose-related information. These intermediate features are then enhanced with high-frequency features to obtain enhanced features. A set of 3D Gaussian attribute parameters is predicted based on the enhanced features. Finally, the 3D human body video frames corresponding to the original video are reconstructed based on the attribute parameter set, image frames, camera parameters, and the human body mask. Compared to existing 3D human body reconstruction methods based on the ReLU activation function, which suffer from spectral bias, insufficient high-frequency detail representation, and overly smooth reconstruction results, this solution achieves high-fidelity 3D human body reconstruction by applying a high-frequency enhancement strategy to the intermediate features. This significantly improves the prediction accuracy of 3D Gaussian attributes and enhances the reconstruction quality of high-frequency details.

Attached Figure Description

[0019] To more clearly illustrate the technical solutions in this invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0020] Figure 1 is the first flowchart of the three-dimensional human body reconstruction method provided by the present invention.

[0021] Figure 2 is the second flowchart of the three-dimensional human body reconstruction method provided by the present invention.

[0022] Figure 3 is a schematic diagram of the structure of the multi-scale sinusoidal activation module provided by the present invention.

[0023] Figure 4 is a schematic diagram of the structure of the three-dimensional human body reconstruction device provided by the present invention.

[0024] Figure 5 is a schematic diagram of the structure of the electronic device provided by the present invention.

Detailed Implementation

[0025] To make the objectives, technical solutions, and advantages of this invention clearer, the technical solutions of this invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some, not all, of the embodiments of this invention. All other embodiments obtained by those skilled in the art based on the embodiments of this invention without creative effort are within the scope of protection of this invention.

[0026] To facilitate understanding of the embodiments of the present invention, further explanations and descriptions will be provided below with reference to the accompanying drawings and specific embodiments. These embodiments do not constitute a limitation on the embodiments of the present invention.

[0027] Figure 1 is the first flowchart of the three-dimensional human body reconstruction method provided by this invention. As shown in Figure 1, the method includes the following steps:
S11. Obtain the original video and extract the original video data from the original video.

[0028] A raw video containing human movement is acquired, and raw video data is extracted from it. The raw video data includes:
Pose parameters: describe the geometric structure of the human pose and include at least one of human joint positions, bone orientations, and SMPL model parameters;
Image frames: a sequence of consecutive image frames obtained by splitting the original video, which serves as the target reference for reconstruction and the background for rendering;
Camera parameters: the intrinsic and extrinsic parameters describing the camera's imaging geometry, including the intrinsic matrix (focal length, principal point coordinates) and the extrinsic matrix (rotation matrix, translation vector);
Human body mask: a binary mask that identifies the human body region and the background region in an image frame, limiting the effective area for subsequent rendering.

[0029] S12. Input the pose parameters into the feature extraction network, and extract intermediate features containing spatial semantics and pose-related information through the feature extraction network.

[0030] Pose parameters are input into a feature extraction network, which extracts intermediate features containing spatial semantics and pose-related information. Specifically, the feature extraction network adopts the StyleUNet architecture, which includes downsampling paths, upsampling paths, and skip connections. The pose parameters are first encoded into pose feature maps, which are then processed layer by layer through multiple convolutional layers to extract multi-scale spatial features. Finally, skip connections are used to fuse shallow detail features and deep semantic features to generate intermediate features containing human structural information.

[0031] S13. The intermediate features are enhanced with high-frequency features to obtain enhanced features.

[0032] The intermediate features are input into the multi-scale sinusoidal activation module for high-frequency feature enhancement, resulting in enhanced features. The multi-scale sinusoidal activation module employs a multi-scale sinusoidal activation function, constructing a nonlinear mapping through the combination of multi-frequency sinusoidal functions. This enables the network to possess frequency response capabilities and high-frequency feature modeling capabilities, effectively recovering high-frequency detail information suppressed by the traditional ReLU activation function, including clothing wrinkles, human body edges, local texture variations, and joint region details.

[0033] S14. Based on the enhanced features, predict the set of attribute parameters of the three-dimensional Gaussian.

[0034] The enhanced features are input into the Gaussian attribute prediction head network to predict the set of attribute parameters of a 3D Gaussian. The set of attribute parameters includes:
Gaussian center position (μ): describes the position of the Gaussian ellipsoid in three-dimensional space;
Covariance matrix (Σ): describes the shape and orientation of the Gaussian ellipsoid;
Color (c): describes the basic color information of the Gaussian ellipsoid;
Opacity (α): describes the transparency of the Gaussian ellipsoid;
Spherical harmonic coefficients (SH): describe how the color of the Gaussian ellipsoid changes across viewpoints, enabling viewpoint-dependent appearance modeling.

[0035] S15. Reconstruct the three-dimensional human video frame corresponding to the original video based on the attribute parameter set, image frame, camera parameters and human mask.

[0036] Based on the set of attribute parameters, image frames, camera parameters, and human body mask, the three-dimensional human body video frames corresponding to the original video are reconstructed using three-dimensional Gaussian splatting rendering technology.

[0037] Specifically, the center position of each three-dimensional Gaussian is projected onto the two-dimensional image plane based on the camera parameters; the rendering area is determined based on the human body mask, and Gaussian splatting rendering is performed only within the human body area; by projecting the three-dimensional Gaussians onto the image plane and performing α-blending, the color value and opacity of each pixel are calculated to generate high-fidelity three-dimensional human body video frames, achieving a dynamic human body reconstruction consistent with the original video.

[0038] The 3D human body reconstruction method provided by this invention involves acquiring an original video and extracting original video data from it. The original video data includes pose parameters, image frames, camera parameters, and a human body mask. The pose parameters are input into a feature extraction network, which extracts intermediate features containing spatial semantics and pose-related information. These intermediate features are then enhanced with high-frequency features to obtain enhanced features. A set of 3D Gaussian attribute parameters is predicted based on the enhanced features. Finally, the 3D human body video frames corresponding to the original video are reconstructed based on the attribute parameter set, image frames, camera parameters, and the human body mask. Compared to existing 3D human body reconstruction methods based on the ReLU activation function, which suffer from spectral bias, insufficient high-frequency detail representation, and overly smooth reconstruction results, this method achieves high-fidelity 3D human body reconstruction by enhancing intermediate features with high-frequency domain enhancement strategies. This significantly improves the prediction accuracy of 3D Gaussian attributes and enhances the reconstruction quality of high-frequency details.

[0039] Figure 2 is the second flowchart of the three-dimensional human body reconstruction method provided by this invention. As shown in Figure 2, the method includes the following steps:
S21. The original video is split into a continuous sequence of image frames.

[0040] The original video is split into a continuous sequence of image frames, which contains multiple consecutive images in the time dimension, for subsequent pose estimation, camera parameter extraction, and human mask extraction.

[0041] S22. Extract pose parameters, camera parameters, and human body mask from the image frame sequence.

[0042] Pose parameters, camera parameters, and the human body mask are extracted from the image frame sequence. Specifically, this includes:
Pose parameter extraction: pose estimation is performed on the image frame sequence, and pose parameters are extracted. The pose parameters include at least one of human joint positions, bone orientations, and SMPL (Skinned Multi-Person Linear) model parameters, and describe the geometric structure of the human pose.
Camera parameter extraction: camera parameters are extracted from the image frame sequence using camera calibration or a Structure from Motion (SfM) algorithm. The camera parameters include intrinsic and extrinsic matrices. The intrinsic matrix includes the focal length and principal point coordinates, while the extrinsic matrix includes a rotation matrix and a translation vector; together they describe the imaging geometry of the camera.
Human body mask extraction: the human body mask is extracted from the image frame sequence using a human body segmentation model or a threshold segmentation algorithm. The human body mask is a binary mask that identifies the human body region and the background region in the image frame, limiting the effective area for subsequent rendering.

[0043] S23. Encode the pose parameters into a pose feature map, wherein the pose feature map includes human joint heatmaps and/or skeletal orientation vectors.

[0044] The pose parameters are encoded into pose feature maps, which include human joint heatmaps and/or skeletal orientation vectors. These pose feature maps transform sparse pose parameters into dense feature representations aligned with the image space. Specifically, this includes:
Joint heatmap encoding: for each human joint, a Gaussian heatmap is generated centered at the joint's two-dimensional coordinates. The peak of the heatmap corresponds to the joint position, and the heatmap value reflects the confidence of the joint. This yields one joint heatmap channel per joint at the same resolution as the input image.
Skeletal orientation vector encoding: the orientation vector between adjacent joints is calculated, and the bone orientation information is encoded into a two-dimensional vector field, generating a bone orientation vector feature map.
Feature map stacking and fusion: the joint heatmaps and the skeletal orientation vector feature maps are stacked along the channel dimension to form a multi-channel pose feature map. This converts sparse joint coordinates into a dense pixel-level feature representation and provides structured input for subsequent convolutional network processing.
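The encoding in [0044] can be illustrated with a minimal sketch. It assumes joints are already given as 2D pixel coordinates; the function names, the Gaussian width `sigma`, and the `radius` that limits how far the bone direction is painted around a bone are hypothetical choices, not values from the patent.

```python
import math


def joint_heatmap(cx, cy, width, height, sigma=2.0):
    """Return a height x width Gaussian heatmap peaked at the joint (cx, cy)."""
    return [
        [math.exp(-((x - cx) ** 2 + (y - cy) ** 2) / (2.0 * sigma ** 2))
         for x in range(width)]
        for y in range(height)
    ]


def bone_orientation_field(parent, child, width, height, radius=1.5):
    """Return (dx, dy) channels holding the unit bone direction near the bone."""
    (px, py), (qx, qy) = parent, child
    length = math.hypot(qx - px, qy - py)
    field_x = [[0.0] * width for _ in range(height)]
    field_y = [[0.0] * width for _ in range(height)]
    if length == 0.0:
        return field_x, field_y  # degenerate bone: no direction to encode
    ux, uy = (qx - px) / length, (qy - py) / length
    for y in range(height):
        for x in range(width):
            # Closest point on the parent->child segment to pixel (x, y).
            t = min(max((x - px) * ux + (y - py) * uy, 0.0), length)
            if math.hypot(x - (px + t * ux), y - (py + t * uy)) <= radius:
                field_x[y][x], field_y[y][x] = ux, uy
    return field_x, field_y


def encode_pose(joints, bones, width, height):
    """Stack joint heatmaps and bone orientation fields along the channel axis."""
    channels = [joint_heatmap(jx, jy, width, height) for jx, jy in joints]
    for a, b in bones:
        channels.extend(bone_orientation_field(joints[a], joints[b], width, height))
    return channels
```

For J joints and B bones this produces J + 2B channels, each at the input resolution, which matches the "stacked along the channel dimension" description.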

[0045] S24. Input the pose feature map into the downsampling path of the StyleUNet architecture, and extract multi-scale spatial features layer by layer through the multi-layer convolutional layers.

[0046] The pose feature map is input into the downsampling path of the StyleUNet architecture, and multi-scale spatial features are extracted layer by layer through multiple convolutional layers. StyleUNet includes multiple convolutional layers, which use the ReLU activation function for non-linear transformation to extract basic human structural features and semantic information.

[0047] S25. During the downsampling process, shallow features in the multi-scale spatial features are transferred to the corresponding upsampling layer through skip connections, preserving detailed information.

[0048] During the downsampling process, shallow features in multi-scale spatial features are passed to the corresponding upsampling layer through skip connections, preserving detailed information and preventing information loss in deep networks.

[0049] S26. Input the multi-scale spatial features into the upsampling path of the StyleUNet architecture, restore the resolution of the pose feature map through deconvolution or interpolation operations, and fuse the shallow features from the skip connections to obtain upsampling features.

[0050] Multi-scale spatial features are input into the upsampling path of the StyleUNet architecture. The resolution of the pose feature map is restored through deconvolution or interpolation operations, and shallow features from skip connections are fused to obtain upsampling features, thus achieving hierarchical fusion of multi-scale features.

[0051] In each upsampling stage, instance normalization or adaptive instance normalization (AdaIN) is used to stylize and normalize the features, thereby enhancing the stability and diversity of feature representation.
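As an illustration of the normalization mentioned in [0051], the sketch below applies instance normalization to a single feature channel and then the AdaIN affine step. How the style scale and shift are produced is not described in the text, so `style_scale` and `style_shift` are treated here as given numbers (an assumption).

```python
import math


def instance_norm(channel, eps=1e-5):
    """Normalize one feature channel to zero mean and (near) unit variance."""
    values = [v for row in channel for v in row]
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    inv_std = 1.0 / math.sqrt(var + eps)
    return [[(v - mean) * inv_std for v in row] for row in channel]


def adain(channel, style_scale, style_shift, eps=1e-5):
    """AdaIN: instance-normalize, then rescale and shift with style statistics."""
    normalized = instance_norm(channel, eps)
    return [[style_scale * v + style_shift for v in row] for row in normalized]
```

After `adain`, the channel's mean equals `style_shift` and its standard deviation is approximately `style_scale`, which is how the style statistics are imposed on the features.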

[0052] S27. Fuse the pose feature map with the upsampled features to generate intermediate features that contain spatial semantics and pose-related information.

[0053] By fusing pose feature maps and upsampled features, intermediate features containing spatial semantics and pose-related information are generated. These intermediate features have the same spatial resolution as the input image, containing rich semantic and spatial structural information about human pose, accurately representing the pose position and approximate shape of various parts of the human body. However, due to the inherently low-frequency bias of the ReLU activation function and the spectral bias characteristic of convolutional neural networks, the intermediate features suffer some loss in high-frequency detail information (such as clothing wrinkles, fine textures, and edge sharpness), requiring compensation and enhancement through subsequent multi-scale sinusoidal activation modules.

[0054] S28. The intermediate features are input to the multi-scale sinusoidal activation module, which uses a multi-scale sinusoidal activation function to enhance the intermediate features with high frequency, thereby obtaining enhanced features.

[0055] The intermediate features are input into the multi-scale sinusoidal activation module, which uses a multi-scale sinusoidal activation function to enhance the high-frequency content of the intermediate features, resulting in enhanced features. Figure 3 is a schematic diagram of the structure of the multi-scale sinusoidal activation module provided by the present invention. The multi-scale sinusoidal activation module includes two multi-scale sinusoidal activation layers connected in series. The output of the first multi-scale sinusoidal activation layer is used as the input of the second multi-scale sinusoidal activation layer to recover high-frequency detail information at different scales in a layered manner.

[0056] In this embodiment, the Multi-scale Sinusoidal Activation (MSA) is introduced as a local spectral refinement module, applied only to the high-frequency feature enhancement stage. Unlike sinusoidal activation-based networks designed for implicit neural representations, MSA does not replace the activation function of the entire network, nor does it modify the backbone architecture, convolutional layers, or the overall training dynamics of StyleUNet. Instead, it selectively enriches the spectral bandwidth of intermediate feature maps, making them fully compatible with pose-conditional Gaussian predictors based on CNNs, while maintaining their stability and inductive bias.

[0057] Specifically, the mathematical expression of the multi-scale sinusoidal activation function MSA(x) is: [formula not reproduced in this text], where the expression is defined in terms of the multi-scale frequency parameters, the learnable weight coefficients, and K, the number of frequency scales.

[0058] By constructing a nonlinear mapping through the combination of multi-frequency sine functions, the network is endowed with frequency response capability and high-frequency feature modeling capability, while avoiding the instability problem of traditional single-frequency sine networks (such as SIREN) during training.
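The exact MSA formula is not available in this text. Purely as an assumption consistent with the description (a combination of sine functions at K frequency scales with learnable weights), the sketch below uses a weighted sum of sines, sum over k of weights[k] * sin(freqs[k] * x), applied elementwise, and cascades two such layers as in [0055]. The symbol names, the elementwise application, and the absence of any convolution between the two layers are all hypothetical.

```python
import math


def msa(x, freqs, weights):
    """Assumed form: sum_k weights[k] * sin(freqs[k] * x) over K frequency scales."""
    if len(freqs) != len(weights):
        raise ValueError("freqs and weights must have the same length K")
    return sum(w * math.sin(f * x) for f, w in zip(freqs, weights))


def msa_module(feature_map, layer1, layer2):
    """Two cascaded MSA layers; each layer is a (freqs, weights) pair."""
    def apply(fmap, params):
        freqs, weights = params
        return [[msa(v, freqs, weights) for v in row] for row in fmap]

    # The output of the first layer is the input of the second layer.
    return apply(apply(feature_map, layer1), layer2)
```

In a trained network the weights (and possibly the frequencies) would be learned parameters; here they are plain lists so that the mapping can be inspected directly.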

[0059] S29. Input the enhanced features into the Gaussian attribute prediction head network, and predict the set of attribute parameters of a three-dimensional Gaussian through the Gaussian attribute prediction head network.

[0060] The enhanced features are input into the Gaussian attribute prediction head network, which predicts the set of attribute parameters of the 3D Gaussian ellipsoid. The set of attribute parameters includes the Gaussian center position μ, the covariance matrix Σ, the color c, the opacity α, and the coefficients of the spherical harmonic function (SH), which are used to fully describe the geometry and appearance attributes of each Gaussian ellipsoid in 3D space.
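The text does not say how the prediction head turns raw network outputs into valid attributes. One common parameterization in 3D Gaussian Splatting, shown here only as an assumption, builds the covariance as Σ = R S Sᵀ Rᵀ from a predicted log-scale vector and a rotation quaternion, and squashes opacity with a sigmoid. All function names are hypothetical.

```python
import math


def sigmoid(z):
    """Map a raw opacity logit to the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def quat_to_rotation(q):
    """Convert a quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q
    n = math.sqrt(w * w + x * x + y * y + z * z)
    w, x, y, z = w / n, x / n, y / n, z / n
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def covariance(log_scale, quat):
    """Sigma = R diag(s^2) R^T, symmetric positive definite by construction."""
    rot = quat_to_rotation(quat)
    s2 = [math.exp(2.0 * s) for s in log_scale]  # squared axis scales
    return [
        [sum(rot[i][k] * s2[k] * rot[j][k] for k in range(3)) for j in range(3)]
        for i in range(3)
    ]
```

Predicting scale and rotation separately, rather than the nine covariance entries, guarantees a valid covariance matrix for every network output.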

[0061] S210. Based on the camera parameters, project the Gaussian center position in the set of attribute parameters of the three-dimensional Gaussian onto the two-dimensional image plane.

[0062] Based on the camera parameters, the center position of the Gaussian in the set of attribute parameters of the three-dimensional Gaussian is projected onto the two-dimensional image plane to establish the correspondence between the three-dimensional Gaussian and the pixels of the two-dimensional image.
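The projection in [0062] can be sketched with the standard pinhole camera model, using the intrinsic parameters (focal length, principal point) and extrinsic parameters (rotation matrix, translation vector) named in the text. The function name and argument layout are illustrative assumptions.

```python
def project_point(point, rotation, translation, fx, fy, cx, cy):
    """Project a 3D Gaussian center to pixel coordinates with a pinhole model.

    Returns None when the point lies on or behind the camera plane.
    """
    # World -> camera coordinates: X_cam = R * X + t.
    xc, yc, zc = (
        sum(rotation[i][j] * point[j] for j in range(3)) + translation[i]
        for i in range(3)
    )
    if zc <= 0.0:
        return None
    # Perspective division, then focal length and principal point.
    return (fx * xc / zc + cx, fy * yc / zc + cy)
```

For example, with an identity rotation, zero translation, `fx = fy = 100`, and principal point `(50, 40)`, the point `(1, 2, 4)` lands at pixel `(75.0, 90.0)`.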

[0063] S211. Based on the human body mask, determine the rendering area, and perform Gaussian splatting rendering only within the human body area identified by the human body mask.

[0064] Based on the human body mask, the rendering area is determined, and Gaussian splatting rendering is performed only within the human body area identified by the human body mask, avoiding interference from the background area on the reconstruction results and improving rendering efficiency.
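A minimal sketch of mask-restricted rendering follows. It assumes the mask is a 2D list of 0/1 values and that a per-pixel shading callback, here the hypothetical `shade(x, y)`, stands in for the actual Gaussian rasterizer; pixels outside the mask keep the background value.

```python
def masked_pixels(mask):
    """List the (y, x) coordinates where the binary mask marks the human body."""
    return [(y, x) for y, row in enumerate(mask) for x, m in enumerate(row) if m]


def render_in_mask(mask, shade, background):
    """Evaluate shade(x, y) only inside the mask; keep background elsewhere."""
    out = [list(row) for row in background]  # copy so the input is untouched
    for y, x in masked_pixels(mask):
        out[y][x] = shade(x, y)
    return out
```

Skipping the unmasked pixels is what yields the efficiency gain described above, since the shading function is never called for the background.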

[0065] S212. Using three-dimensional Gaussian splatting rendering technology, calculate the color value and opacity of each pixel according to the attribute parameter set of the three-dimensional Gaussian, and generate a three-dimensional human body video frame corresponding to the original video.

[0066] Using 3D Gaussian Splatting rendering technology, the color value and opacity of each pixel are calculated based on the attribute parameter set of the 3D Gaussian. By projecting the 3D Gaussian onto the image plane and performing alpha blending, a reconstructed 3D human video frame corresponding to the original video is generated.
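The alpha blending and mask-restricted rendering described above can be sketched as follows. For simplicity, this illustration assumes that the same depth-sorted list of contributions applies at every pixel and ignores the spatial falloff of each projected Gaussian; the function names `composite_pixel` and `render_masked` are hypothetical.

```python
import numpy as np

def composite_pixel(colors, alphas):
    """Front-to-back alpha blending of depth-sorted contributions at one pixel."""
    out = np.zeros(3)
    transmittance = 1.0                       # fraction of light not yet absorbed
    for c, a in zip(colors, alphas):
        out += transmittance * a * np.asarray(c, dtype=np.float64)
        transmittance *= 1.0 - a
    return out, 1.0 - transmittance           # pixel color, accumulated opacity

def render_masked(colors, alphas, mask):
    """Composite only at pixels where the binary human body mask is nonzero."""
    h, w = mask.shape
    image = np.zeros((h, w, 3))
    opacity = np.zeros((h, w))
    for y, x in zip(*np.nonzero(mask)):
        image[y, x], opacity[y, x] = composite_pixel(colors, alphas)
    return image, opacity

# Two contributions, nearest first: red, then blue, each with opacity 0.5.
colors = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0)]
alphas = [0.5, 0.5]
mask = np.array([[1, 0],
                 [0, 1]])
image, opacity = render_masked(colors, alphas, mask)
# Masked pixels receive color (0.5, 0, 0.25) and opacity 0.75; others stay 0.
```

Skipping pixels outside the mask is what avoids both background interference and unnecessary blending work.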

[0067] Optionally, the method further includes: a network optimization step, calculating the reconstruction loss between the 3D human video frame and the image frame, the reconstruction loss including photometric loss, perceptual loss and / or frequency domain loss; and optimizing the parameters of the feature extraction network, the multi-scale sinusoidal activation module and the Gaussian attribute prediction head network through backpropagation based on the reconstruction loss until the network converges.
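A minimal sketch of a reconstruction loss that combines a photometric term with a frequency domain term is given below. The L1 form of both terms, the comparison of FFT amplitude spectra, and the weight `w_freq` are assumptions for illustration. The perceptual loss is omitted because it would require a pretrained feature network.

```python
import numpy as np

def photometric_loss(pred, target):
    """Mean absolute pixel difference between two HxWx3 images."""
    return float(np.mean(np.abs(pred - target)))

def frequency_loss(pred, target):
    """Mean absolute difference between 2D FFT amplitude spectra."""
    fp = np.abs(np.fft.fft2(pred, axes=(0, 1)))
    ft = np.abs(np.fft.fft2(target, axes=(0, 1)))
    return float(np.mean(np.abs(fp - ft)))

def reconstruction_loss(pred, target, w_freq=0.1):
    """Weighted sum of the two terms; w_freq is a hypothetical weight."""
    return photometric_loss(pred, target) + w_freq * frequency_loss(pred, target)

rendered = np.zeros((4, 4, 3))   # stand-in for a rendered video frame
reference = np.ones((4, 4, 3))   # stand-in for the original image frame
loss = reconstruction_loss(rendered, reference)
```

In a training loop, a differentiable framework would compute the same quantities so that gradients can flow back to the three networks named above.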

[0068] As shown in Tables 1 and 2, the experimental results demonstrate that the 3D human reconstruction method provided by this invention outperforms existing methods on objective evaluation metrics such as PSNR, SSIM, LPIPS, and FID. Particularly noteworthy are the high-frequency error (High-Freq Error) and frequency-domain high-frequency error (FFT-High Error) metrics. The high-frequency error of this invention is 0.043, lower than that of the comparative methods Animatable Gaussians (0.052), Pose Vocab (0.071), SLRF (0.079), and ARAH (0.112). This verifies the technical advantage of the multi-scale sinusoidal activation module in enhancing high-frequency detail representation, and addresses the blurriness and distortion that existing methods exhibit when restoring fine structures such as skin texture and clothing wrinkles.

[0069] Table 1

| Method | PSNR↑ | SSIM↑ | LPIPS↓ | FID↓ | FFT-Distance↓ |
| --- | --- | --- | --- | --- | --- |
| HiFi-GaussianAvatar | 27.3549 | 0.9690 | 0.0398 | 26.624 | 0.5531 |
| Animatable Gaussians | 27.1901 | 0.9682 | 0.0438 | 29.412 | 0.6039 |
| Pose Vocab | 25.4564 | 0.9657 | 0.0436 | 50.454 | 0.6299 |
| SLRF | 26.1567 | 0.9535 | 0.0421 | 53.412 | 0.6795 |
| ARAH | 20.2553 | 0.9609 | 0.0985 | 87.774 | 0.6856 |

Table 2

| Method | High-Freq Error↓ | FFT-High Error↓ |
| --- | --- | --- |
| HiFi-GaussianAvatar | 0.043 | 0.118 |
| Animatable Gaussians | 0.052 | 0.129 |
| Pose Vocab | 0.071 | 0.158 |
| SLRF | 0.079 | 0.169 |
| ARAH | 0.112 | 0.205 |

The 3D human body reconstruction method provided by this invention involves acquiring an original video and extracting original video data from it. The original video data includes pose parameters, image frames, camera parameters, and a human body mask. The pose parameters are input into a feature extraction network, which extracts intermediate features containing spatial semantics and pose-related information. These intermediate features are then enhanced with high-frequency features to obtain enhanced features. A set of 3D Gaussian attribute parameters is predicted based on the enhanced features. Finally, the 3D human body video frame corresponding to the original video is reconstructed based on the attribute parameter set, image frames, camera parameters, and the human body mask. This method achieves high-fidelity 3D human body reconstruction by enhancing intermediate features with a frequency domain enhancement strategy, significantly improving the prediction accuracy of 3D Gaussian attributes and enhancing the reconstruction quality of high-frequency details.

[0070] The three-dimensional human body reconstruction device provided by the present invention is described below. The description of the device and the description of the three-dimensional human body reconstruction method above may be read in conjunction with, and cross-referenced to, each other.

[0071] Figure 4 is a schematic diagram of the structure of the three-dimensional human body reconstruction device provided by the present invention, which specifically includes the following modules. The data extraction module 401 is used to acquire the original video and extract original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters, and a human body mask. For detailed explanations, please refer to the relevant descriptions in the above method embodiments; they will not be repeated here.

[0072] The feature extraction module 402 is used to input the pose parameters into the feature extraction network, and extract intermediate features containing spatial semantics and pose-related information through the feature extraction network. For detailed explanations, please refer to the relevant descriptions in the above method embodiments; they will not be repeated here.

[0073] The feature enhancement module 403 is used to enhance the intermediate features with high-frequency features to obtain enhanced features. For detailed explanations, please refer to the relevant descriptions in the above method embodiments; they will not be repeated here.

[0074] The attribute prediction module 404 is used to predict the set of attribute parameters of the three-dimensional Gaussian based on the enhanced features. For detailed explanations, please refer to the relevant descriptions in the above method embodiments; they will not be repeated here.

[0075] The reconstruction module 405 is used to reconstruct the three-dimensional human video frame corresponding to the original video based on the attribute parameter set, image frame, camera parameters, and human mask. For detailed explanation, please refer to the relevant descriptions in the above method embodiments; they will not be repeated here.

[0076] Figure 5 illustrates a schematic diagram of the physical structure of an electronic device. As shown in Figure 5, the electronic device may include a processor 510, a communications interface 520, a memory 530, and a communication bus 540, wherein the processor 510, communications interface 520, and memory 530 communicate with each other via the communication bus 540. The processor 510 can call logical instructions in the memory 530 to execute a three-dimensional human reconstruction method. This method includes: acquiring an original video and extracting original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters, and a human mask; inputting the pose parameters into a feature extraction network to extract intermediate features containing spatial semantics and pose-related information; performing high-frequency feature enhancement on the intermediate features to obtain enhanced features; predicting a set of attribute parameters for a three-dimensional Gaussian based on the enhanced features; and reconstructing the three-dimensional human video frame corresponding to the original video based on the set of attribute parameters, image frames, camera parameters, and the human mask.

[0077] Furthermore, the logical instructions in the aforementioned memory 530 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0078] On the other hand, the present invention also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the three-dimensional human reconstruction method provided by the above methods. The method includes: acquiring an original video and extracting original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters, and a human mask; inputting the pose parameters into a feature extraction network and extracting intermediate features containing spatial semantics and pose-related information through the feature extraction network; performing high-frequency feature enhancement on the intermediate features to obtain enhanced features; predicting a set of attribute parameters of a three-dimensional Gaussian based on the enhanced features; and reconstructing the three-dimensional human video frame corresponding to the original video based on the set of attribute parameters, image frames, camera parameters, and the human mask.

[0079] In another aspect, the present invention also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the three-dimensional human body reconstruction method provided by the above methods. The method includes: acquiring an original video and extracting original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters, and a human body mask; inputting the pose parameters into a feature extraction network to extract intermediate features containing spatial semantics and pose-related information; performing high-frequency feature enhancement on the intermediate features to obtain enhanced features; predicting a set of attribute parameters for a three-dimensional Gaussian based on the enhanced features; and reconstructing the three-dimensional human body video frame corresponding to the original video based on the set of attribute parameters, image frames, camera parameters, and the human body mask.

[0080] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0081] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0082] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for three-dimensional human body reconstruction, characterized in that it comprises: acquiring an original video and extracting original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters and a human body mask; inputting the pose parameters into a feature extraction network, and extracting intermediate features containing spatial semantics and pose-related information through the feature extraction network; performing high-frequency feature enhancement on the intermediate features to obtain enhanced features; predicting a set of attribute parameters of a three-dimensional Gaussian based on the enhanced features; and reconstructing the three-dimensional human body video frame corresponding to the original video based on the set of attribute parameters, the image frames, the camera parameters and the human body mask.

2. The method according to claim 1, characterized in that acquiring the original video and extracting original video data from the original video includes: splitting the original video into a continuous sequence of image frames; performing pose estimation on the image frame sequence to extract the pose parameters, wherein the pose parameters include at least one of human joint positions, bone orientations and SMPL model parameters; extracting the camera parameters from the image frame sequence using a camera calibration or motion reconstruction algorithm, wherein the camera parameters include an intrinsic parameter matrix and an extrinsic parameter matrix, the intrinsic parameter matrix includes the focal length and principal point coordinates, and the extrinsic parameter matrix includes a rotation matrix and a translation vector; and extracting the human body mask from the image frame sequence using a human body segmentation model or a threshold segmentation algorithm, wherein the human body mask is a binary mask used to identify the human body region and the background region in the image frames.

3. The method according to claim 1 or 2, characterized in that the feature extraction network adopts a StyleUNet architecture, which includes multiple convolutional layers; and the step of inputting the pose parameters into the feature extraction network and extracting intermediate features containing spatial semantics and pose-related information through the feature extraction network includes: encoding the pose parameters into a pose feature map, wherein the pose parameters include human joint heatmaps and/or skeletal orientation vectors; inputting the pose feature map into the downsampling path of the StyleUNet architecture, and extracting multi-scale spatial features layer by layer through the multiple convolutional layers, wherein the ReLU activation function is used for nonlinear transformation in the convolutional layers; during downsampling, passing shallow features in the multi-scale spatial features to the corresponding upsampling layers through skip connections to preserve detail information; inputting the multi-scale spatial features into the upsampling path of the StyleUNet architecture, restoring the resolution of the pose feature map through deconvolution or interpolation operations, and fusing the shallow features from the skip connections to obtain upsampled features; and fusing the pose feature map with the upsampled features to generate the intermediate features containing spatial semantics and pose-related information.

4. The method according to claim 3, characterized in that performing high-frequency feature enhancement on the intermediate features to obtain enhanced features includes: inputting the intermediate features into a multi-scale sinusoidal activation module, which uses a multi-scale sinusoidal activation function to enhance the high-frequency features of the intermediate features to obtain the enhanced features; wherein the multi-scale sinusoidal activation module includes two cascaded multi-scale sinusoidal activation layers, and the output of the first multi-scale sinusoidal activation layer is used as the input of the second multi-scale sinusoidal activation layer, so as to recover high-frequency detail information at different scales in a hierarchical manner.

5. The method according to claim 4, characterized in that predicting the set of attribute parameters of the three-dimensional Gaussian based on the enhanced features includes: inputting the enhanced features into a Gaussian attribute prediction head network, and predicting the set of attribute parameters of the three-dimensional Gaussian through the Gaussian attribute prediction head network, wherein the set of attribute parameters includes the Gaussian center position, covariance matrix, color, opacity, and spherical harmonic coefficients.

6. The method according to claim 5, characterized in that reconstructing the three-dimensional human body video frame corresponding to the original video based on the set of attribute parameters, the image frames, the camera parameters, and the human body mask includes: projecting, based on the camera parameters, the Gaussian center position in the set of attribute parameters of the three-dimensional Gaussian onto the two-dimensional image plane; determining the rendering area based on the human body mask, and performing Gaussian splatting rendering only within the human body area identified by the human body mask; and using three-dimensional Gaussian splatting rendering technology, calculating the color value and opacity of each pixel based on the set of attribute parameters of the three-dimensional Gaussian, and generating the reconstructed three-dimensional human body video frame corresponding to the original video.

7. The method according to claim 6, characterized in that the method further includes: calculating the reconstruction loss between the three-dimensional human body video frame and the image frame; and optimizing the parameters of the feature extraction network, the multi-scale sinusoidal activation module, and the Gaussian attribute prediction head network through backpropagation based on the reconstruction loss.

8. A three-dimensional human body reconstruction device, characterized in that it comprises: a data extraction module, used to acquire an original video and extract original video data from the original video, wherein the original video data includes pose parameters, image frames, camera parameters and a human body mask; a feature extraction module, used to input the pose parameters into a feature extraction network and extract intermediate features containing spatial semantics and pose-related information through the feature extraction network; a feature enhancement module, used to perform high-frequency feature enhancement on the intermediate features to obtain enhanced features; an attribute prediction module, used to predict a set of attribute parameters of a three-dimensional Gaussian based on the enhanced features; and a reconstruction module, used to reconstruct the three-dimensional human body video frame corresponding to the original video based on the set of attribute parameters, the image frames, the camera parameters, and the human body mask.

9. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, it implements the three-dimensional human body reconstruction method as described in any one of claims 1 to 7.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the three-dimensional human body reconstruction method as described in any one of claims 1 to 7.