Human body three-dimensional model generation method, training method and related device
RGB image features and image-enhanced radar point features are extracted for multimodal pose estimation, fused by a Transformer network, and trained against motion capture marker data. This addresses the low accuracy and poor robustness of human 3D reconstruction in complex environments and achieves high-precision human 3D model generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- HANGZHOU DIANZI UNIV
- Filing Date
- 2026-02-26
- Publication Date
- 2026-06-12
Figure CN122199792A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of multimodal information fusion technology, and in particular to a method for human pose estimation based on millimeter-wave radar and RGB images.
Background Technology
[0002] With the development of computer vision and intelligent sensing technologies, 3D human pose estimation and human mesh reconstruction have received widespread attention in fields such as intelligent surveillance, human-computer interaction, virtual reality, and behavior analysis. Traditional methods mostly rely on monocular or multi-view RGB cameras to detect 2D joints or contours of the human body, and then infer 3D pose and mesh through geometric constraints or deep learning models. However, schemes that rely solely on visible light images are prone to failure in complex lighting, strong reflections, nighttime scenes, and environments with severe occlusion, and are also sensitive to the color of the subject's clothing and background texture, making it difficult to guarantee robustness in practical applications.
[0003] On the other hand, millimeter-wave radar has good penetration and all-weather operation capabilities, and can stably perceive targets in environments such as low light, smoke, and slight obstruction. However, the point cloud data it collects is usually sparse and has low spatial resolution, making it difficult to distinguish the detailed structure of the human body. It is quite difficult to use millimeter-wave data alone to perform fine human posture and mesh reconstruction.
[0004] In recent years, multimodal fusion methods have begun to combine RGB images and millimeter-wave radar data to complement each other's strengths and weaknesses: images provide fine-grained appearance semantic information, while millimeter-wave radar provides stable spatial and motion information. Meanwhile, parametric human models (such as SMPL / SMPLX) can express complex human shapes and poses with a small number of parameters, making them suitable as a unified reconstruction representation. However, existing multimodal pose estimation methods still have shortcomings in data acquisition and annotation, cross-modal alignment, and end-to-end training processes. For example, time synchronization and spatial calibration between different sensors are complex, easily introducing errors and making it difficult to obtain high-quality 3D mesh annotations; multimodal feature fusion methods are relatively direct and do not fully explore the local correspondence between image semantics and millimeter-wave point clouds. Therefore, there is still an urgent need for a systematic training method that can fully utilize the complementary advantages of millimeter-wave radar and RGB images within a unified data acquisition and annotation framework, and combine with the SMPLX model to achieve high-precision pose estimation.
Summary of the Invention
[0005] In view of the shortcomings of the prior art described above, the purpose of this application is to provide a method for generating a three-dimensional human body model, a training method, and related equipment, so as to improve the anti-interference ability in complex environments and improve the reconstruction accuracy of the three-dimensional human body model.
[0006] In a first aspect, this application provides a method for generating a three-dimensional human body model, the method comprising:
[0007] Acquire an RGB image and a point cloud sequence from a radar;
[0008] Extract the local feature map, global feature map, and enhanced local feature map of the RGB image;
[0009] The point cloud sequence is encoded into point features of radar points;
[0010] The point features are enhanced using the local feature map to obtain the enhanced point features.
[0011] The enhanced local feature map, the global feature map, and the enhanced point features are encoded into a token sequence;
[0012] A three-dimensional human body model is generated based on the token sequence.
[0013] In one implementation of the first aspect, the step of performing feature enhancement processing on the point features using the local feature map to obtain enhanced point features includes:
[0014] Cross-modal semantic weights are obtained based on the local feature map and the point cloud sequence;
[0015] The point features and the cross-modal semantic weights are concatenated and then input into a multilayer perceptron to obtain cross-modal fine-grained correlation coefficients.
[0016] Multiplying the cross-modal fine-grained correlation coefficient and the cross-modal semantic weight yields the semantic modulation point features;
[0017] The point features and the semantically modulated point features are concatenated along the channel dimension to obtain the enhanced point features.
[0018] In one implementation of the first aspect, generating a three-dimensional human body model based on the token sequence includes:
[0019] The token sequence is input into the Transformer network to obtain the fused feature sequence;
[0020] The model parameters of the human three-dimensional model are obtained by decoding the fused feature sequence using a regression head.
[0021] A three-dimensional human body model is generated based on the model parameters.
[0022] In a second aspect, this application also provides a method for training a neural network.
[0023] The neural network is used for:
[0024] Acquire an RGB image and a point cloud sequence from a radar;
[0025] Extract the local feature map, global feature map, and enhanced local feature map of the RGB image;
[0026] The point cloud sequence is encoded into point features of radar points;
[0027] The point features are enhanced using the local feature map to obtain the enhanced point features.
[0028] The enhanced local feature map, the global feature map, and the enhanced point features are encoded into a token sequence;
[0029] The token sequence is input into the Transformer network to obtain the fused feature sequence;
[0030] The model parameters of the human three-dimensional model are obtained by decoding the fused feature sequence using a regression head.
[0031] The training method includes:
[0032] While acquiring the RGB image and the point cloud sequence, the three-dimensional coordinates of the reflective marker points are obtained through a motion capture camera;
[0033] Fit a second three-dimensional human body model based on the three-dimensional coordinates of the reflective markers and generate second model parameters;
[0034] The RGB image and the point cloud sequence are input into the neural network to obtain the first model parameters; and a first human 3D model is generated based on the first model parameters.
[0035] The first loss is obtained based on the joint point error between the first human 3D model and the second human 3D model;
[0036] The second loss is obtained based on the vertex error between the first and second human 3D models;
[0037] The third loss is obtained based on the vertex acceleration error between the first and second human 3D models;
[0038] The fourth loss is obtained based on the error between the first model parameters and the standard model parameters;
[0039] The total loss is obtained based on the first loss, the second loss, the third loss, and the fourth loss;
[0040] The parameters of the neural network are updated based on the total loss.
[0041] In a third aspect, this application also provides a three-dimensional human body model generation device, comprising:
[0042] The data acquisition module is used to acquire RGB images and point cloud sequences from radar.
[0043] The feature extraction module is used to extract local feature maps, global feature maps, and enhanced local feature maps of the RGB image;
[0044] A point feature generation module is used to encode the point cloud sequence into point features of radar points;
[0045] The point feature enhancement module is used to perform feature enhancement processing on the point features using the local feature map to obtain enhanced point features;
[0046] The token sequence generation module is used to encode the enhanced local feature map, the global feature map, and the enhanced point features into a token sequence.
[0047] The human body 3D model generation module generates a human body 3D model based on the token sequence.
[0048] In a fourth aspect, this application also provides an electronic device, including a memory and a processor, wherein the processor is configured to execute a computer program stored in the memory to cause the electronic device to perform the human body three-dimensional model generation method.
[0049] In a fifth aspect, this application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the aforementioned method for generating a three-dimensional human body model.
[0050] As described above, the human body 3D model generation method, neural network training method, human body 3D model generation device, electronic device, and computer-readable storage medium described in this application have at least one of the following beneficial effects:
[0051] By fusing two modal data—RGB images and millimeter-wave radar point cloud sequences—and combining local and global feature maps extracted from the RGB images, and using the global feature maps to enhance the local feature maps, the representation capability of image features under complex lighting and occlusion scenarios is effectively improved. Simultaneously, cross-modal association is performed between the point features of radar points and the local feature maps of the images. A multilayer perceptron is used to model the correlation coefficient between radar points and image regions, achieving point feature enhancement based on image semantic guidance. This significantly improves the semantic discriminative power of sparse point clouds, thereby enhancing the system's anti-interference capability in low-light, reflective, and partially occluded environments.
[0052] By uniformly encoding the enhanced local feature map, global feature map, and enhanced point features into a token sequence and inputting it into the Transformer network for deep fusion, the self-attention mechanism is fully utilized to model the long-distance dependency between multimodal features. This enables adaptive interaction between image and radar features in a unified latent space, allowing the fused features to more accurately reflect the posture and shape structure of the human body, thereby improving the reconstruction accuracy of the human 3D model.
[0053] During training, RGB images, radar point cloud sequences, and 3D coordinates of reflective markers captured by motion capture cameras are acquired simultaneously. A second human 3D model and its parameters are fitted based on the marker coordinates and used as supervised ground truth. A loss function is constructed by comparing the error between the first model parameters output by the neural network and the ground truth parameters. Combined with the geometric error between the first and second human 3D models, a multi-task optimization objective is formed, enabling the model to achieve end-to-end training without manual annotation, reducing annotation costs and improving training efficiency.
[0054] By decoding the fused feature sequence using a regression head, the model parameters of a 3D human body model can be predicted, and a high-fidelity human body mesh can be generated accordingly. This has great potential for engineering application in real-world scenarios such as intelligent security, human-computer interaction, and virtual reality.
Attached Figure Description
[0055] Figure 1 is a flowchart of a method for generating a three-dimensional human body model according to an embodiment of this application.
[0056] Figure 2 is a schematic diagram of the OverLoCK network in one embodiment of this application.
[0057] Figure 3 is a flowchart of step S200 in one embodiment of this application.
[0058] Figure 4 is a structural schematic of the feature extraction module in one embodiment of this application.
[0059] Figure 5 is a detailed flowchart of how the Dynamic Block module updates the local and global feature maps in one embodiment of this application.
[0060] Figure 6 is a structural schematic of the Dynamic Block module in one embodiment of this application.
[0061] Figure 7 is a structural schematic of the GDSA module in one embodiment of this application.
[0062] Figure 8 is a detailed flowchart of step S400 in one embodiment of this application.
[0063] Figure 9 is a detailed flowchart of step S700 in one embodiment of this application.
[0064] Figure 10 illustrates an application scenario of step S100 in one embodiment of this application.
[0065] Figure 11 is a structural schematic of a human body three-dimensional model generation device in one embodiment of this application.
[0066] Figure 12 is a schematic diagram of a human body three-dimensional model generation device in one embodiment of this application.
[0067] Figure 13 is a schematic diagram of the structure of an electronic device according to an embodiment of this application.
Detailed Implementation
[0068] The following specific examples illustrate the implementation of this application. Those skilled in the art can easily understand other advantages and effects of this application from the content disclosed in this specification. This application can also be implemented or applied through other different specific embodiments, and various details in this specification can also be modified or changed based on different viewpoints and applications without departing from the spirit of this application. It should be noted that, unless otherwise specified, the following embodiments and features in the embodiments can be combined with each other.
[0069] It should be noted that the illustrations provided in the following embodiments are only schematic representations of the basic concept of this application. Therefore, the drawings only show the components related to this application and are not drawn according to the actual number, shape and size of the components in the actual implementation. In the actual implementation, the form, quantity and proportion of each component can be arbitrarily changed, and the layout of the components may also be more complex.
[0070] The following will elaborate on the principles and implementation methods of the human body 3D model generation method, neural network training method, human body 3D model generation device, electronic device, and computer-readable storage medium of this embodiment, so that those skilled in the art can understand the human body 3D model generation method, neural network training method, human body 3D model generation device, electronic device, and computer-readable storage medium of this embodiment without creative effort.
[0071] Please refer to Figure 1. This application provides a method for generating a three-dimensional human body model, which includes:
[0072] Step S100: Obtain an RGB image and a point cloud sequence from a radar;
[0073] Step S200: Extract the local feature map, global feature map, and enhanced local feature map of the RGB image;
[0074] Step S300: Encode the point cloud sequence into point features of radar points;
[0075] Step S400: Use the local feature map to perform feature enhancement processing on the point features to obtain the enhanced point features;
[0076] Step S500: Encode the enhanced local feature map, global feature map, and enhanced point features into a token sequence;
[0077] Step S600: Generate a 3D human body model based on the token sequence.
[0078] In step S100, RGB images and point cloud sequences are acquired synchronously and aligned using timestamps.
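The timestamp alignment in step S100 can be illustrated with a minimal sketch. It assumes both streams carry sorted timestamps in seconds and pairs each image with the nearest radar frame; the function name `align_by_timestamp` and the tolerance `max_gap` are hypothetical and do not come from this application.

```python
from bisect import bisect_left

def align_by_timestamp(image_ts, radar_ts, max_gap=0.05):
    """Pair each image timestamp with the nearest radar timestamp.

    Both lists are assumed sorted, in seconds. Pairs further apart than
    max_gap (a hypothetical tolerance) are dropped.
    """
    pairs = []
    for i, t in enumerate(image_ts):
        k = bisect_left(radar_ts, t)
        # Candidates: the radar frame just before and just after t.
        candidates = [j for j in (k - 1, k) if 0 <= j < len(radar_ts)]
        if not candidates:
            continue
        j = min(candidates, key=lambda c: abs(radar_ts[c] - t))
        if abs(radar_ts[j] - t) <= max_gap:
            pairs.append((i, j))
    return pairs

image_ts = [0.000, 0.033, 0.066, 0.100]
radar_ts = [0.010, 0.060, 0.110, 0.300]
pairs = align_by_timestamp(image_ts, radar_ts)  # list of (image index, radar index)
```

Because the radar frame rate may differ from the camera frame rate, one radar frame can be matched to more than one image in this sketch.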
[0079] In step S200, local feature maps are used to capture detailed information such as edges and textures in the image, which is suitable for locating key areas of the human body. Global feature maps can be obtained through operations such as downsampling or global pooling, and are used to express the overall semantic context of the image, providing prior guidance for subsequent feature enhancement.
[0080] In step S300, the point features of the radar points and the method for generating them can follow existing techniques. For example, the point feature of a radar point is a high-dimensional vector that represents its physical attributes, such as position, velocity, and echo intensity, and is used to form a spatial perception representation of the target.
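As one possible illustration of step S300, the sketch below applies a shared per-point multilayer perceptron to raw radar attributes. The five-attribute layout (x, y, z, radial velocity, echo intensity), the layer widths, and the random untrained weights are assumptions chosen for the example; the application leaves the encoder to existing techniques.

```python
import numpy as np

def encode_points(points, w1, b1, w2, b2):
    """Shared per-point MLP: (..., 5) raw attributes -> (..., C) point features."""
    hidden = np.maximum(points @ w1 + b1, 0.0)  # ReLU
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
T, N, C = 4, 64, 32                      # frames, points per frame, feature width (illustrative)
points = rng.normal(size=(T, N, 5))      # x, y, z, radial velocity, echo intensity
w1, b1 = rng.normal(size=(5, 16)) * 0.1, np.zeros(16)
w2, b2 = rng.normal(size=(16, C)) * 0.1, np.zeros(C)
point_features = encode_points(points, w1, b1, w2, b2)  # shape (T, N, C)
```

Since the same weights are applied to every point, reordering the points only reorders the output features, which suits unordered point clouds.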
[0081] By extracting and utilizing local and global feature maps of RGB images, and using global features to enhance local feature maps, the stability and expressive power of local feature maps under conditions such as illumination changes and background interference are effectively improved, enhancing the ability to capture details of human posture.
[0082] This embodiment enhances the point features of radar points by using local feature maps of the image, achieving cross-modal transfer of image semantic information to millimeter-wave point clouds. This significantly mitigates the lack of semantic information in sparse point clouds and improves the accuracy of the radar modality in human structure recognition. By encoding the enhanced local feature maps, global features, and enhanced point features into a unified token sequence, a structurally consistent and semantically rich input representation is provided for subsequent deep networks, facilitating deep interaction and fusion of multimodal features. Generating a 3D human model based on the token sequence fully leverages the advantages of sequence modeling paradigms such as Transformer in modeling long-distance dependencies and cross-modal relationships, which improves the pose plausibility and geometric fidelity of the final output 3D model. This embodiment effectively addresses the problems of low accuracy and poor robustness in 3D human reconstruction caused by the limitations of a single modality in complex environments, demonstrating good practicality and engineering expansion potential.
[0083] Please refer to Figure 2. In one embodiment of this application, in step S200, the OverLoCK network is used to extract local feature maps, global feature maps, and enhanced local feature maps of the RGB image.
[0084] Please refer to Figure 3 and Figure 4. In one embodiment of this application, step S200 includes:
[0085] Step S210: Perform depthwise separable convolution on the RGB image to obtain the first feature map;
[0086] By using depthwise separable convolution, the computational load and number of parameters are significantly reduced while retaining effective feature extraction capabilities, which helps improve inference efficiency and enables the method of this embodiment to be implemented on resource-constrained devices.
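The parameter saving can be checked with simple arithmetic. The sketch below counts weights only (biases ignored) for a standard convolution and for a depthwise convolution followed by a 1×1 pointwise convolution; the channel counts and kernel size are illustrative values, not figures from this application.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (biases ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k convolution plus a 1 x 1 pointwise convolution."""
    depthwise = k * k * c_in      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

c_in, c_out, k = 64, 128, 3
standard = standard_conv_params(c_in, c_out, k)
separable = depthwise_separable_params(c_in, c_out, k)
ratio = separable / standard      # equals 1 / c_out + 1 / k**2
```

For these values the separable form needs roughly 12% of the weights of the standard convolution.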
[0087] Step S220: The RGB image and the first feature map are fused through the first residual connection to obtain the second feature map;
[0088] Introducing residual connections helps alleviate the vanishing gradient problem during network training, ensuring that low-level visual information such as edges and color transitions are not lost in the second feature map.
[0089] Step S230: Normalize the second feature map to obtain the third feature map;
[0090] Normalization can be performed using batch normalization or layer normalization to stabilize the input distribution of each layer, accelerate model convergence, improve generalization ability, and enhance adaptability to changes in illumination and contrast of the input image.
[0091] Step S240: Perform dilated reparameterized convolution on the third feature map to obtain the fourth feature map;
[0092] By employing dilated reparameterized convolution, multiple parallel branches are used during the training phase to achieve wide-range perception. During the inference phase to generate the 3D human body model, they are merged into a single convolution kernel, thereby perceiving a wider range of spatial context information, helping to identify partially occluded areas and improving the robustness of the model.
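The merging of parallel branches into a single kernel can be sketched as follows. The example is a single-channel simplification with one large dense branch and one small dilated branch; batch normalization folding is omitted, and the kernel sizes (7 and 3) and the dilation rate (2) are assumptions chosen for illustration.

```python
import numpy as np

def conv2d_same(x, kernel, dilation=1):
    """Single-channel 2D cross-correlation with zero padding (output size = input size)."""
    k = kernel.shape[0]
    pad = dilation * (k - 1) // 2
    xp = np.pad(x, pad)
    out = np.zeros_like(x, dtype=float)
    for i in range(k):
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + x.shape[0], dj:dj + x.shape[1]]
    return out

def dilated_to_dense(kernel, dilation):
    """Insert zeros between taps so a dilated kernel becomes an equivalent dense kernel."""
    k = kernel.shape[0]
    size = dilation * (k - 1) + 1
    dense = np.zeros((size, size))
    dense[::dilation, ::dilation] = kernel
    return dense

def merge_branches(large_kernel, small_kernel, dilation):
    """Fold a dilated small-kernel branch into the center of the large dense kernel."""
    dense = dilated_to_dense(small_kernel, dilation)
    offset = (large_kernel.shape[0] - dense.shape[0]) // 2
    merged = large_kernel.copy()
    merged[offset:offset + dense.shape[0], offset:offset + dense.shape[0]] += dense
    return merged

rng = np.random.default_rng(0)
x = rng.normal(size=(12, 12))
large = rng.normal(size=(7, 7))
small = rng.normal(size=(3, 3))
# Training-time view: two parallel branches whose outputs are summed.
two_branch = conv2d_same(x, large) + conv2d_same(x, small, dilation=2)
# Inference-time view: one convolution with the merged kernel.
one_branch = conv2d_same(x, merge_branches(large, small, dilation=2))
```

Because convolution is linear, both views give the same output, so the extra branches add no cost at inference.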
[0093] Step S250: First expand the dimensions of the fourth feature map and then compress it back to the original dimensions to obtain the fifth feature map;
[0094] Step S260: Adjust the channel weights of the fifth feature map to obtain the sixth feature map;
[0095] For example, channel description vectors are generated based on global average pooling, and then the importance coefficients of each channel are calculated through a fully connected layer. These coefficients are then multiplied with the original features to complete the weighting, thereby adjusting the weights of the feature channels.
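The channel weighting described above can be sketched in a few lines. The sketch assumes a single fully connected layer followed by a sigmoid so that each importance coefficient lies between 0 and 1; the sigmoid, the shapes, and the random weights are assumptions made for the example.

```python
import numpy as np

def channel_reweight(feat, w, b):
    """feat: (C, H, W). Scale each channel by a learned importance coefficient."""
    desc = feat.mean(axis=(1, 2))                    # global average pooling -> (C,)
    coeff = 1.0 / (1.0 + np.exp(-(w @ desc + b)))    # fully connected layer + sigmoid (assumed)
    return feat * coeff[:, None, None], coeff

rng = np.random.default_rng(0)
feat = rng.normal(size=(8, 5, 5))
w, b = rng.normal(size=(8, 8)), np.zeros(8)
reweighted, coeff = channel_reweight(feat, w, b)
```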
[0096] Step S270: The second feature map and the sixth feature map are fused through the second residual connection to obtain the seventh feature map;
[0097] Step S280: Use the seventh feature map as the new input in place of the RGB image, repeat the above steps multiple times, and use the final seventh feature map as the local feature map.
[0098] This embodiment provides a solid foundation for guiding radar point cloud enhancement and multimodal fusion by efficiently and effectively extracting local feature maps from RGB images, thereby improving the accuracy and robustness of human body 3D model generation.
[0099] Step S290: Further downsample the local feature map, reducing its width and height to 1/32 of those of the RGB image, to obtain the global feature map.
[0100] Please refer to Figure 5 and Figure 6. In one embodiment of this application, step S200 further includes:
[0101] Step S2100: Concatenate the local feature map and the global feature map along the channel dimension to obtain the eighth feature map;
[0102] Step S2110: Perform depthwise separable convolution on the eighth feature map to obtain the ninth feature map;
[0103] Step S2120: The eighth feature map and the ninth feature map are fused through the third residual connection to obtain the tenth feature map;
[0104] Step S2130: Normalize the tenth feature map to obtain the eleventh feature map;
[0105] Step S2140: Adjust the region weights of the eleventh feature map through a spatial attention mechanism to obtain the twelfth feature map;
[0106] Step S2150: The tenth feature map and the twelfth feature map are fused through the fourth residual connection to obtain the thirteenth feature map;
[0107] Step S2160: Normalize the thirteenth feature map to obtain the fourteenth feature map;
[0108] Step S2170: First expand the dimensions of the fourteenth feature map and then compress it back to the original dimensions to obtain the fifteenth feature map;
[0109] Step S2180: The thirteenth feature map and the fifteenth feature map are fused through the fifth residual connection to obtain the sixteenth feature map;
[0110] Step S2190: Activate the sixteenth feature map using a nonlinear activation function to obtain a new local feature map;
[0111] Step S2111: Fuse the sixteenth feature map with the initial context prior to obtain a new global feature map;
[0112] Step S2112: Repeat the above steps multiple times, and use the final local feature map as the enhanced local feature map.
[0113] This embodiment effectively mitigates the information degradation problem caused by deep nonlinear stacking by employing a three-level residual connection, ensuring that the feature information of the original RGB image is preserved and reused in multiple rounds of transformation, thus improving the stability of model training. This embodiment introduces a spatial attention mechanism to achieve adaptive focusing in the spatial dimension, enabling the network to automatically enhance its response to key human body regions such as joints and torso, while suppressing the influence of irrelevant background regions. This embodiment improves the discrimination capability in complex scenes.
[0114] In one specific embodiment, the preprocessed RGB image is fed into an image encoder; at different stages of the encoder, low-level local texture and edge features and high-level global semantic features are output respectively; the multi-scale features are saved hierarchically as several feature maps and a global feature vector, which serve as the image input for subsequent cross-modal alignment and fusion.
[0115] Please refer to Figure 2. In one specific embodiment, the resolution of the RGB image is H×W, the resolution of the local feature map is H/16 × W/16, and the resolution of the global feature map is H/32 × W/32.
[0116] Please refer to Figure 2 and Figure 7. In one specific embodiment, the global feature map is updated in the Focus-Net's dynamic block using the following formula:
[0117]
[0118] Where α and β are learnable parameters, and P0 is the initial context prior.
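The formula itself is not reproduced in this text. One form that would be consistent with step S2111 and with the symbol definitions above is a learnable weighted sum of the current block output and the initial context prior. The notation $P_i$ for the context feature derived from the sixteenth feature map at iteration $i$, and $P_{i+1}$ for the updated global feature map, is hypothetical and is offered only as an assumption:

```latex
P_{i+1} = \alpha \, P_{i} + \beta \, P_{0}
```

Under this reading, $\beta$ controls how strongly the initial prior $P_0$ is re-injected at each iteration, which would keep the global context from drifting over repeated updates.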
[0119] Please refer to Figure 8. In one embodiment of this application, step S400 includes:
[0120] Step S410: Obtain cross-modal semantic weights based on local feature maps and point cloud sequences;
[0121] Step S420: The point features and cross-modal semantic weights are concatenated and input into the multilayer perceptron to obtain the cross-modal fine-grained correlation coefficients.
[0122] Step S430: Multiply the cross-modal fine-grained correlation coefficient and the cross-modal semantic weight to obtain the semantic modulation point features;
[0123] Step S440: The point features and semantically modulated point features are concatenated along the channel dimension to obtain the enhanced point features.
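Steps S420 to S440 can be sketched as follows, taking the cross-modal semantic weights as given. The two-layer perceptron, the sigmoid on its output (so that each coefficient lies between 0 and 1), and all dimensions are assumptions made for the example.

```python
import numpy as np

def enhance_point_features(point_feat, sem_weight, w1, b1, w2, b2):
    """point_feat: (N, Cp), sem_weight: (N, Cs) -> enhanced point features (N, Cp + Cs)."""
    joint = np.concatenate([point_feat, sem_weight], axis=1)   # S420: concatenate
    hidden = np.maximum(joint @ w1 + b1, 0.0)                  # S420: MLP hidden layer (ReLU)
    corr = 1.0 / (1.0 + np.exp(-(hidden @ w2 + b2)))           # S420: coefficients, (N, Cs)
    modulated = corr * sem_weight                              # S430: semantic modulation
    return np.concatenate([point_feat, modulated], axis=1)     # S440: channel concatenation

rng = np.random.default_rng(0)
N, Cp, Cs = 10, 32, 8
point_feat = rng.normal(size=(N, Cp))
sem_weight = rng.normal(size=(N, Cs))
w1, b1 = rng.normal(size=(Cp + Cs, 16)) * 0.1, np.zeros(16)
w2, b2 = rng.normal(size=(16, Cs)) * 0.1, np.zeros(Cs)
enhanced = enhance_point_features(point_feat, sem_weight, w1, b1, w2, b2)
```

The original point features are kept unchanged in the first `Cp` channels, so the image-derived channels add information without overwriting the radar measurements.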
[0124] In one embodiment of this application, step S410 includes:
[0125] Step S411: Project the point cloud sequence onto a two-dimensional plane to obtain the image pixel coordinates;
[0126] Step S412: Align the image pixel coordinates with the local feature map to obtain the feature map region where the radar point is located;
[0127] Step S413: Extract features from the feature map region to obtain the image semantic sampling vector of the radar point;
[0128] Step S414: Perform global average pooling on the spatial dimension of the local feature map to obtain the channel mean vector;
[0129] Step S415: Normalize the elements of the channel mean vector to obtain the normalized channel mean vector.
[0130] Step S416: Multiply the image semantic sampling vector with the normalized channel mean vector to obtain the cross-modal semantic weight of the radar point.
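Steps S411 to S416 can be sketched as below. The sketch assumes a pinhole camera with intrinsic matrix `K`, radar-to-camera extrinsics `R` and `t`, a feature map stride of 16, nearest-cell sampling in step S413, and softmax normalization in step S415; none of these specific choices is stated in this application, and all radar points are assumed to lie in front of the camera.

```python
import numpy as np

def cross_modal_semantic_weights(points_xyz, feat, K, R, t, stride=16):
    """points_xyz: (N, 3) radar points; feat: (C, Hf, Wf) local feature map -> (N, C)."""
    cam = points_xyz @ R.T + t                    # radar frame -> camera frame
    uvw = cam @ K.T                               # S411: pinhole projection
    uv = uvw[:, :2] / uvw[:, 2:3]                 # pixel coordinates (u, v)
    C, Hf, Wf = feat.shape
    col = np.clip((uv[:, 0] // stride).astype(int), 0, Wf - 1)   # S412: pixel -> feature cell
    row = np.clip((uv[:, 1] // stride).astype(int), 0, Hf - 1)
    sampled = feat[:, row, col].T                 # S413: image semantic sampling vectors (N, C)
    mean = feat.mean(axis=(1, 2))                 # S414: channel mean vector (C,)
    e = np.exp(mean - mean.max())
    norm_mean = e / e.sum()                       # S415: softmax normalization (assumed)
    return sampled * norm_mean                    # S416: cross-modal semantic weights

rng = np.random.default_rng(0)
K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
feat = rng.normal(size=(8, 30, 40))               # a 480 x 640 image at stride 16
points = np.column_stack([rng.uniform(-1, 1, 20), rng.uniform(-1, 1, 20), rng.uniform(2, 5, 20)])
weights = cross_modal_semantic_weights(points, feat, K, R, t)
```

Points that project outside the image are clamped to the nearest border cell here; a real system might instead discard them.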
[0131] In one specific embodiment, the extrinsic parameters between the millimeter-wave radar and the camera are used for point-to-pixel projection. For each radar point, the corresponding local image descriptor is sampled or retrieved from the image feature map. Based on similarity or attention mechanisms, the local semantic information of the image is aggregated onto the corresponding point cloud features to form a semantically enhanced local feature map of the point cloud. The confidence level of the fused point cloud features is estimated and outlier filtering is performed. The point cloud features can also be aggregated over a short time window by combining the temporal information of adjacent frames to enhance stability.
[0132] The local image features obtained by the image feature extraction module are used as the image semantic features in this step, and the local semantic guidance matrix of the image is constructed from them.
[0133] For the point cloud features of the current frame, the point feature matrix is obtained.
[0134] The point feature matrix and the locally related image semantic features are input into the MLP to obtain the cross-modal fine-grained correlation coefficients:
[0135]
[0136] The final result is an enhanced point feature representation.
[0137]
[0138] In one specific embodiment, in step S500, the enhanced local feature map, global feature map, and enhanced point features are encoded into a token sequence. The enhanced local feature map is divided into multiple local feature map blocks, each block being encoded as a token; the global feature map is encoded as an independent token; each enhanced point feature is also encoded as a token; modality identifiers and position codes are added to all tokens to form a unified token sequence. Specifically, point cloud features are clustered by farthest point sampling (FPS), and image features are divided according to a grid. During embedding, all modalities are uniformly mapped to 2048 dimensions with learnable absolute position embeddings, and the maximum number of positions is configured as 445, 455, or 677.
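The token construction in step S500 can be sketched as follows. As a simplification, farthest point sampling is used here to pick representative points rather than to form clusters, each grid block is averaged into one token, and the embedding width is 64 instead of 2048 to keep the example small. The function names, the projection dictionary `proj`, the grid size, and the number of sampled points are hypothetical.

```python
import numpy as np

def farthest_point_sampling(xyz, m):
    """Return indices of m points chosen greedily to be mutually far apart."""
    chosen = [0]
    dist = np.linalg.norm(xyz - xyz[0], axis=1)
    for _ in range(m - 1):
        nxt = int(dist.argmax())
        chosen.append(nxt)
        dist = np.minimum(dist, np.linalg.norm(xyz - xyz[nxt], axis=1))
    return np.array(chosen)

def build_tokens(local_feat, global_feat, point_feat, xyz, proj, mod_emb, pos_emb, grid=4, m=16):
    C, H, W = local_feat.shape   # H and W are assumed divisible by grid
    # Image tokens: average each grid x grid block of the enhanced local feature map.
    blocks = local_feat.reshape(C, H // grid, grid, W // grid, grid).mean(axis=(2, 4))
    img_tok = blocks.reshape(C, -1).T @ proj["image"]                    # (blocks, D)
    glb_tok = global_feat.mean(axis=(1, 2))[None, :] @ proj["global"]    # (1, D)
    idx = farthest_point_sampling(xyz, m)
    pts_tok = point_feat[idx] @ proj["point"]                            # (m, D)
    # Add one modality embedding per source, then a learnable absolute position embedding.
    tokens = np.concatenate([img_tok + mod_emb[0], glb_tok + mod_emb[1], pts_tok + mod_emb[2]])
    return tokens + pos_emb[: len(tokens)]

rng = np.random.default_rng(0)
D = 64                                        # illustrative; the embodiment uses 2048
local_feat = rng.normal(size=(32, 8, 8))
global_feat = rng.normal(size=(48, 4, 4))
point_feat = rng.normal(size=(100, 40))
xyz = rng.normal(size=(100, 3))
proj = {"image": rng.normal(size=(32, D)),
        "global": rng.normal(size=(48, D)),
        "point": rng.normal(size=(40, D))}
mod_emb = rng.normal(size=(3, D))
pos_emb = rng.normal(size=(445, D))           # 445 is one of the listed maximum position counts
tokens = build_tokens(local_feat, global_feat, point_feat, xyz, proj, mod_emb, pos_emb)
```

With these sizes the sequence holds 4 image tokens, 1 global token, and 16 point tokens.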
[0139] In one specific embodiment, the local feature map of the image, the global feature map of the image, and the local feature map of the semantically enhanced point cloud are uniformly encoded into a feature token sequence:
[0140] Add a modality identifier and a position code to each token:
[0141] The encoded token sequence is input into a multi-layer self-attention Transformer module, and the relationships between tokens are modeled through a multi-head self-attention mechanism:
[0142]
[0143] The final output is the fused multimodal latent space feature representation. This provides input for SMPLX mesh generation.
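The multi-head self-attention used for fusion can be sketched with the standard scaled dot-product formulation. This is the textbook mechanism rather than the exact module of this application; the sequence length, the embedding width of 64, and the random weights are illustrative, while the 8 heads match the embodiment described in this application.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_self_attention(x, wq, wk, wv, wo, heads):
    """x: (L, D) token sequence. Returns the fused sequence (L, D) and attention maps."""
    L, D = x.shape
    d = D // heads

    def split(w):
        return (x @ w).reshape(L, heads, d).transpose(1, 0, 2)    # (heads, L, d)

    q, k, v = split(wq), split(wk), split(wv)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))         # (heads, L, L)
    out = (attn @ v).transpose(1, 0, 2).reshape(L, D)             # merge heads
    return out @ wo, attn

rng = np.random.default_rng(0)
L, D, heads = 21, 64, 8
x = rng.normal(size=(L, D))
wq, wk, wv, wo = (rng.normal(size=(D, D)) * 0.1 for _ in range(4))
fused, attn = multi_head_self_attention(x, wq, wk, wv, wo, heads)
```

Every token attends to every other token regardless of modality, which is what lets image tokens and radar point tokens exchange information in one pass.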
[0144] In one specific embodiment, the Transformer module employs a TokenFusion Transformer with a 3-layer depth and an 8-head attention mechanism. The Transformer module uses a standard FFN structure (input -> hidden layer -> GELU -> dropout -> hidden layer -> output layer -> dropout). The first layer has an input / output dimension of 2048, a hidden layer dimension of 4096, and a dropout rate of 0.1. The second, intermediate layer reduces the dimension from 2048 to 512, and the third, output layer reduces the dimension from 512 to 64.
[0145] In one specific embodiment, in step S700, the fused multimodal features are used as input, and the regression head network predicts coarse-grained parameters of the SMPLX model or a low-resolution human body mesh. The SMPLX model is expressed as a function of the shape parameters, the joint rotation parameters, and the global translation parameters.
[0146] By employing a multi-stage upsampling and local refinement module within a graph convolutional network, the mesh resolution is progressively improved and the detailed structure optimized. During the refinement process, prior knowledge of human structure and geometric consistency constraints are incorporated to enhance the naturalness and accuracy of the generated mesh. The final output is a high-resolution 3D human mesh model, achieving high-precision estimation of human pose and shape.
[0147] Please refer to Figure 9 and Figure 10. This application also provides a method for training a neural network, comprising:
[0148] Step S710: While acquiring the RGB image and point cloud sequence, the three-dimensional coordinates of the reflective marker points are acquired through a motion capture camera;
[0149] In a unified scenario, the millimeter-wave radar, RGB camera, and motion capture system are simultaneously triggered or started using timestamps to collect data. The intrinsic and extrinsic parameters of the three sensors are calibrated, and the coordinate transformation relationship between the systems is established. Motion capture markers are placed at key points on the subjects, and the marker point sequence is recorded. At the same time, the corresponding radar point cloud and RGB frame and their time information are saved. The collected data are verified for integrity and time alignment to provide synchronous raw data pairs for subsequent automatic annotation and training.
[0150] Step S720: Fit the second human body 3D model based on the 3D coordinates of the reflective marker points and generate the second model parameters;
[0151] Step S730: Input the RGB image and point cloud sequence into the neural network to obtain the first model parameters; and generate the first human three-dimensional model based on the first model parameters.
[0152] Step S740: Based on the joint point error between the first human body 3D model and the second human body 3D model, obtain the first loss;
[0153] Step S750: Based on the vertex error between the first human body 3D model and the second human body 3D model, obtain the second loss;
[0154] Step S760: Based on the vertex acceleration error between the first human 3D model and the second human 3D model, obtain the third loss;
[0155] Step S770: Based on the error between the first model parameters and the standard model parameters, obtain the fourth loss;
[0156] Step S780: Based on the first loss, the second loss, the third loss, and the fourth loss, obtain the total loss;
[0157] Step S790: Update the parameters of the neural network based on the total loss.
[0158] In one specific embodiment, a multi-task loss function that combines four loss terms is used for optimization during training,
[0159] where:
[0160] the first loss is the Euclidean distance between the predicted joints and the labeled joints;
[0161] the second loss is the geometric error between the predicted vertices and the labeled vertices;
[0162] the third loss is a vertex motion consistency constraint between adjacent frames;
[0163] the fourth loss is a regularization term based on SMPLX priors (such as pose priors and shape priors).
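The four terms listed above could be combined as in the following sketch, which assumes a weighted sum $L = \sum_i \lambda_i L_i$. The weights $\lambda_i$, the use of a mean over points, and the function names `mean_point_distance` and `total_loss` are assumptions; the source does not give the exact form of the total loss.

```python
import math


def mean_point_distance(pred, target):
    """Mean Euclidean distance between corresponding 3D points.

    Usable for both the joint term and the vertex term.
    """
    if len(pred) != len(target) or not pred:
        raise ValueError("point lists must be non-empty and of equal length")
    return sum(math.dist(p, q) for p, q in zip(pred, target)) / len(pred)


def total_loss(losses, weights):
    """Weighted sum of the individual loss terms (weights are hypothetical)."""
    if len(losses) != len(weights):
        raise ValueError("one weight per loss term is required")
    return sum(w * l for w, l in zip(weights, losses))


pred_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
gt_joints = [(0.0, 0.0, 0.0), (1.0, 0.0, 2.0)]
joint_loss = mean_point_distance(pred_joints, gt_joints)

# Placeholder values stand in for the vertex, temporal, and prior terms.
combined = total_loss([joint_loss, 0.5, 0.25, 0.1], [1.0, 1.0, 0.5, 0.1])
print(joint_loss, combined)
```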
[0164] In one specific embodiment, the formula for calculating the third loss is as follows:
[0165]
[0166] where the third loss is computed from the vertex coordinates of the human body mesh in frame t.
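One common way to express a vertex acceleration error is through second-order finite differences of the vertex positions, $V_{t+1} - 2V_t + V_{t-1}$, compared between the predicted and the reference sequences. The sketch below follows that assumption; the symbol $V_t$, the averaging, the choice of Euclidean norm, and the function names are hypothetical and are not taken from the source formula.

```python
import math


def acceleration(seq):
    """Per-vertex second-order finite difference for each interior frame."""
    acc = []
    for t in range(1, len(seq) - 1):
        frame = []
        for prev, cur, nxt in zip(seq[t - 1], seq[t], seq[t + 1]):
            frame.append(tuple(n - 2 * c + p for p, c, n in zip(prev, cur, nxt)))
        acc.append(frame)
    return acc


def acceleration_loss(pred_seq, target_seq):
    """Mean Euclidean distance between predicted and reference accelerations."""
    pred_acc, target_acc = acceleration(pred_seq), acceleration(target_seq)
    if len(pred_acc) != len(target_acc) or not pred_acc:
        raise ValueError("need at least 3 frames and sequences of equal length")
    dists = [
        math.dist(a, b)
        for frame_a, frame_b in zip(pred_acc, target_acc)
        for a, b in zip(frame_a, frame_b)
    ]
    return sum(dists) / len(dists)


# One vertex over three frames: the reference moves at constant velocity,
# while the prediction jumps ahead in the last frame.
target_seq = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(2.0, 0.0, 0.0)]]
pred_seq = [[(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], [(3.0, 0.0, 0.0)]]
acc_loss = acceleration_loss(pred_seq, target_seq)
print(acc_loss)
```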
[0167] In one specific embodiment, the formula for calculating the fourth loss is as follows:
[0168]
[0169] where:
[0170] one term represents the shape prior;
[0171] the other term represents the pose prior.
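As a rough illustration of a regularization term with a shape part and a pose part, the sketch below penalizes the squared deviation of the predicted parameters from reference ("standard") parameters. The quadratic form, the weights `w_shape` and `w_pose`, and the function names are assumptions; the source formula is not available, and a practical pose prior may be considerably more elaborate than this stand-in.

```python
def squared_error(params, reference):
    """Sum of squared differences between two parameter vectors."""
    if len(params) != len(reference):
        raise ValueError("parameter vectors must have equal length")
    return sum((p - r) ** 2 for p, r in zip(params, reference))


def prior_loss(shape, pose, shape_ref, pose_ref, w_shape=1.0, w_pose=1.0):
    """Weighted sum of a shape term and a pose term (weights are hypothetical)."""
    return (w_shape * squared_error(shape, shape_ref)
            + w_pose * squared_error(pose, pose_ref))


shape = [0.5, -0.5]
pose = [0.1, 0.2, 0.0]
reg = prior_loss(shape, pose, shape_ref=[0.0, 0.0], pose_ref=[0.1, 0.0, 0.0])
print(reg)
```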
[0172] The scope of protection for the human body 3D model generation method and neural network training method in this application is not limited to the execution order of the steps listed in this embodiment. Any solution implemented by adding, subtracting, or replacing steps in the prior art based on the principles of this application is included within the scope of protection of this application.
[0173] Referring to Figures 11 and 12, this embodiment provides a human body three-dimensional model generation device, including:
[0174] The data acquisition module is used to acquire RGB images and point cloud sequences from radar;
[0175] The feature extraction module is used to extract local feature maps, global feature maps, and enhanced local feature maps of RGB images;
[0176] The point feature generation module is used to encode point cloud sequences into point features of radar points;
[0177] The point feature enhancement module is used to enhance point features using local feature maps to obtain enhanced point features;
[0178] The token sequence generation module is used to encode the enhanced local feature map, the global feature map, and the enhanced point features into a token sequence;
[0179] The human body 3D model generation module is used to generate a human body 3D model based on the token sequence.
[0180] As shown in the table below, the proposed methods for generating 3D human models and training the neural network achieve a joint error of 6.8 mm and a vertex error of 8.5 mm on the publicly available dataset mmBody. These figures represent improvements of 2.8% and 4.4%, respectively, over the existing advanced model ImmFusion, demonstrating the effectiveness of the proposed method.
[0181]
[0182] In the embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, or methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. For instance, the division of modules / units is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or units may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection of apparatuses or modules or units may be electrical, mechanical, or other forms.
[0183] The modules / units described as separate components may or may not be physically separate. The components shown as modules / units may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules / units can be selected to achieve the objectives of the embodiments of this application, depending on actual needs. For example, the functional modules / units in the various embodiments of this application may be integrated into one processing module, or each module / unit may exist physically separately, or two or more modules / units may be integrated into one module / unit.
[0184] Those skilled in the art will further recognize that the units and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been generally described in terms of functionality in the foregoing description. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.
[0185] Referring to Figure 13, this embodiment also provides an electronic device, which is a user's mobile device such as a mobile phone, tablet (PAD), wearable device, or smart AI device. The electronic device includes a memory for storing a computer program, and a processor for running the computer program to implement the human body 3D model generation method in the above embodiments.
[0186] The memory is connected to the processor via the system bus and communicates with it. The memory stores the computer program, and the processor runs the computer program to cause the electronic device to perform the human body 3D model generation method shown in Figures 1 to 12.
[0187] It should also be noted that the system bus mentioned above can be a Peripheral Component Interconnect (PCI) bus or an Extended Industry Standard Architecture (EISA) bus, etc. This system bus can be divided into address bus, data bus, control bus, etc. For ease of representation, only one thick line is used in the diagram, but this does not indicate that there is only one bus or one type of bus. The communication interface is used to enable communication between the database access device and other devices (such as clients, read-write databases, and read-only databases).
[0188] The processors mentioned above can be general-purpose processors, including central processing units (CPUs), network processors (NPs), etc.; they can also be digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
[0189] In addition, this embodiment also provides a storage medium storing program instructions, which, when executed by a processor, implement the human body 3D model generation method in the above embodiments.
[0190] Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing a processor, and the program can be stored in a computer-readable storage medium. The storage medium is a non-transitory medium, such as random access memory, read-only memory, flash memory, hard disk, solid-state drive, magnetic tape, floppy disk, optical disk, and any combination thereof. The aforementioned storage medium can be any available medium accessible to a computer or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid-state disk (SSD)).
[0191] The descriptions of the processes or structures corresponding to the above figures each have their own emphasis. For parts of a process or structure that are not described in detail, please refer to the relevant descriptions of other processes or structures.
[0192] The above embodiments are merely illustrative of the principles and effects of this application and are not intended to limit this application. Any person skilled in the art can modify or alter the above embodiments without departing from the spirit and scope of this application. Therefore, all equivalent modifications or alterations made by those skilled in the art without departing from the spirit and technical concept disclosed in this application should still be covered by the claims of this application.
Claims
1. A method for generating a three-dimensional human body model, characterized in that the method comprises: acquiring an RGB image and a point cloud sequence from a radar; extracting a local feature map, a global feature map, and an enhanced local feature map of the RGB image; encoding the point cloud sequence into point features of radar points; performing feature enhancement on the point features using the local feature map to obtain enhanced point features; encoding the enhanced local feature map, the global feature map, and the enhanced point features into a token sequence; and generating a three-dimensional human body model based on the token sequence.
2. The method for generating a three-dimensional human body model according to claim 1, characterized in that performing feature enhancement on the point features using the local feature map to obtain the enhanced point features comprises: obtaining cross-modal semantic weights based on the local feature map and the point cloud sequence; concatenating the point features and the cross-modal semantic weights and inputting them into a multilayer perceptron to obtain cross-modal fine-grained correlation coefficients; multiplying the cross-modal fine-grained correlation coefficients by the cross-modal semantic weights to obtain semantically modulated point features; and concatenating the point features and the semantically modulated point features along the channel dimension to obtain the enhanced point features.
3. The method for generating a three-dimensional human body model according to claim 1, characterized in that generating the three-dimensional human body model based on the token sequence comprises: inputting the token sequence into a Transformer network to obtain a fused feature sequence; decoding the fused feature sequence using a regression head to obtain model parameters of the three-dimensional human body model; and generating the three-dimensional human body model based on the model parameters.
4. A method for training a neural network, characterized in that the neural network is used for: acquiring an RGB image and a point cloud sequence from a radar; extracting a local feature map, a global feature map, and an enhanced local feature map of the RGB image; encoding the point cloud sequence into point features of radar points; performing feature enhancement on the point features using the local feature map to obtain enhanced point features; encoding the enhanced local feature map, the global feature map, and the enhanced point features into a token sequence; inputting the token sequence into a Transformer network to obtain a fused feature sequence; and decoding the fused feature sequence using a regression head to obtain model parameters of a three-dimensional human body model; and the training method comprises: while acquiring the RGB image and the point cloud sequence, acquiring three-dimensional coordinates of reflective marker points through a motion capture camera; fitting a second three-dimensional human body model based on the three-dimensional coordinates of the reflective marker points and generating second model parameters; inputting the RGB image and the point cloud sequence into the neural network to obtain first model parameters, and generating a first three-dimensional human body model based on the first model parameters; obtaining a first loss based on the joint point error between the first three-dimensional human body model and the second three-dimensional human body model; obtaining a second loss based on the vertex error between the first three-dimensional human body model and the second three-dimensional human body model; obtaining a third loss based on the vertex acceleration error between the first three-dimensional human body model and the second three-dimensional human body model; obtaining a fourth loss based on the error between the first model parameters and standard model parameters; obtaining a total loss based on the first loss, the second loss, the third loss, and the fourth loss; and updating the parameters of the neural network based on the total loss.
5. A device for generating a three-dimensional human body model, characterized in that the device comprises: a data acquisition module, used to acquire an RGB image and a point cloud sequence from a radar; a feature extraction module, used to extract a local feature map, a global feature map, and an enhanced local feature map of the RGB image; a point feature generation module, used to encode the point cloud sequence into point features of radar points; a point feature enhancement module, used to perform feature enhancement on the point features using the local feature map to obtain enhanced point features; a token sequence generation module, used to encode the enhanced local feature map, the global feature map, and the enhanced point features into a token sequence; and a human body three-dimensional model generation module, used to generate a three-dimensional human body model based on the token sequence.
6. An electronic device comprising a memory and a processor, characterized in that the processor is used to execute the computer program stored in the memory to cause the electronic device to perform the human body three-dimensional model generation method as described in any one of claims 1 to 3.
7. A computer-readable storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the computer program implements the method for generating a three-dimensional human body model as described in any one of claims 1 to 3.