A method for reconstructing both hands based on fusion of multi-scale color information and depth information

By generating a dense two-handed mesh through multi-scale fusion dual-stream feature extraction and a pyramid-shaped deep fusion network, the problem of difficulty in fusing color and depth information in existing technologies is solved, and high-precision two-handed reconstruction results are achieved.

CN117576307B · Active · Publication Date: 2026-06-26 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2023-11-03
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing technologies struggle to effectively integrate color and depth information for hand reconstruction, resulting in inaccurate reconstruction results. Performance improvements are limited, especially in cases of finger joint complexity, self-occlusion, and motion blur.

Method used

A method based on the fusion of multi-scale color and depth information is adopted. A dense two-handed mesh is generated by combining a two-stream feature extraction model and a pyramid-shaped deep fusion network with a GCN decoder, and the complementary advantages of color and depth information are used for reconstruction.

Benefits of technology

It achieves high-precision reconstruction of two-handed meshes with true depth and scale in the camera coordinate system, enhances the robustness of the model, and overcomes the limitation of traditional methods that rely on the output of the root node.



Abstract

This application discloses a two-hand reconstruction method based on the fusion of multi-scale color information and depth information. A dual-stream feature extraction model first extracts, from a monocular RGB-D image, the multi-scale color features, the center point features corresponding to the left and right hands, and the multi-scale depth features. A pyramid-shaped deep fusion network then fuses the multi-scale color features, the center point features, and the multi-scale depth features to obtain the global fusion features corresponding to the left and right hands. Finally, based on these global fusion features, a GCN decoder generates the two-hand mesh in the camera coordinate system. The application fully fuses color and depth information to reconstruct both hands, reduces the effect of blur in the RGB image and of noise in the depth image, greatly improves reconstruction accuracy, and enhances the robustness of the model. It also accurately reconstructs a three-dimensional hand mesh with real depth and scale in camera space, overcoming the limitation of relying on root node output.

Description

Technical Field

[0001] This invention relates to a hand reconstruction method in the field of human body reconstruction, specifically a hand reconstruction method based on the fusion of multi-scale color information and depth information.

Background Technology

[0002] Reconstructing the 3D pose and shape of the human hand from a single viewpoint plays a crucial role in numerous applications, such as human-computer interaction, mixed reality, motion recognition, and simulation. In recent years, a wealth of research has emerged in the field of hand reconstruction using various inputs, including monocular RGB images, monocular RGB-D images, multi-view images, and video sequences. However, due to the inherent complexity of finger joints, self-occlusion, and motion blur issues, effectively addressing the challenges in 3D hand reconstruction remains an ongoing task.

[0003] Currently, mainstream methods for hand reconstruction primarily focus on directly estimating both hands from a single RGB image. These methods are often limited by cluttered backgrounds, varying lighting conditions, and motion blur. To address these issues, traditional methods separate detection and reconstruction; that is, they first extract the hand region from the image using an existing detector and then input it into a reconstruction model. Therefore, these methods can only predict root-aligned 3D hand meshes and have the limitation of relying on the root node output.

[0004] In addition, some methods utilize RGB-D as input and use depth maps to predict sparse 3D keypoints, using depth information as auxiliary supervision. However, these methods cannot effectively fuse color and depth information for joint hand reconstruction. This problem can be attributed to the highly nonlinear nature of gestures and the inherent differences between hands, making it very difficult to obtain satisfactory reconstruction results by directly combining color images and depth maps. Therefore, it is essential to design an efficient fusion strategy specifically for hand reconstruction tasks to use both color and depth information for reconstruction.

[0005] The first existing technique, described by Mueller et al. in their paper "Real-time hand tracking under occlusion from an egocentric RGB-D sensor," directly uses the depth map as an additional channel of the RGB image. This method simply changes the model's input channels from three to four. However, the performance improvement achieved by this simple fusion method remains quite limited.

[0006] The second existing technology, the method proposed by Kazakos et al. in their paper "On the fusion of rgb and depth information for hand pose estimation", performs fusion operations at the feature level. However, experimental results show that directly concatenating two features obtained from a shallow CNN does not improve performance.

[0007] The third existing technique, described by Lin et al. in their paper "Multi-level fusion net for hand pose estimation in hand-object interaction," rescales the depth map to multiple scales to aggregate features at different resolutions, then uses a feature attention structure to fuse RGB features. However, this method is primarily applied to cropped single-hand images and can only predict sparse 3D keypoints, not dense meshes. Furthermore, this method processes the depth map as a two-dimensional image, ignoring its inherent three-dimensional features.

Summary of the Invention

[0008] To effectively address the limitations of bimanual reconstruction methods based on single-modal information input, this invention proposes a bimanual reconstruction method based on the fusion of multi-scale color and depth information. This method leverages the complementary advantages of color and depth information to accurately reconstruct bimanual meshes with true depth and proportion in the camera coordinate system.

[0009] The technical solution adopted in this invention is:

[0010] S1: Input the monocular RGB-D image of the hands to be reconstructed;

[0011] S2: Use a dual-stream feature extraction model to extract multi-scale color features, center point features corresponding to the left and right hands, and multi-scale depth features from monocular RGB-D images respectively;

[0012] S3: Use a pyramid-shaped deep fusion network to fuse multi-scale color features, center point features corresponding to the left and right hands, and multi-scale depth features to obtain global fusion features corresponding to the left and right hands.

[0013] S4: Based on the global fusion features corresponding to the left and right hands, a GCN-based decoder is used to generate a two-handed mesh in the camera coordinate system.

[0014] Specifically, S2 is:

[0015] The dual-stream feature extraction model includes a color feature extraction module and a depth feature extraction module. The RGB images of the hands to be reconstructed are used as input to the color feature extraction module. The color feature extraction module extracts multi-scale color features from the RGB images. Based on the multi-scale color features, the center points and masks corresponding to the left and right hands are regressed to obtain the center point features and mask images corresponding to the left and right hands. Point clouds are generated based on the depth images of the hands to be reconstructed and the mask images corresponding to the left and right hands to obtain the initial left-hand point cloud and the initial right-hand point cloud. Then, the depth feature extraction module is used to extract the multi-scale depth features corresponding to the initial left-hand point cloud and the initial right-hand point cloud respectively.

[0016] In step S2, the generation of the initial left-hand point cloud includes the following steps:

[0017] First, based on the left-hand mask image $M_l$, the left-hand region in the depth image of the hands to be reconstructed is obtained. A 3D point set is generated based on the camera intrinsics and the depth information in the left-hand region. Then, the average depth of the 3D point set is calculated. Outliers exceeding the threshold range are filtered out based on the average depth of the point set to obtain the filtered point set. N points are selected from the filtered point set as the initial left-hand point cloud.

[0018] Specifically, S3 is:

[0019] The pyramid-shaped deep fusion network includes a scale feature fusion module, a scale feature stitching module, and a global feature fusion module. The number of scale feature fusion modules is the same as the number of scales for color features and depth features. The inputs of different scale feature fusion modules are color features of different scales. The input of the first scale feature fusion module is also set to the depth features of the corresponding scale. Adjacent scale feature fusion modules are connected by a scale feature stitching module. The scale feature stitching module stitches the output of the previous scale feature fusion module with the depth features of the corresponding scale of the next scale feature fusion module before inputting it into the next scale feature fusion module. The output of the last scale feature fusion module is used as the input of the global feature fusion module. The center point features corresponding to the left / right hand are also used as the input of the global feature fusion module. The global feature fusion module outputs the global fused features corresponding to the left / right hand.

[0020] In the scale feature fusion module, the left / right point cloud of the current scale is first projected back onto the 2D color feature map using camera intrinsic parameters to obtain the mapping relationship between depth features and color features. Then, based on the color features of the current scale that satisfy the mapping relationship, a multilayer perceptron model is used to learn scale scaling parameters and movement parameters to obtain the first scale scaling parameters and the first movement parameters. Based on the first scale scaling parameters and the first movement parameters, the depth features of the current scale are fused with the color features of the current scale that satisfy the mapping relationship through affine transformation to obtain the fused features of the current scale.
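The mapping between point-cloud points and color features can be illustrated with a small NumPy sketch. It assumes a standard pinhole camera model and nearest-neighbor lookup in the feature map; the function name `gather_color_features` and the rescaling from image pixels to feature-map cells are illustrative assumptions, not details given in the patent text.

```python
import numpy as np

def gather_color_features(points, feat_map, K, image_size):
    """Look up one color feature vector per 3D point.

    points: (N, 3) camera-space points; feat_map: (C, Hf, Wf) color features;
    K: (3, 3) pinhole intrinsics; image_size: (H, W) of the original image.
    """
    H, W = image_size
    C, Hf, Wf = feat_map.shape
    uvw = points @ K.T                        # (N, 3) homogeneous pixel coordinates
    u = uvw[:, 0] / uvw[:, 2]
    v = uvw[:, 1] / uvw[:, 2]
    # Rescale from image pixels to feature-map cells (nearest cell, clamped).
    col = np.clip(np.floor(u * Wf / W).astype(int), 0, Wf - 1)
    row = np.clip(np.floor(v * Hf / H).astype(int), 0, Hf - 1)
    return feat_map[:, row, col].T            # (N, C)
```

Points that project outside the image are clamped to the border cell here; a real implementation might instead mask them out or interpolate bilinearly.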

[0021] In the scale feature stitching module, the output of the previous scale feature fusion module is sampled, and the number of sampling points is the same as the number of point clouds in the next scale feature fusion module to obtain a feature sample set. The feature sample set is then grouped by ball query to form corresponding sub-regions. The point clouds in each sub-region are sent to PointNet to extract the corresponding local features. Finally, all local features are stitched together with the depth features of the corresponding scale of the next scale feature fusion module and then input into the next scale feature fusion module.

[0022] In the global feature fusion module, the output of the last scale feature fusion module is first fed into PointNet to extract global features; based on the center point features corresponding to the left and right hands, the scale scaling parameters and movement parameters are learned using a multilayer perceptron model to obtain the second scale scaling parameters and the second movement parameters; after performing an affine transformation on the global features based on the second scale scaling parameters and the second movement parameters, the global fused features corresponding to the left and right hands are obtained.

[0023] In the GCN-based decoder, the fully connected neural network consists of multiple regression heads, and the multiple regression heads output at least the root node coordinates, the root-aligned MANO mesh and / or GCN mesh.

[0024] The beneficial effects of this invention are:

[0025] By adopting the above-mentioned technical solution, the present invention can achieve multi-scale color and depth information fusion based on RGB-D images, generate a dense hand mesh in the camera coordinate system, and achieve high-precision reconstruction of both hands.

[0026] In this invention, depth information is used to compensate for the blurring problem of RGB images, while the depth map is encoded as an unordered point cloud to preserve more geometric details and reduce the impact of noise. The fusion of these two types of information overcomes the limitations of single-modal input, enhances the robustness of the model, and improves reconstruction accuracy.

[0027] This invention can accurately reconstruct a 3D hand mesh with true depth and scale in camera space, overcoming the limitations of traditional methods that rely on root node output.

Attached Figure Description

[0028] Figure 1 is a flowchart of the overall process of a bimanual reconstruction method based on the fusion of multi-scale color information and depth information according to an embodiment of the present invention.

[0029] Figure 2 is the framework of the two-handed reconstruction method according to an embodiment of the present invention.

[0030] Figure 3 is a schematic diagram of a pyramid-shaped deep fusion network according to an embodiment of the present invention.

Detailed Implementation

[0031] The technical solutions of the present invention will now be clearly and completely described with reference to the accompanying drawings in the embodiments of the present invention.

[0032] As shown in Figure 1, Figure 2, and Figure 3, the present invention includes the following steps:

[0033] S1: Input the monocular RGB-D image of the hands to be reconstructed. In this embodiment, three general datasets (H2O, H2O-3D, and RHD) are selected. All three datasets provide RGB-D images of both hands, supplying color and depth information for the model.

[0034] S2: Use a dual-stream feature extraction model to extract from the monocular RGB-D image the multi-scale color features $F = \{F_1, F_2, F_3\}$, where $F_1$, $F_2$, and $F_3$ are color features at three different scales; the center point features corresponding to the left and right hands; and the multi-scale depth features $P = \{P_1, P_2, P_3\}$, where $P_1$, $P_2$, and $P_3$ are depth features at three different scales, with $N = 1024$, $C = 3$, $N_1 = 512$, $C_1 = 131$, $N_2 = 128$, and $C_2 = 259$.

[0035] S2 specifically refers to:

[0036] The dual-stream feature extraction model includes a color feature extraction module and a depth feature extraction module. The monocular RGB image of the hands to be reconstructed is used as input to the color feature extraction module. In this embodiment, the color feature extraction module consists of a ResNet50 encoder and two decoders. The ResNet50 encoder extracts multi-scale color features from the RGB image; based on the multi-scale color features, the two decoders regress the center points and masks corresponding to the left and right hands, respectively, to obtain the center point features and the mask images $M_l$ and $M_r$, where $M_l$ is the mask image of the left hand and $M_r$ is the mask image of the right hand. Point clouds are generated from the depth image of the hands to be reconstructed and the mask images of the left and right hands, giving the initial left-hand point cloud and the initial right-hand point cloud and thus the two-hand point cloud $X_h = \{X_l, X_r\}$, where $X_l, X_r \in \mathbb{R}^{N \times C}$ are the initial left-hand and right-hand point clouds, respectively. Based on the two-hand point cloud, the depth feature extraction module extracts the multi-scale depth features of the left and right hands.

[0037] In S2, the generation of the initial left-hand point cloud includes the following steps:

[0038] First, based on the left-hand mask image $M_l$, the left-hand region of the depth image is obtained. A 3D point set is generated based on the camera intrinsic parameters and the depth information within the left-hand region. Then, the average depth of the 3D point set is calculated. Outliers whose depth falls outside the threshold range [-0.08, +0.08] mm around the average depth of the point set are filtered out to reduce noise interference, resulting in a filtered point set. N points are selected from the filtered point set as the initial left-hand point cloud. In this embodiment, N = 1024.
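The point cloud generation steps can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions: a pinhole camera with intrinsics `K`, a boolean mask, and uniform random selection of the N points (the text does not say how the N points are chosen). The function name `hand_point_cloud` and the parameter `max_dev` are hypothetical; `max_dev` is expressed in the same unit as the depth map.

```python
import numpy as np

def hand_point_cloud(depth, mask, K, n_points=1024, max_dev=0.08, seed=0):
    """Back-project masked depth pixels, filter outliers, keep n_points.

    depth: (H, W) depth map; mask: (H, W) boolean hand mask;
    K: (3, 3) pinhole intrinsics. Assumes the mask contains valid depth.
    """
    v, u = np.nonzero(mask & (depth > 0))     # pixel rows (v) and columns (u)
    z = depth[v, u]
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts = np.stack([(u - cx) * z / fx, (v - cy) * z / fy, z], axis=1)

    # Drop points whose depth deviates too far from the mean depth.
    keep = np.abs(pts[:, 2] - pts[:, 2].mean()) <= max_dev
    pts = pts[keep]

    # Random selection is an assumption; sample with replacement if too few.
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(pts), size=n_points, replace=len(pts) < n_points)
    return pts[idx]
```

The same routine would be called once with the left-hand mask and once with the right-hand mask to obtain the two initial point clouds.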

[0039] S3: Use a pyramid-shaped deep fusion network to fuse multi-scale color features, center point features corresponding to the left and right hands, and multi-scale depth features to obtain global fusion features corresponding to the left and right hands.

[0040] S3 specifically refers to:

[0041] The pyramid-shaped deep fusion network includes a scale feature fusion module, a scale feature stitching module, and a global feature fusion module. The number of scale feature fusion modules is the same as the number of scales for color and depth features. Different scale feature fusion modules process features at different scales. The input to each scale feature fusion module is a color feature at a different scale. The first scale feature fusion module also receives a depth feature at its corresponding scale as input. Adjacent scale feature fusion modules are connected by a scale feature stitching module. This module stitches the output of the previous scale feature fusion module with the corresponding depth feature from the next scale feature fusion module before inputting it into the next scale feature fusion module. In other words, except for the first scale feature fusion module, the inputs to other scale feature fusion modules are the output of the scale feature stitching module connected to the previous one and the color feature at its corresponding scale. The output of the last scale feature fusion module serves as the input to the global feature fusion module, and the center point features corresponding to the left and right hands also serve as input. The global feature fusion module outputs the globally fused feature.

[0042] In the scale feature fusion module, the left / right point cloud of the current scale is first projected back onto the 2D color feature map using camera intrinsic parameters to obtain the mapping relationship between depth features and color features. Then, based on the color features of the current scale that satisfy the mapping relationship, a multilayer perceptron model is used to learn the scale scaling parameters and the first movement parameters to obtain the first scale scaling parameters and the first movement parameters. Based on the first scale scaling parameters and the first movement parameters, the depth features of the current scale are fused with the color features of the current scale that satisfy the mapping relationship through affine transformation to obtain the fused features of the current scale.

[0043] In the scale feature stitching module, the output of the previous scale feature fusion module is sampled, and the number of sampling points is consistent with the number of point clouds in the next scale feature fusion module to obtain a feature sample set. Then, the feature sample set is grouped by ball query to form corresponding sub-regions. The number of sub-regions is consistent with the number of point clouds corresponding to the depth information of each scale. The point clouds in each sub-region are sent to PointNet to extract the corresponding local features. Finally, all local features are stitched together with the depth features of the corresponding scale of the next scale feature fusion module and then input into the next scale feature fusion module.

[0044] In this embodiment, the original point cloud input is downsampled to obtain point clouds with different sparsity levels. Let the original point cloud be denoted as $X_1$; the sparser point clouds $X_2$ and $X_3$ are obtained from it by downsampling, as in the PointNet++ model. The depth features of $X_1$, $X_2$, and $X_3$ are $P_1$, $P_2$, and $P_3$, respectively.
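PointNet++ commonly downsamples with farthest point sampling, so the sketch below uses that as an assumption; the patent text only says that the point cloud is downsampled. The function name `farthest_point_sampling` and the fixed start index are illustrative.

```python
import numpy as np

def farthest_point_sampling(points, n_samples, start=0):
    """Return indices of n_samples points chosen greedily to be far apart."""
    chosen = np.empty(n_samples, dtype=int)
    chosen[0] = start
    # Distance from every point to its nearest already-chosen point.
    min_dist = np.linalg.norm(points - points[start], axis=1)
    for i in range(1, n_samples):
        chosen[i] = int(np.argmax(min_dist))
        d = np.linalg.norm(points - points[chosen[i]], axis=1)
        min_dist = np.minimum(min_dist, d)
    return chosen

# Example with the point counts of this embodiment: 1024 -> 512 -> 128.
x1 = np.random.default_rng(0).normal(size=(1024, 3))
x2 = x1[farthest_point_sampling(x1, 512)]
x3 = x2[farthest_point_sampling(x2, 128)]
```

Each level is sampled from the previous one, so every point of the sparser cloud is also a point of the denser cloud.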

[0045] For the point cloud $X_i$, where $i \in \{1, 2, 3\}$, first obtain the corresponding image coordinates $(u, v)$ based on the camera intrinsic parameters $K$, $(u, v) \leftarrow K^{-1} X_i$, then obtain the corresponding color features. The color features are fed into a multilayer perceptron to learn the affine transformation parameters, namely the scaling parameter $\alpha$ and the shift parameter $\beta$. An affine transformation is then performed on the depth features $P_i$ to obtain the transformed features $\alpha \odot P_i + \beta$, where $\odot$ represents element-wise multiplication.
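The affine fusion step can be sketched as a feature-wise modulation: an MLP maps each point's color feature to a scale and a shift, which are then applied to that point's depth feature. The two-layer MLP, its layer sizes, and the names `mlp` and `affine_fuse` are assumptions for illustration; the patent text does not specify the MLP architecture.

```python
import numpy as np

def mlp(x, w1, b1, w2, b2):
    """Two-layer perceptron with ReLU; layer sizes are illustrative."""
    return np.maximum(x @ w1 + b1, 0.0) @ w2 + b2

def affine_fuse(depth_feat, color_feat, params):
    """Fuse per-point depth features with color features.

    depth_feat: (N, Cd); color_feat: (N, Cc);
    params: (w1, b1, w2, b2) with the MLP output width equal to 2 * Cd.
    """
    out = mlp(color_feat, *params)             # (N, 2 * Cd)
    alpha, beta = np.split(out, 2, axis=1)     # per-point scale and shift
    return alpha * depth_feat + beta           # element-wise product, then shift
```

Because the scale and shift are predicted per point and per channel, the color stream can amplify or suppress individual depth channels at each location.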

[0046] Drawing inspiration from PointNet++, $N_{i+1}$ sampling points are obtained from the point cloud $X_i$ of the current layer and its corresponding transformed features. Taking each sampling point as the center of a sphere of radius $R_i$, a search is performed: the distances from all points within the sphere to the center point are calculated and sorted in ascending order, and K points are selected as the sub-region of that center point. The sub-regions are then fed into PointNet to extract depth features, which are max-pooled to obtain the final features $P'_i$. These features are concatenated with the next-layer features $P_{i+1}$ to obtain the new fused features $P_{i+1} = \mathrm{cat}(P_{i+1}, P'_i)$.
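The grouping and stitching step can be sketched as below. As a simplification, the PointNet step is reduced to max pooling alone (the shared per-point MLP is omitted), and groups with fewer than k neighbors inside the sphere are padded with the nearest point, a common PointNet++ convention assumed here. The names `ball_query` and `group_and_pool` are illustrative.

```python
import numpy as np

def ball_query(points, centers, radius, k):
    """For each center, indices of up to k nearest points within radius.

    points: (N, 3); centers: (M, 3); returns (M, k) integer indices.
    """
    d = np.linalg.norm(centers[:, None, :] - points[None, :, :], axis=2)  # (M, N)
    order = np.argsort(d, axis=1)[:, :k]              # ascending distance
    sorted_d = np.take_along_axis(d, order, axis=1)
    nearest = order[:, :1]                            # padding index per center
    return np.where(sorted_d <= radius, order, nearest)

def group_and_pool(feats, groups):
    """Max-pool features over each sub-region: (N, C) -> (M, C)."""
    return feats[groups].max(axis=1)

# Stitching: concatenate pooled features with the next scale's depth features.
pts = np.array([[0.0, 0, 0], [0.1, 0, 0], [1.0, 0, 0]])
feats = np.array([[1.0, 5.0], [2.0, 0.0], [9.0, 9.0]])
groups = ball_query(pts, pts[:1], radius=0.2, k=3)
pooled = group_and_pool(feats, groups)
p_next = np.zeros((1, 3))
stitched = np.concatenate([p_next, pooled], axis=1)   # cat(P_{i+1}, P'_i)
```

In the example, the third point lies outside the sphere, so its features do not contribute to the pooled result.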

[0047] In the global feature fusion module, the output of the last scale feature fusion module is first fed into PointNet to extract global features. Based on the center point features corresponding to the left and right hands, the scale scaling parameters and movement parameters are learned using a multilayer perceptron model to obtain the second scale scaling parameters and the second movement parameters. After performing an affine transformation on the global features based on the second scale scaling parameters and the second movement parameters, the global fused features corresponding to the left and right hands are obtained.

[0048] In this embodiment, global feature fusion is performed when i = 3. The transformed features are fed into PointNet to extract the global feature G. The center point features are then fed into a multilayer perceptron to learn the affine transformation parameters, namely the scaling parameter α and the translation parameter β. An affine transformation is then performed on the global feature G to obtain the globally fused features.

[0049] In this embodiment, a pyramid-shaped deep fusion network is used in parallel to fuse the center point features, multi-scale depth features, and multi-scale color features corresponding to the left hand, and to fuse the center point features, multi-scale depth features, and multi-scale color features corresponding to the right hand, so as to obtain the global fusion features of the left hand and the global fusion features of the right hand respectively.
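A compact sketch of the per-hand global fusion is shown below. It assumes that the PointNet global feature is obtained by max pooling over points and, for brevity, replaces the multilayer perceptron by a single linear layer; the name `global_fuse` and all shapes are illustrative rather than taken from the patent.

```python
import numpy as np

def global_fuse(point_feats, center_feat, w, b):
    """Pool point features and modulate them with a hand's center feature.

    point_feats: (N, C); center_feat: (D,); w: (D, 2C); b: (2C,).
    """
    g = point_feats.max(axis=0)                     # PointNet-style global pooling
    alpha, beta = np.split(center_feat @ w + b, 2)  # scale and shift from center
    return alpha * g + beta                         # (C,) globally fused feature

# The left and right hands are processed independently with the same routine.
feats = {"left": np.array([[1.0, 4.0], [3.0, 2.0]]),
         "right": np.array([[0.0, 1.0], [5.0, 0.0]])}
center = {"left": np.array([1.0]), "right": np.array([1.0])}
w, b = np.zeros((1, 4)), np.array([2.0, 2.0, 1.0, 1.0])
fused = {hand: global_fuse(feats[hand], center[hand], w, b) for hand in feats}
```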

[0050] S4: Based on the global fusion features corresponding to the left and right hands, a GCN-based decoder is used to generate a dense mesh of both hands in the camera coordinate system to reconstruct the hands. The GCN-based decoder is a three-layer network architecture, with upsampling operations between each layer to achieve a coarse-to-fine reconstruction process.

[0051] The dense two-handed mesh in the camera coordinate system is generated by using a GCN-based decoder to directly optimize the features at each vertex, thereby learning the geometry of the 3D mesh. Specifically:

[0052] S4-1: The globally fused features are mapped into a more compact feature vector through a fully connected layer, and then concatenated with the vertex position encodings to obtain the initial graph features $G_v$;

[0053] S4-2: The initial graph features are then fed into the graph convolutional layer to obtain the output features.

[0054] The graph convolution operation for each graph feature is defined as follows:

[0055] $$G_{out} = \sum_{k=0}^{K-1} C_k(\tilde{L})\, G_{in}\, W_k$$

[0056] Here, $G_{out} \in \mathbb{R}^{N \times c_{out}}$ is the output graph feature, $C_k(\cdot)$ is the Chebyshev polynomial of order $k$, and $\tilde{L} \in \mathbb{R}^{N \times N}$ is the scaled Laplacian matrix; in the first, second, and third layers of the network, $N$ is 63, 126, and 252, respectively. $W_k \in \mathbb{R}^{c_{in} \times c_{out}}$ is a learnable weight matrix, and $G_{in} \in \mathbb{R}^{N \times c_{in}}$ is the input graph feature. In the first, second, and third layers of the network, $c_{in}$ is 512, 256, and 128, and $c_{out}$ is 256, 128, and 64, respectively.
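A Chebyshev graph convolution of this kind can be sketched in NumPy using the standard recurrence $C_0(x) = 1$, $C_1(x) = x$, $C_k(x) = 2x\,C_{k-1}(x) - C_{k-2}(x)$. The scaling $\tilde{L} = 2L/\lambda_{max} - I$ with the symmetric normalized Laplacian is the usual convention and is assumed here; the mesh adjacency, the polynomial order, and the function names are hypothetical.

```python
import numpy as np

def scaled_laplacian(adj):
    """L_tilde = 2 L / lambda_max - I, with L the normalized Laplacian.

    adj: (N, N) symmetric adjacency matrix with at least one edge.
    """
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(len(adj)) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(lap).max()
    return 2.0 * lap / lam_max - np.eye(len(adj))

def cheb_conv(g_in, l_tilde, weights):
    """Chebyshev graph convolution.

    g_in: (N, c_in); weights: (K, c_in, c_out); returns (N, c_out).
    """
    t_prev, t_curr = g_in, l_tilde @ g_in           # C_0 term and C_1 term
    out = t_prev @ weights[0]
    if len(weights) > 1:
        out = out + t_curr @ weights[1]
    for k in range(2, len(weights)):
        # Recurrence applied directly to the features, avoiding matrix powers.
        t_prev, t_curr = t_curr, 2.0 * l_tilde @ t_curr - t_prev
        out = out + t_curr @ weights[k]
    return out
```

For the first decoder layer of this embodiment, `g_in` would have shape (63, 512) and `weights` shape (K, 512, 256).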

[0057] S4-3: Upsample the output features of S4-2. Using multiple upsampling layers, the hand mesh is refined from the initial coarse sub-mesh to the final complete MANO mesh.

[0058] S4-4: Repeat S4-2 and S4-3 three times to obtain the graph features of the last layer.

[0059] S4-5: The graph features output from the last layer are fed into the regression heads to obtain a dense two-handed mesh in the camera coordinate system. Through multiple regression heads composed of fully connected layers, the graph features of the last layer are mapped to the corresponding optimization objectives, such as root node coordinates, root-aligned MANO mesh, GCN mesh, etc. In this embodiment, the graph features are mapped to a root-aligned MANO mesh to obtain a dense two-handed mesh in the camera coordinate system.
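One plausible way to combine the regression head outputs is sketched below: one head predicts root-aligned vertex positions, another predicts the root position, and the camera-space mesh is their sum. This composition, the single linear layer per head, and the mean pooling used for the root head are assumptions for illustration; the text only lists the head outputs. The vertex count 778 is that of the MANO hand model.

```python
import numpy as np

def regression_heads(graph_feats, w_mesh, w_root):
    """Map final graph features to a root-aligned mesh and a root position.

    graph_feats: (V, C); w_mesh: (C, 3); w_root: (C, 3).
    """
    verts = graph_feats @ w_mesh                   # (V, 3) root-aligned vertices
    root = graph_feats.mean(axis=0) @ w_root       # (3,) root in camera space
    return verts, root

def to_camera_space(root_aligned_verts, root_xyz):
    """Translate a root-aligned mesh by the predicted root position."""
    return root_aligned_verts + root_xyz[None, :]

feats = np.ones((778, 64))                         # one feature row per MANO vertex
w_mesh = np.zeros((64, 3))
w_root = np.zeros((64, 3))
w_root[0] = [0.0, 0.0, 0.5]
verts, root = regression_heads(feats, w_mesh, w_root)
mesh_cam = to_camera_space(verts, root)
```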

[0060] This invention proposes a hand reconstruction method based on the fusion of multi-scale color and depth information. This method effectively integrates color and depth information to accurately reconstruct a 3D hand mesh with true depth and proportion in camera space, significantly improving reconstruction accuracy and exhibiting strong robustness. Considering performance and effectiveness, this invention facilitates the application of hand reconstruction methods in the field of human body reconstruction. Furthermore, because this method overcomes the dependence on root nodes in traditional methods, it meets the interactive scenario requirements of AR / VR applications, thus also facilitating the application of hand reconstruction methods in the interactive domain of AR / VR applications.

[0061] Finally, it should be noted that the above embodiments and descriptions are only used to illustrate the technical solutions of the present invention and not to limit it. Those skilled in the art should understand that modifications or equivalent substitutions can be made to the technical solutions of the present invention without departing from the spirit and scope of the disclosure of the technical solutions of the present invention, and all such modifications and substitutions should be covered within the protection scope of the claims of the present invention.

Claims

1. A bimanual reconstruction method based on the fusion of multi-scale color information and depth information, characterized in that it includes the following steps:

S1: Input the monocular RGB-D image of the hands to be reconstructed;

S2: Use a dual-stream feature extraction model to extract multi-scale color features, center point features corresponding to the left and right hands, and multi-scale depth features from monocular RGB-D images respectively;

Specifically, S2 is: The dual-stream feature extraction model includes a color feature extraction module and a depth feature extraction module. The RGB images of the hands to be reconstructed are used as input to the color feature extraction module. The color feature extraction module extracts multi-scale color features from the RGB images. Based on the multi-scale color features, the center points and masks corresponding to the left and right hands are regressed to obtain the center point features and mask images corresponding to the left and right hands. Point clouds are generated based on the depth images of the hands to be reconstructed and the mask images corresponding to the left and right hands to obtain the initial left-hand point cloud and the initial right-hand point cloud. The depth feature extraction module then extracts the multi-scale depth features corresponding to the initial left-hand point cloud and the initial right-hand point cloud respectively.

S3: Use a pyramid-shaped deep fusion network to fuse multi-scale color features, center point features corresponding to the left and right hands, and multi-scale depth features to obtain global fusion features corresponding to the left and right hands.

Specifically, S3 is: The pyramid-shaped deep fusion network includes a scale feature fusion module, a scale feature stitching module, and a global feature fusion module. The number of scale feature fusion modules is the same as the number of scales for color features and depth features. The inputs of different scale feature fusion modules are color features of different scales. The input of the first scale feature fusion module is also set to the depth features of the corresponding scale. Adjacent scale feature fusion modules are connected by a scale feature stitching module. The scale feature stitching module stitches the output of the previous scale feature fusion module with the depth features of the corresponding scale of the next scale feature fusion module and then inputs it into the next scale feature fusion module. The output of the last scale feature fusion module is used as the input of the global feature fusion module. The center point features corresponding to the left / right hand are also used as the input of the global feature fusion module. The global feature fusion module outputs the global fusion features corresponding to the left / right hand.

S4: Based on the global fusion features corresponding to the left and right hands, a GCN-based decoder is used to generate a two-handed mesh in the camera coordinate system.

2. The bimanual reconstruction method based on the fusion of multi-scale color information and depth information according to claim 1, characterized in that, in step S2, the generation of the initial left-hand point cloud includes the following steps: First, based on the mask image of the left hand, the left-hand region in the depth image of the hands to be reconstructed is obtained. A 3D point set is generated based on the camera intrinsics and the depth information in the left-hand region. Then, the average depth of the 3D point set is calculated. Outliers exceeding the threshold range are filtered out based on the average depth of the point set to obtain the filtered point set. N points are selected from the filtered point set as the initial left-hand point cloud.

3. The bimanual reconstruction method based on the fusion of multi-scale color information and depth information according to claim 1, characterized in that, In the scale feature fusion module, the left / right point cloud of the current scale is first projected back onto the 2D color feature map using camera intrinsic parameters to obtain the mapping relationship between depth features and color features. Then, based on the color features of the current scale that satisfy the mapping relationship, a multilayer perceptron model is used to learn scale scaling parameters and movement parameters to obtain the first scale scaling parameters and the first movement parameters. Based on the first scale scaling parameters and the first movement parameters, the depth features of the current scale are fused with the color features of the current scale that satisfy the mapping relationship through affine transformation to obtain the fused features of the current scale.

4. The bimanual reconstruction method based on the fusion of multi-scale color information and depth information according to claim 1, characterized in that, In the scale feature stitching module, the output of the previous scale feature fusion module is sampled, and the number of sampling points is the same as the number of point clouds in the next scale feature fusion module to obtain a feature sample set. The feature sample set is then grouped by ball query to form corresponding sub-regions. The point clouds in each sub-region are sent to PointNet to extract the corresponding local features. Finally, all local features are stitched together with the depth features of the corresponding scale of the next scale feature fusion module and then input into the next scale feature fusion module.

5. The bimanual reconstruction method based on the fusion of multi-scale color information and depth information according to claim 1, characterized in that, In the global feature fusion module, the output of the last scale feature fusion module is first sent to PointNet to extract global features; based on the center point features corresponding to the left / right hand, the scale scaling parameters and movement parameters are learned using a multilayer perceptron model to obtain the second scale scaling parameters and the second movement parameters. After performing an affine transformation on the global features based on the second scale scaling parameter and the second movement parameter, the global fused features corresponding to the left and right hands are obtained.

6. The bimanual reconstruction method based on the fusion of multi-scale color information and depth information according to claim 1, characterized in that, in the GCN-based decoder, the fully connected neural network consists of multiple regression heads, and the multiple regression heads output at least the root node coordinates, the root-aligned MANO mesh and / or GCN mesh.