A 3D multi-person human pose estimation method based on RTMW and 3D MPPE
By combining RTMW and 3DMPPE methods, the pelvic node depth estimation is optimized and camera elevation angle compensation is performed, which solves the problems of slow response speed and lens angle influence in 3D multi-person human pose estimation and achieves high-precision multi-person human pose estimation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- NORTHEASTERN UNIV CHINA
- Filing Date
- 2025-06-17
- Publication Date
- 2026-06-26
AI Technical Summary
Existing 3D multi-person human pose estimation models have slow response speed in high-density scenes and are affected by occlusion and camera angle, making it difficult to accurately predict absolute coordinates, which limits their versatility in complex scenes.
Based on the RTMW and 3DMPPE method, the pelvic node depth estimation is optimized by combining direct regression and geometric inference, and the absolute coordinates of multiple human poses are generated by using camera elevation angle compensation, thereby reducing the effects of depth blur and lens distortion.
It improves the accuracy and versatility of pose estimation, and can generate camera center coordinates for multiple human poses under occlusion conditions, avoids the influence of lens angle, maintains lightweight characteristics, and improves the accuracy of estimation results.
Description
Technical Field
[0001] This invention relates to the field of pose estimation technology, and in particular to a 3D multi-person human pose estimation method based on RTMW and 3DMPPE.
Background Technology
[0002] 3D multi-person human pose estimation aims to obtain 3D pose information of the human body in the camera coordinate system from provided data, and is a key technology in fields such as security monitoring, sports and health detection, and motion capture. In the field of security monitoring, human pose recognition can help security personnel predict potentially dangerous actions (such as climbing over railings) in public places such as airports and shopping malls, thereby improving the speed of handling such incidents. In the field of sports and health detection, human pose recognition can assist in sports training guidance or fall detection, achieving more convenient and lower-cost posture correction and higher-precision fall warnings. In the field of motion capture, markerless intelligent human pose detection eliminates the need for expensive wearable equipment used in traditional motion capture, significantly reducing shooting costs, and is one of the current development directions of motion capture technology.
[0003] Based on the number of people estimated, human pose estimation can be divided into single-person pose estimation and multi-person pose estimation. Single-person pose estimation typically uses relative coordinates centered on the pelvis, which fails to reflect the positional relationships between multiple people in the image. In 2019, Moon et al. proposed 3DMPPE, which estimates the camera center coordinates of each pelvic node, integrating the original single-person pose estimation into the same camera coordinate system, thus achieving multi-person pose estimation.
[0004] Currently, 3D multi-person human pose estimation is divided into two methods: top-down and bottom-up. The top-down method first uses object detection methods to segment each human body region and then uses a single-person 3D pose estimation model to estimate the pose of each individual, finally merging them into a multi-person human pose estimation. When the single-person pose estimation model performs well, the top-down method can achieve high estimation accuracy. The bottom-up method, on the other hand, first detects keypoints for all people in the image, then groups the detected keypoints and assigns them to different human instances. This method is more computationally efficient than the top-down method, but it may introduce errors in the keypoint grouping process and is not adept at handling complex poses.
[0005] Currently, top-down multi-person pose estimation methods are mainly limited by the estimation efficiency of single-person pose estimation models, resulting in slow response speeds in high-density scenes. The RTMW model proposed by Jiang et al. first uses the SimCC module to replace the heatmap regression method in traditional pose estimation. By predicting pose coordinates in three dimensions separately, it reduces quantization errors while also lowering structural complexity and computational cost. Compared to traditional self-attention modules, the GAU module uses low-dimensional features and adds nonlinear features during computation, reducing computational overhead while enhancing the model's expressive power. The RTMW model, using SimCC and GAU modules, significantly reduces the computational cost of single-person pose estimation, solving the problem of slow response speed in dense scenes for top-down methods. However, the RTMW model has not yet achieved true multi-person pose estimation; its final output coordinates are relative coordinates rather than global absolute coordinates, which limits the universality of pose inference results across different scenarios.
[0006] While ensuring computational efficiency, multi-person human pose estimation must also cope with complex occlusion to guarantee accuracy. The 3DMPPE model proposed by Moon et al. uses a separate RootNet network, combining 2D image features and geometric constraints to predict the depth value of the root node, thus addressing the problem of missing depth information caused by occlusion. Decomposing the 3D pose into an absolute position and a relative pose in this way reduces the impact of occlusion on overall pose estimation. However, the 3DMPPE model cannot meet the requirements of current application scenarios in terms of accuracy and computational efficiency, and its method of directly regressing the pelvic node depth remains vulnerable to complex occlusion in multi-person scenes.
[0007] Meanwhile, the task of estimating 3D human pose from 2D images is also affected by the camera angle at the time of shooting. For example, two people of the same height standing on the same horizontal plane may not be on the same horizontal line in the 2D image. Currently, neither the RTMW nor the 3DMPPE model addresses the estimation error caused by the camera angle. To address the limitations of existing methods, there is an urgent need to explore a model that can directly predict the absolute coordinates of 3D multi-person human poses while minimizing the estimation errors caused by depth blur and camera angle, thereby improving its feasibility and generalization ability in practical applications.
Summary of the Invention
[0008] The technical problem to be solved by the present invention is to address the shortcomings of the prior art by providing a 3D multi-person human pose estimation method based on RTMW and 3DMPPE. The method is optimized based on the existing RTMW and RootNet process, which can estimate the 3D camera center coordinates of human pose while reducing the impact of depth blur and lens distortion on pose estimation.
[0009] To solve the above-mentioned technical problems, the technical solution adopted by the present invention is as follows:
[0010] A 3D multi-person human pose estimation method based on RTMW and 3DMPPE includes the following steps:
[0011] Step 1: Obtain the 3D and 2D datasets for the current human pose estimation task, and calculate the average human torso length of the datasets used before training;
[0012] Step 2: Use the RTMW model to obtain the 2.5D pose estimate of each human body, i.e., the relative coordinates of the 3D pose estimate of each human body;
[0013] Step 3: Estimate the absolute camera depth of the pelvic nodes and calculate the camera center coordinates of each human pelvic node; use two methods to estimate the depth: direct regression and geometric inference. The direct regression method is implemented using the RootNet network of the 3DMPPE method, while the geometric inference method further optimizes the pelvic node depth information by utilizing geometric information based on the direct regression method; during the RootNet regression of pelvic node depth information, the image feature information extracted by the backbone network is used to simultaneously regress the estimated camera elevation angle required in Step 4.
[0014] Step 4: To address the impact of the shooting angle during RGB image capture on pose estimation, the pose is first rotated using the elevation angle estimated in Step 3. Then, the pose estimates of multiple people are merged, and the elevation angle of the merged pose is compensated. The network parameters are updated through backpropagation based on the error between the predicted and actual values, so that the final synthesized pose is closer to the real scene. Finally, the absolute coordinates of the joints of the human body pose of multiple people are generated.
[0015] Furthermore, the specific method of step 1 is as follows:
[0016] Step 1.1: Obtain the publicly available datasets Human3.6M and 3DPW for the human pose estimation task, and divide both datasets into training and test sets. Use sequences S1, S5, S6, S7, and S8 from the Human3.6M 3D single-person dataset as the training set, and S9 and S11 as the test set. For the 3DPW 3D multi-person dataset, use the data in its provided train folder as the model training set, and the data in its provided validation and test files as the test set. Randomly sample 20% of the Human3.6M training set data and 80% of the 3DPW training set data as the final model training set.
[0017] Step 1.2: Calculate the average length of the human torso in the training set; use the custom Python function load_json_data to load the JSON file storing the labeled data as a dictionary object, and then read the keypoints attribute of each element in the dictionary to obtain the 3D coordinates of the pelvic and neck nodes; then use the np.linalg.norm function provided by the numpy package to calculate the distance between the pelvic and neck nodes, and record the calculated distance value in the empty array torso_lengths; after traversing the labeled data, use the np.mean function provided by the numpy package to calculate the average length of all torsos.
[0018] Furthermore, the specific method for step 2 is as follows:
[0019] The `init_model` function provided by MMpose is used to load the `det_model` model for object detection and the `key_model` model for pose estimation. After the image is loaded using the `cv2.imread` function from the OpenCV package, it is passed to the `inference_detector` function (part of MMDetection, which MMpose uses for person detection) to obtain the object detection result `det_result`. The detection result is then converted into a NumPy array using the `cpu` and `numpy` tensor methods provided by the PyTorch package, and the bounding box attribute values are read from it, yielding the object detection bounding boxes. The bounding box values, the image path, and the pose estimation model are then input into the `inference_topdown` function provided by MMpose to obtain the pose estimation result, which includes the 2D pose detection results for multiple people in the image and the 3D pose detection results for each individual person.
[0020] Furthermore, in step 3, the specific details of the direct regression method are as follows:
[0021] Define the function generate_patch_image. The input of this function is the previously obtained object detection box data and the original image NumPy array representation. The output is an array of images cropped according to the object detection box values. The original image NumPy array is obtained by the cv2.imread function provided in the OpenCV package. Based on the pinhole imaging principle, the approximate distance from the camera to the person is calculated according to the camera intrinsic parameters and the relevant information of the image region where the human body is located.
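[0022] Let $d$ be the distance between the camera and the human pelvic joint, in mm; $f$ the focal length of the camera, in mm; and $l_{x,\mathrm{sensor}}$ the length of the human body on the image sensor, in mm. According to the definition of the tangent function:
[0023] $\tan\theta_x = \dfrac{l_{x,\mathrm{real}}}{d} = \dfrac{l_{x,\mathrm{sensor}}}{f}$;
[0024] where $\theta_x$ is the angle subtended by the human body along the x-axis from the camera's perspective, $l_{x,\mathrm{real}}$ is the actual physical length of the human body along the x-axis, and $l_{x,\mathrm{sensor}}$ is that physical length projected onto the image sensor along the x-axis;
[0025] Let $p_x$ be the unit pixel factor on the x-axis, so that $l_{x,\mathrm{sensor}} = p_x\, l_{x,\mathrm{img}}$; then:
[0026] $d = f\,\dfrac{l_{x,\mathrm{real}}}{l_{x,\mathrm{sensor}}} = \dfrac{f}{p_x}\cdot\dfrac{l_{x,\mathrm{real}}}{l_{x,\mathrm{img}}} = \alpha_x\,\dfrac{l_{x,\mathrm{real}}}{l_{x,\mathrm{img}}}$;
[0027] where $l_{x,\mathrm{img}}$ is the pixel length of the human body in the image in the x-direction and $\alpha_x = f / p_x$;
[0028] Similarly, on the y-axis:
[0029] $d = \alpha_y\,\dfrac{l_{y,\mathrm{real}}}{l_{y,\mathrm{img}}}$;
[0030] where $l_{y,\mathrm{real}}$ is the actual physical length of the human body along the y-axis, $l_{y,\mathrm{img}}$ is the pixel length of the human body in the image in the y-direction, and $\alpha_y = f / p_y$, with $p_y$ the unit pixel factor on the y-axis;
[0031] Multiplying the two expressions for $d$ and taking the square root gives:
[0032] $d = \sqrt{\alpha_x \alpha_y\,\dfrac{l_{x,\mathrm{real}}\, l_{y,\mathrm{real}}}{l_{x,\mathrm{img}}\, l_{y,\mathrm{img}}}}$;
[0033] where $\alpha_x$ and $\alpha_y$ are camera intrinsic parameters, and $l_{x,\mathrm{real}}$ and $l_{y,\mathrm{real}}$ are assumed to be fixed values;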
[0034] Define a function `root_model` that takes as input an array of cropped images and the calculated distance `d` between the camera and the human pelvic joint, and outputs the corrected pelvic node depth. Within `root_model`, global features of the input images are first extracted using a ResNet network. Global average pooling is then applied to these features, and the pooled feature map is passed through a $1\times 1$ convolution that outputs a correction factor $\gamma$, which describes the degree to which the current human pose or body shape affects the bounding box area; the final absolute depth of the pelvic node is obtained using the following formula:
[0035] $Z_{\mathrm{reg}} = \dfrac{d}{\sqrt{\gamma}}$;
[0036] where $Z_{\mathrm{reg}}$ is the absolute depth of the pelvic node obtained through the regression method;
[0037] The global features extracted by ResNet are also fed into two fully connected layers, each with an output dimension of 512 and followed by a ReLU function, for further feature extraction. The extracted elevation-angle-related features are then mapped to a scalar output, the elevation angle, using the nn.Linear function provided by the PyTorch package.
[0038] Furthermore, the specific content of the geometric reasoning method in step 3 is as follows:
[0039] Based on the previously obtained 2.5D human pose, and using the approximate distance between the human pelvis node and the neck node as equal to the length of the human torso, the following inference is made:
[0040] ;
[0041] where the symbols denote, respectively: the absolute depth of the pelvic node obtained through geometric reasoning; the absolute depth variable of the pelvic node to be solved; the distance between the pelvic node and the neck node in the camera center coordinate system; the 2.5D coordinates of the neck node; the 2.5D coordinates of the pelvic node; the operation that calculates the absolute coordinates of a node from the absolute coordinates of the pelvic node and the node's 2.5D coordinates; and the average torso length of the training set samples;
[0042] Solving the above equation yields the following result:
[0043] ;
[0044] in,
[0045] ;
[0046] ;
[0047] ;
[0048] ;
[0049] ;
[0050] where the symbols denote, respectively: the focal lengths of the camera along the x-axis and y-axis; the coordinates of the camera's optical center; the 2D coordinates of the pelvic node; and the 3D camera center coordinates of the pelvic node obtained from the regression;
[0051] Therefore, a function `optimize_root_depth` is defined, which takes as input the pelvic node depth obtained from regression, the 2D coordinates of the pelvic node, the average torso length, and the camera intrinsics. Within the function, the current pelvic node depth is first calculated using the above formulas, and this value is then used as the depth value in the next round of calculations. At the end of each round, the `np.abs` function provided by NumPy is used to calculate the difference between the current depth value and the value from the previous round. When the difference is less than a specified threshold, the function is judged to have converged, and the current value is the final pelvic node depth obtained through geometric reasoning; otherwise, the calculation continues until the maximum number of iterations is reached.
[0052] Furthermore, in step 4, the specific method for performing attitude rotation compensation using the elevation angle estimated in step 3 is as follows:
[0053] For the previously obtained single-person 3D pose and corresponding 2D pose coordinates, a rotation matrix is used to perform rotation compensation on the 3D pose, so that its rotation angle matches the 2D projection; for each 3D joint coordinate, the following rotation matrix is applied:
[0054] ;
[0055] ;
[0056] where the symbols denote, respectively: the 3D joint coordinates before rotation; the 3D joint coordinates after rotation; and the elevation angle value obtained from direct regression in step 3.
[0057] Furthermore, in step 4, the specific method for merging the postures of multiple individuals and performing elevation angle compensation on the merged postures is as follows:
[0058] Step 4.2.1: Based on the obtained camera center coordinates of the human pelvis node and the previously predicted relative coordinates of the single-person 3D joints, obtain the absolute coordinates of each human joint.
[0059] For the depth value of each joint in the single-person 3D pose `pose_3d`, first subtract the depth value of the pelvic node, `pose_3d[:, 2][0]`, and then add the previously inferred pelvic node depth value; this yields the 3D absolute coordinates of each human body joint. At the same time, the pelvic nodes of the other human bodies are aligned with the first human body based on the vertical height of its pelvic node.
[0060] Step 4.2.2: For the predicted multiple human poses, the vertical offset between each pair of human bodies is calculated from the distance between the pelvic node and the camera and the elevation angle obtained by direct regression in step 3. The calculation formula is as follows:
[0061] ;
[0062] where the first pair of symbols denotes the pelvic node depth values of the first and second human bodies, respectively, and the second pair denotes the elevation angles obtained for the first and second human bodies by direct regression in step 3.
[0063] Based on the calculated vertical offset between each other human figure and the first human figure, the other human figures are moved to the corresponding vertical height; that is, the offset is added to the y-axis coordinates of all joints of the other human figures. Finally, each human pose is scaled so that the feet of each pose land on the y-plane.
[0064] Furthermore, in step 4, the specific method for updating the network parameters through backpropagation based on the error between the predicted and actual values is as follows:
[0065] Step 4.3.1: Calculate the error between the predicted pose and the true pose. The calculation formula is as follows:
[0066] ;
[0067] where the two symbols denote the predicted pose and the actual pose of the i-th person, respectively;
[0068] Step 4.3.2: Calculate the error between the translation value in the predicted scene and the translation value in the actual scene. The calculation formula is as follows:
[0069] ;
[0070] where the two symbols denote the translation value of the i-th person in the predicted scene and in the actual scene, respectively;
[0071] Step 4.3.3: Calculate the error between the predicted pelvic nodes and the actual pelvic nodes. The calculation formula is as follows:
[0072] ;
[0073] where the two symbols denote the predicted pelvic node and the actual pelvic node of the i-th person, respectively;
[0074] Step 4.3.4: Obtain the total error ;
[0075] Step 4.3.5: Update the network parameters based on the error through backpropagation; First, call the loss.backward() function provided in the PyTorch package to automatically calculate the gradient of each parameter in the neural network; then, use the optimizer.step() function to update the model parameters based on these gradients, so that the loss decreases and the model is gradually optimized.
[0076] The beneficial effects of adopting the above technical solution are as follows: The 3D multi-person human pose estimation method based on RTMW and 3DMPPE provided by this invention, when estimating the pelvic node depth, not only obtains the pelvic node depth information through direct regression based on the RootNet method, but also uses geometric information to further infer and optimize the pelvic node depth. This method only requires the 2.5D pose and geometric information to infer the depth, avoiding the depth blur problem caused by occlusion in the image. The obtained pelvic node depth, combined with the relative coordinates of the human pose, yields the 3D absolute coordinates of the multi-person human pose. At the same time, in the 2D-to-3D lifting process, this invention predicts the elevation angle of the human pelvic node relative to the camera, and uses this value to perform rotation and elevation angle compensation on the human pose, thereby restoring the human pose in the real scene. The optimized method can generate the camera center coordinates of the multi-person human pose under occlusion conditions, and avoids the influence of lens angle to restore the relative relationship between human poses in the real scene. While maintaining the lightweight characteristics of RTMW, it further improves the accuracy of the estimation results.
Attached Figure Description
[0077] Figure 1 is a flowchart of a 3D multi-person human pose estimation method based on RTMW and 3DMPPE provided in an embodiment of the present invention;
[0078] Figure 2 shows experimental images and the corresponding 3D multi-person human pose estimation results provided in an embodiment of the present invention.
Detailed Implementation
[0079] The specific embodiments of the present invention will be described in further detail below with reference to the accompanying drawings and examples. The following examples are for illustrative purposes only and are not intended to limit the scope of the invention.
[0080] As shown in Figure 1, the method of this embodiment is described below.
[0081] Step 1: Obtain the existing dataset and calculate the average length of the human torso.
[0082] Step 1.1: Obtain commonly used public datasets for human pose estimation tasks, such as Human3.6M and 3DPW, and divide both datasets into training and testing sets. Human3.6M and 3DPW are commonly used 3D human pose datasets. Human3.6M contains 3.6 million frames of 3D pose data from 11 actors performing 17 different actions, filmed indoors. 3DPW collects over 60,000 frames of data from seven different scenes, including five outdoor shooting scenarios such as natural environments and dynamic scenes. Sequences S1, S5, S6, S7, and S8 from the 3D single-person dataset Human3.6M are used as the training set, and S9 and S11 are used as the testing set. For the 3D multi-person dataset 3DPW, the data in its provided train folder is used as the model training set, and the data in its provided validation and test files are used as the testing set. To fully utilize the characteristics of each dataset while avoiding training bias caused by data imbalance, 20% of the Human3.6M training set and 80% of the 3DPW training set were randomly sampled as the final model training set.
[0083] Step 1.2: Calculate the average length of the human torso in the training set. Use the custom Python function `load_json_data` to load the JSON file storing the labeled data as a dictionary object. Then, read the `keypoints` attribute of each element in the dictionary to obtain the 3D coordinates of the pelvic and neck nodes. Next, use the `np.linalg.norm` function provided by the NumPy package to calculate the distance between the pelvic and neck nodes and record the calculated distance values in an empty array `torso_lengths`. After traversing the labeled data, use the `np.mean` function provided by the NumPy package to calculate the average length of all torsos.
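The torso-length computation in Step 1.2 could look roughly like the sketch below. The JSON layout (a list or dictionary of records, each with a `keypoints` list of 3D points) and the joint indices `PELVIS_IDX` and `NECK_IDX` are assumptions for illustration; the actual values depend on how the annotations are stored.

```python
import json
import numpy as np

# Hypothetical joint indices; the real ones depend on the dataset's keypoint order.
PELVIS_IDX = 0
NECK_IDX = 8

def load_json_data(path):
    """Load an annotation file and return the parsed JSON object."""
    with open(path, "r", encoding="utf-8") as fh:
        return json.load(fh)

def mean_torso_length(annotations, pelvis_idx=PELVIS_IDX, neck_idx=NECK_IDX):
    """Average pelvis-to-neck distance over all annotated poses."""
    # Accept either a dict of records or a plain list of records.
    records = annotations.values() if isinstance(annotations, dict) else annotations
    torso_lengths = []
    for record in records:
        keypoints = np.asarray(record["keypoints"], dtype=float)  # shape (J, 3)
        torso = np.linalg.norm(keypoints[neck_idx] - keypoints[pelvis_idx])
        torso_lengths.append(torso)
    return float(np.mean(torso_lengths))
```

A call such as `mean_torso_length(load_json_data("train.json"))` (file name hypothetical) would then give the average used later in the geometric reasoning step.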
[0084] Step 2: Use the RTMW model to obtain the 2.5D pose estimation of each person, i.e., the relative coordinates of the 3D pose estimation of each person. Use the `init_model` function provided by MMpose to load the `det_model` model for object detection and the `key_model` model for pose estimation. After loading the image using the `cv2.imread` function provided by the OpenCV package, input it into the `inference_detector` function (part of MMDetection, which MMpose uses for person detection) to obtain the object detection result `det_result`. Use the `cpu` and `numpy` tensor methods provided by the PyTorch package to convert the detection result into a NumPy array, and read its bounding box attribute values to obtain the object detection bounding boxes. Input the bounding box values, the image path, and the pose estimation model together into the `inference_topdown` function provided by MMpose to obtain the pose estimation result. This result includes the 2D pose detection results for multiple people in the image and the 3D pose detection results for each individual person.
[0085] Step 3: Calculate the camera center coordinates of each human pelvic node.
[0086] Step 3.1: Regress to obtain the absolute depth of the human pelvic node and the camera elevation angle. Define the `generate_patch_image` function, which takes the previously obtained object detection bounding box data and the original image NumPy array as input, and outputs an array of images cropped according to the object detection bounding box values. The original image NumPy array is obtained using the `cv2.imread` function provided in the OpenCV package. Based on the pinhole imaging principle, calculate the approximate distance from the camera to the person based on the camera intrinsic parameters and relevant information of the image region where the human body is located.
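As an illustration of the cropping step, a minimal version of `generate_patch_image` might look like this. The bounding box layout `(x1, y1, x2, y2)` in pixels and the clamping to the image borders are assumptions; the source only states the function's inputs and output.

```python
import numpy as np

def generate_patch_image(bboxes, image):
    """Crop one patch per bounding box from an H x W x C image array.

    `bboxes` is assumed to hold (x1, y1, x2, y2) pixel coordinates.
    """
    height, width = image.shape[:2]
    patches = []
    for x1, y1, x2, y2 in bboxes:
        # Clamp the box to the image so that slicing never wraps around.
        left, right = max(0, int(round(x1))), min(width, int(round(x2)))
        top, bottom = max(0, int(round(y1))), min(height, int(round(y2)))
        if right <= left or bottom <= top:
            raise ValueError("bounding box does not overlap the image")
        patches.append(image[top:bottom, left:right].copy())
    return patches
```

Returning copies keeps later in-place preprocessing of a patch from modifying the original image.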
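[0087] The principle of pinhole imaging is as follows: Let $d$ be the distance (mm) between the camera and the human pelvic joint, $f$ the focal length of the camera (mm), and $l_{x,\mathrm{sensor}}$ the length of the human body on the image sensor (mm). According to the definition of the tangent function:
[0088] $\tan\theta_x = \dfrac{l_{x,\mathrm{real}}}{d} = \dfrac{l_{x,\mathrm{sensor}}}{f}$;
[0089] where $\theta_x$ is the angle subtended by the human body along the x-axis from the camera's perspective, $l_{x,\mathrm{real}}$ is the actual physical length of the human body along the x-axis, and $l_{x,\mathrm{sensor}}$ is that physical length projected onto the image sensor along the x-axis.
[0090] Let $p_x$ be the unit pixel factor on the x-axis, so that $l_{x,\mathrm{sensor}} = p_x\, l_{x,\mathrm{img}}$; then:
[0091] $d = f\,\dfrac{l_{x,\mathrm{real}}}{l_{x,\mathrm{sensor}}} = \dfrac{f}{p_x}\cdot\dfrac{l_{x,\mathrm{real}}}{l_{x,\mathrm{img}}} = \alpha_x\,\dfrac{l_{x,\mathrm{real}}}{l_{x,\mathrm{img}}}$;
[0092] where $l_{x,\mathrm{img}}$ is the pixel length of the human body in the image in the x-direction and $\alpha_x = f / p_x$.
[0093] Similarly, on the y-axis:
[0094] $d = \alpha_y\,\dfrac{l_{y,\mathrm{real}}}{l_{y,\mathrm{img}}}$;
[0095] where $l_{y,\mathrm{real}}$ is the actual physical length of the human body along the y-axis, $l_{y,\mathrm{img}}$ is the pixel length of the human body in the image in the y-direction, and $\alpha_y = f / p_y$, with $p_y$ the unit pixel factor on the y-axis.
[0096] Multiplying the two expressions for $d$ and taking the square root gives:
[0097] $d = \sqrt{\alpha_x \alpha_y\,\dfrac{l_{x,\mathrm{real}}\, l_{y,\mathrm{real}}}{l_{x,\mathrm{img}}\, l_{y,\mathrm{img}}}}$;
[0098] where $\alpha_x$ and $\alpha_y$ are camera intrinsic parameters, and $l_{x,\mathrm{real}}$ and $l_{y,\mathrm{real}}$ are assumed to be fixed values.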
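A small numeric sketch of this pinhole estimate follows. It assumes the usual pinhole relation (distance ≈ focal length in pixels × real size / size in pixels) combined over both axes as a geometric mean. The default real-world extent of 2000 mm per axis is an illustrative assumption, not a value stated in this document.

```python
import math

def pinhole_distance(alpha_x, alpha_y, bbox_w_px, bbox_h_px,
                     real_w_mm=2000.0, real_h_mm=2000.0):
    """Approximate camera-to-person distance in mm from a bounding box.

    alpha_x, alpha_y: focal lengths expressed in pixels (camera intrinsics).
    bbox_w_px, bbox_h_px: bounding box size of the person in pixels.
    real_w_mm, real_h_mm: assumed fixed physical extent of a person (illustrative).
    """
    if bbox_w_px <= 0 or bbox_h_px <= 0:
        raise ValueError("bounding box size must be positive")
    area_real = real_w_mm * real_h_mm
    area_img = bbox_w_px * bbox_h_px
    return math.sqrt(alpha_x * alpha_y * area_real / area_img)
```

With these assumed defaults, halving the bounding box side length doubles the estimated distance, which is the behaviour the pinhole model predicts.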
[0099] Define the function `root_model`, which takes as input an array of cropped images and the calculated distance `d` between the camera and the human pelvic joint, and outputs the corrected pelvic node depth. Within this function, global features of the input images are first extracted using a ResNet network. Global average pooling is then applied to these features, and the pooled feature map is passed through a $1\times 1$ convolution that outputs a correction factor $\gamma$, which describes the degree to which the current human pose or body shape affects the bounding box area. The final absolute depth of the pelvic node obtained through the regression method, $Z_{\mathrm{reg}}$, is given by the following formula:
[0100] $Z_{\mathrm{reg}} = \dfrac{d}{\sqrt{\gamma}}$;
[0101] The global features extracted by ResNet are also fed into two fully connected layers, each with an output dimension of 512 and followed by a ReLU function, for further feature extraction. The extracted elevation-angle-related features are then mapped to a scalar output, the elevation angle, using the nn.Linear function provided by the PyTorch package.
[0102] Step 3.2: Further optimize the absolute depth of the pelvic node through geometric reasoning. Using the previously obtained 2.5D human pose, and considering that the distance between the pelvic node and the neck node is approximately equal to the length of the human torso, the following reasoning can be derived:
[0103] ;
[0104] where the symbols denote, respectively: the absolute depth of the pelvic node obtained through geometric reasoning; the absolute depth variable of the pelvic node to be solved; the distance between the pelvic node and the neck node in the camera center coordinate system; the 2.5D coordinates of the neck node; the 2.5D coordinates of the pelvic node; the operation that calculates the absolute coordinates of a node from the absolute coordinates of the pelvic node and the node's 2.5D coordinates; and the average torso length of the training set samples. Solving the above equation yields the following result:
[0105] ;
[0106] ;
[0107] ;
[0108] .
[0109] ;
[0110] .
[0111] where the symbols denote, respectively: the focal lengths of the camera along the x-axis and y-axis; the coordinates of the camera's optical center; the 2D coordinates of the pelvic node; and the 3D camera center coordinates of the pelvic node obtained from the regression.
[0112] Therefore, a function `optimize_root_depth` is defined, which takes as input the pelvic node depth obtained from regression, the 2D coordinates of the pelvic node, the average torso length, and the camera intrinsic parameters. Within the function, the current pelvic node depth is first calculated using the above formulas, and this value is then used as the depth value in the next round of calculations. At the end of each round, the `np.abs` function provided by NumPy is used to calculate the difference between the current depth value and the value from the previous round. When the difference is less than a specified threshold, the function is judged to have converged, and the current value is the final pelvic node depth obtained through geometric reasoning; otherwise, the calculation is repeated until the maximum number of iterations is reached.
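The iterative refinement can be illustrated with the sketch below. It is not the patent's update rule: it swaps in Newton's method to find the pelvis depth at which the back-projected pelvis-to-neck distance equals the average torso length. It starts from the regressed depth and uses the same stopping criteria (difference below a threshold, or maximum iterations). The argument layout, the neck's relative depth `neck_rel_z`, and the default tolerances are assumptions.

```python
def backproject(u, v, z, fx, fy, cx, cy):
    """Map pixel (u, v) at depth z to camera-centred (X, Y, Z)."""
    return ((u - cx) * z / fx, (v - cy) * z / fy, z)

def optimize_root_depth(z_init, pelvis_uv, neck_uv, neck_rel_z, torso_len,
                        intrinsics, tol=1e-6, max_iter=50):
    """Refine the pelvis depth so the pelvis-neck distance matches torso_len.

    intrinsics is (fx, fy, cx, cy); neck_rel_z is the neck depth relative
    to the pelvis taken from the 2.5D pose.
    """
    fx, fy, cx, cy = intrinsics
    z = float(z_init)
    for _ in range(max_iter):
        px, py, pz = backproject(*pelvis_uv, z, fx, fy, cx, cy)
        nx, ny, nz = backproject(*neck_uv, z + neck_rel_z, fx, fy, cx, cy)
        # Residual: squared pelvis-neck distance minus squared torso length.
        g = (nx - px) ** 2 + (ny - py) ** 2 + (nz - pz) ** 2 - torso_len ** 2
        # Analytic derivative of the residual with respect to z.
        dg = 2.0 * ((nx - px) * (neck_uv[0] - pelvis_uv[0]) / fx
                    + (ny - py) * (neck_uv[1] - pelvis_uv[1]) / fy)
        if dg == 0.0:
            break  # flat residual: no Newton step possible
        z_new = z - g / dg
        if abs(z_new - z) < tol:
            return z_new  # converged
        z = z_new
    return z
```

Because the residual is quadratic in the depth, a start value near the regressed depth typically converges in a handful of iterations.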
[0113] Step 4: Generate the absolute coordinates of the joints of multiple human body poses.
[0114] Step 4.1: Using the previously obtained single-person 3D pose and corresponding 2D pose coordinates, apply a rotation matrix to compensate for the rotation of the 3D pose, ensuring the rotation angle matches the 2D projection. Apply the rotation matrix to each 3D joint coordinate as follows:
[0115] ;
[0116] ;
[0117] where the symbols denote, respectively: the 3D joint coordinates before rotation; the 3D joint coordinates after rotation; and the elevation angle value obtained from direct regression in step 3.
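One possible form of this rotation compensation is sketched below. It assumes the elevation angle is a rotation about the camera's x-axis, given in radians, with the standard right-handed sign convention; the axis and the sign are assumptions, and the function names are illustrative.

```python
import numpy as np

def rotation_about_x(angle_rad):
    """Standard 3x3 rotation matrix about the x-axis."""
    c, s = np.cos(angle_rad), np.sin(angle_rad)
    return np.array([[1.0, 0.0, 0.0],
                     [0.0, c, -s],
                     [0.0, s, c]])

def compensate_elevation(pose_3d, elevation_rad):
    """Rotate every joint of a (J, 3) pose by the estimated elevation angle."""
    pose_3d = np.asarray(pose_3d, dtype=float)
    # Row vectors: p_rot = (R @ p), written for all joints at once.
    return pose_3d @ rotation_about_x(elevation_rad).T
```

Since the transform is a pure rotation, bone lengths are unchanged; only the orientation of the pose relative to the camera axes is affected.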
[0118] Step 4.2: Merge the poses of multiple people and perform elevation angle compensation on the merged poses.
[0119] Step 4.2.1: Based on the obtained camera center coordinates of the human pelvic node and the previously predicted relative coordinates of the individual 3D joints, obtain the absolute coordinates of each human joint. For the depth value of each joint in the individual 3D pose `pose_3d`, first subtract the depth value of the pelvic node, `pose_3d[:, 2][0]`, and then add the previously inferred depth value of the pelvic node. This yields the absolute 3D coordinates of each human body joint. At the same time, the pelvic nodes of the other human bodies are aligned with the first human body based on the vertical height of its pelvic node.
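The depth shift described in Step 4.2.1 amounts to a one-line array operation. The sketch assumes, as the expression `pose_3d[:, 2][0]` suggests, that the pelvis is joint 0 and depth is column 2; the function name is illustrative.

```python
import numpy as np

def to_absolute_depth(pose_3d, root_depth):
    """Replace root-relative joint depths with absolute camera depths."""
    pose_abs = np.array(pose_3d, dtype=float)  # copy, leaves the input untouched
    # Subtract the pelvis depth (joint 0), then add the inferred pelvis depth.
    pose_abs[:, 2] = pose_abs[:, 2] - pose_abs[0, 2] + root_depth
    return pose_abs
```

The x and y columns are left unchanged here; the vertical alignment between people is handled separately.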
[0120] Step 4.2.2: For the predicted poses of multiple people, the vertical offset between any two persons can be calculated from the distance between each pelvic node and the camera and the elevation angle obtained by direct regression in Step 3. The calculation formula is as follows:
[0121] ;
[0122] where the two depth symbols denote the pelvic node depth values of the first and second persons, respectively, and the two angle symbols denote the elevation angles obtained by direct regression in Step 3 for the first and second persons, respectively.
[0123] Based on the calculated vertical offset between each other person and the first person, the other persons are moved to the corresponding vertical height; that is, the offset is added to the y-axis coordinates of all joints of the other persons. Finally, each pose is scaled so that its feet land on the y-plane.
[0124] Step 4.3: Calculate the error.
[0125] Step 4.3.1: Calculate the error between the predicted pose and the true pose. The calculation formula is as follows:
[0126] ;
[0127] where the two symbols denote the predicted pose and the true pose of the i-th person, respectively.
[0128] Step 4.3.2: Calculate the error between the translation value in the predicted scene and the translation value in the actual scene. The calculation formula is as follows:
[0129] ;
[0130] where the two symbols denote the translation value in the predicted scene and the translation value in the actual scene for the i-th person, respectively.
[0131] Step 4.3.3: Calculate the error between the predicted pelvic nodes and the actual pelvic nodes. The calculation formula is as follows:
[0132] ;
[0133] where the two symbols denote the predicted pelvic node and the actual pelvic node of the i-th person, respectively.
[0134] Step 4.3.4: Obtain the total error.
[0135] Step 4.4: Update network parameters based on the error through backpropagation. First, call the `loss.backward()` function provided by the PyTorch package to automatically calculate the gradient of each parameter in the neural network; then, use the `optimizer.step()` function to update the model parameters based on these gradients, thereby reducing the loss and gradually optimizing the model.
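The update in Step 4.4 follows the usual PyTorch pattern, sketched below on a toy problem. The three error terms are stand-ins: squared error, equal weighting, the slicing of the output into pose, translation, and pelvic-node parts, and the tiny linear model are all assumptions made for illustration, because the error formulas are not reproduced in this text.

```python
import torch
from torch import nn

torch.manual_seed(0)

# Toy model and data (assumed shapes): 8 input features, 9 outputs split into
# pose (0:3), translation (3:6), and pelvic node (6:9).
model = nn.Linear(8, 9)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
features = torch.randn(16, 8)
targets = torch.randn(16, 9)


def training_step(model, optimizer, features, targets):
    """One gradient step on the sum of three stand-in error terms."""
    optimizer.zero_grad()  # clear gradients left over from the previous step
    pred = model(features)
    pose_err = ((pred[:, 0:3] - targets[:, 0:3]) ** 2).mean()
    trans_err = ((pred[:, 3:6] - targets[:, 3:6]) ** 2).mean()
    root_err = ((pred[:, 6:9] - targets[:, 6:9]) ** 2).mean()
    loss = pose_err + trans_err + root_err  # total error (equal weights assumed)
    loss.backward()   # compute the gradient of every parameter
    optimizer.step()  # update the parameters from those gradients
    return loss.item()


losses = [training_step(model, optimizer, features, targets) for _ in range(20)]
```

Calling `optimizer.zero_grad()` before each backward pass matters because PyTorch accumulates gradients across calls by default.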
[0136] The method in this embodiment is applicable to estimating the absolute coordinates of multiple 3D human bodies from monocular RGB images. By adding a depth estimation module for the pelvic node and performing elevation and rotation compensation for the pose, the method achieves higher accuracy in obtaining the absolute coordinates of the 3D pose.
[0137] As shown in Figure 2, the method of this embodiment is used to perform 3D multi-person human pose estimation on the experimental images. As the renderings in Figure 2 show, this embodiment can effectively reconstruct the 3D pose information of multiple people from a single RGB image. The model not only estimates the spatial position of each person's joints accurately, but also clearly reconstructs the relative positions and spatial hierarchy between the figures, demonstrating strong multi-person pose separation and reconstruction capabilities. Furthermore, the overall estimation results are highly consistent with the real-world scene, verifying the effectiveness and practical value of the model in achieving high-precision 3D multi-person human pose estimation under monocular image conditions.
[0138] This embodiment uses the publicly available 3DPW dataset for evaluation and comparison. The improved evaluation results are shown in Table 1. As can be seen from Table 1, on the 3DPW dataset, the MPJPE of the method in this embodiment is 54.62mm, which is 2.68mm lower than the 57.3mm of the 3DMPPE. This indicates that the method in this embodiment has better performance in terms of overall 3D joint positioning accuracy and can more accurately recover the three-dimensional pose of the human body.
[0139] Table 1. Comparison of the error MPJPE of this embodiment with 3DMPPE on the 3DPW dataset.
[0140]
Method             MPJPE (mm)
3DMPPE             57.3
This embodiment    54.62
[0141] Table 2 shows a comparison of the depth estimation errors of the root node before and after the improvement. The MPJPE of the method in this embodiment is 5.875 mm, which is significantly lower than the 7.499 mm of the 3DMPPE, and the error is reduced by 1.624 mm. This indicates that the method in this embodiment has higher accuracy in root node depth estimation, which helps to improve the spatial consistency and stability of the overall pose reconstruction.
[0142] Table 2 Comparison of root node depth error MPJPE before and after improvement
[0143]
Method             MPJPE (mm)
3DMPPE             7.499
This embodiment    5.875
[0144] In summary, this embodiment outperforms the existing method 3DMPPE in both overall pose estimation accuracy and critical depth estimation, verifying the effectiveness of the proposed improved strategy in 3D multi-person human pose estimation tasks.
[0145] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention, and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some or all of the technical features therein; and these modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope defined by the claims of the present invention.
Claims
1. A 3D multi-person human pose estimation method based on RTMW and 3DMPPE, characterized in that: The method includes the following steps: Step 1: Obtain the 3D and 2D datasets for the current human pose estimation task, and calculate the average human torso length of the datasets used before training; Step 2: Use the RTMW model to obtain the 2.5D pose estimate of each human body, i.e., the relative coordinates of the 3D pose estimate of each human body; Step 3: Estimate the absolute camera depth of the pelvic nodes and calculate the camera center coordinates of each human pelvic node; use two methods to estimate the depth: direct regression and geometric inference. The direct regression method is implemented using the RootNet network of the 3DMPPE method, while the geometric inference method further optimizes the pelvic node depth information by utilizing geometric information based on the direct regression method; during the RootNet regression of pelvic node depth information, the image feature information extracted by the backbone network is used to simultaneously regress the estimated camera elevation angle required in Step 4. Step 4: To address the impact of the shooting angle during RGB image capture on pose estimation, first use the elevation angle estimated in Step 3 to perform rotation compensation on the pose, then merge the pose estimates of multiple people, and perform elevation angle compensation on the merged pose. Based on the error between the predicted value and the true value, update the network parameters through backpropagation to make the final synthesized pose closer to the real scene, and finally generate the absolute coordinates of the joints of the human body pose of multiple people. 
The specific method for performing pose rotation compensation using the elevation angle estimated in Step 3 is as follows: For a single-person 3D pose and its corresponding 2D pose coordinates, a rotation matrix is used to perform rotation compensation on the 3D pose so that its rotation angle matches the 2D projection; the rotation matrix is applied to each 3D joint coordinate, mapping the 3D joint coordinates before rotation to the 3D joint coordinates after rotation, with the elevation angle taken as the value obtained by direct regression in Step 3. The specific method for merging the pose estimates of multiple people and performing elevation angle compensation on the merged pose is as follows: Step 4.2.1: Based on the obtained camera center coordinates of each human pelvic node and the previously predicted relative coordinates of the single-person 3D joints, obtain the absolute coordinates of each human joint: for the depth value of each joint in the single-person 3D pose `pose_3d`, first subtract the depth value of the pelvic node, `pose_3d[:, 2][0]`, and then add the previously inferred pelvic node depth value, which yields the absolute 3D coordinates of each human joint; at the same time, align the pelvic nodes of the other persons with that of the first person based on the vertical height of the first person's pelvic node. Step 4.2.2: For the predicted poses of multiple people, calculate the vertical offset between each pair of persons from their pelvic node depth values and the elevation angles obtained by direct regression in Step 3; the calculation formula uses the pelvic node depth values of the first and second persons and the elevation angles obtained by direct regression in Step 3 for the first and second persons, respectively. Based on the calculated vertical offset between each other person and the first person, the other persons are moved to the corresponding vertical height; that is, the offset is added to the y-axis coordinates of all joints of the other persons. Finally, each pose is scaled so that its feet land on the y-plane.
2. The 3D multi-person human pose estimation method based on RTMW and 3DMPPE according to claim 1, characterized in that: The specific method for step 1 is as follows: Step 1.1: Obtain the public datasets Human3.6M and 3DPW for the human pose estimation task, and divide the two datasets into training and testing sets; use sequences S1, S5, S6, S7, and S8 provided by the 3D single-person dataset Human3.6M as the training set content, and use S9 and S11 as the testing set content; for the 3D multi-person dataset 3DPW, use the data in its provided train folder as the model training set content, and use the data in its provided validation and test files as the testing set content; randomly sample 20% of the Human3.6M training set data and 80% of the 3DPW training set data as the final model training set; Step 1.2: Calculate the average length of the human torso in the training set; use the custom Python function load_json_data to load the JSON file storing the labeled data as a dictionary object, and then read the keypoints attribute of each element in the dictionary to obtain the 3D coordinates of the pelvic and neck nodes; then use the np.linalg.norm function provided by the numpy package to calculate the distance between the pelvic and neck nodes, and record the calculated distance value in the empty array torso_lengths; after traversing the labeled data, use the np.mean function provided by the numpy package to calculate the average length of all torsos.
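The torso-length computation of Step 1.2 can be sketched as follows. The annotation layout (a list of records, each with a `keypoints` array of 3D joints) and the joint indices `PELVIS_IDX` and `NECK_IDX` are assumptions for illustration; the actual layout depends on the dataset's annotation files.

```python
import json

import numpy as np

PELVIS_IDX = 0  # hypothetical index of the pelvic node
NECK_IDX = 1    # hypothetical index of the neck node


def load_json_data(path):
    """Load a JSON annotation file into Python objects."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def average_torso_length(annotations, pelvis_idx=PELVIS_IDX, neck_idx=NECK_IDX):
    """Mean pelvis-to-neck distance over all annotated poses."""
    torso_lengths = []
    for ann in annotations:
        keypoints = np.asarray(ann["keypoints"], dtype=float)
        # Euclidean distance between the neck and pelvic nodes.
        torso_lengths.append(np.linalg.norm(keypoints[neck_idx] - keypoints[pelvis_idx]))
    return float(np.mean(torso_lengths))


# In-memory stand-in for the output of load_json_data.
sample = [
    {"keypoints": [[0.0, 0.0, 0.0], [0.0, 500.0, 0.0]]},
    {"keypoints": [[10.0, 0.0, 0.0], [10.0, 0.0, 520.0]]},
]
mean_torso = average_torso_length(sample)
```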
3. The 3D multi-person human pose estimation method based on RTMW and 3DMPPE according to claim 2, characterized in that: The specific method for step 2 is as follows: The `init_model` function provided by MMpose is used to load the `det_model` model for object detection and the `key_model` model for pose estimation. After the image is loaded using the `cv2.imread` function from the OpenCV package, it is input into the `inference_detector` function provided by MMpose to obtain the object detection result `det_result`. The detection result is then converted into a NumPy array by calling the `cpu` and `numpy` methods of the PyTorch tensor, and the bounding box attribute values are read from it, yielding the object detection bounding boxes. The bounding box values, together with the image address and the pose estimation model, are input into the `inference_topdown` function provided by MMpose to obtain the pose estimation result, which includes the 2D pose detection results for multiple people in the image and the 3D pose detection result for each individual person.
4. The 3D multi-person human pose estimation method based on RTMW and 3DMPPE according to claim 3, characterized in that: In step 3, the specific content of the direct regression method is as follows: Define the function `generate_patch_image`, whose inputs are the previously obtained object detection bounding box data and the NumPy array representation of the original image, and whose output is an array of images cropped according to the bounding box values; the NumPy array of the original image is obtained using the `cv2.imread` function provided by the OpenCV package. Based on the principle of pinhole imaging, the approximate distance from the camera to the person is calculated by combining the camera's intrinsic parameters with information about the image region occupied by the human body. Let d be the distance between the camera and the human pelvic joint, in mm, and let f be the focal length of the camera, in mm; the length of the human body on the image sensor is likewise expressed in mm. According to the definition of the tangent function, a relation is obtained among the angle subtended by the human body along the x-axis as seen from the camera, the actual physical length of the human body along the x-axis, and the physical length of the human body's projection onto the image sensor along the x-axis. Introducing the unit pixel factor on the x-axis then expresses this relation in terms of the pixel length in the image along the x-direction. Similarly, on the y-axis, the corresponding relation is obtained from the actual physical length of the human body along the y-axis, the physical length of its projection onto the image sensor along the y-axis, and the pixel length in the image along the y-direction. Combining the two relations gives the distance d, where two of the quantities in the resulting formula are camera intrinsic parameters and two are assumed to be fixed values. Define a function `root_model` that takes as input the array of cropped images and the calculated distance `d` between the camera and the human pelvic joint, and outputs the corrected pelvic node depth. Within `root_model`, global features of the input images are first extracted using a ResNet network; global average pooling is then applied to these features, and the pooled feature map is passed through a convolution that outputs a correction factor, which describes the degree to which the current human pose or body shape affects the bounding box area; the final absolute depth of the pelvic node obtained by the regression method is computed by applying this correction factor to the distance d. The global features extracted by ResNet are also fed into two fully connected layers with an output dimension of 512 and a ReLU function for further feature extraction; the extracted elevation-angle-related features are then mapped to a scalar elevation angle output using the `nn.Linear` function provided by the PyTorch package.
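The pinhole-based distance and its correction can be illustrated with a short sketch. It assumes the relation used by the published RootNet of 3DMPPE, in which the distance is the square root of the product of the two focal lengths (in pixels) and the ratio of an assumed real-world body area to the bounding box area in pixels, and in which the correction factor divides that distance by its square root. The formulas in this text are not reproduced, so this form, the 2000 mm x 2000 mm area, and the function names are assumptions.

```python
import math

ASSUMED_REAL_AREA_MM2 = 2000.0 * 2000.0  # assumed fixed real-world body area


def pinhole_distance(fx, fy, bbox_area_px, real_area_mm2=ASSUMED_REAL_AREA_MM2):
    """Approximate camera-to-person distance in mm from the bounding box area."""
    return math.sqrt(fx * fy * real_area_mm2 / bbox_area_px)


def corrected_root_depth(distance_mm, gamma):
    """Apply a learned correction factor gamma (> 0) to the pinhole distance."""
    return distance_mm / math.sqrt(gamma)


# Example: 1500 px focal lengths and a 300 x 300 px bounding box.
d = pinhole_distance(1500.0, 1500.0, 300.0 * 300.0)
depth = corrected_root_depth(d, gamma=4.0)
```

Under this assumed form, a correction factor of 1 leaves the pinhole estimate unchanged, while larger values pull the pelvic node closer to the camera.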
5. The 3D multi-person human pose estimation method based on RTMW and 3DMPPE according to claim 4, characterized in that: In step 3, the specific content of the geometric reasoning method is as follows: Based on the previously obtained 2.5D human pose, and using the approximation that the distance between the pelvic node and the neck node equals the torso length, an equation is established in which the absolute depth of the pelvic node is the variable to be solved: the distance between the pelvic node and the neck node in the camera center coordinate system, with the absolute coordinates of each node calculated from the absolute coordinates of the pelvic node and the 2.5D coordinates of that node, is set equal to the average torso length of the training set samples. Solving this equation yields the absolute depth of the pelvic node obtained through geometric reasoning, expressed in terms of the focal lengths of the camera along the x-axis and y-axis, the camera's optical center coordinates, the 2D coordinates of the pelvic node, and the 3D camera center coordinates of the pelvic node obtained from regression. Therefore, a function `optimize_root_depth` is defined, which takes as input the pelvic node depth obtained from regression, the 2D coordinates of the pelvic node, the average torso length, and the camera intrinsic parameters. Within the function, the current pelvic node depth is calculated using the above formulas, and this value serves as the basis for the next round of calculation; at the end of each round, the `np.abs` function provided by NumPy is used to compute the difference between the current value and the value from the previous round. When the difference is less than a specified threshold, the function is judged to have converged, and the current value is the final pelvic node depth obtained through geometric reasoning; otherwise, the calculation continues until the maximum number of iterations is reached.
6. The 3D multi-person human pose estimation method based on RTMW and 3DMPPE according to claim 5, characterized in that: In step 4, the specific method for updating the network parameters through backpropagation based on the error between the predicted and true values is as follows: Step 4.3.1: Calculate the error between the predicted pose and the true pose, where the calculation formula is expressed in terms of the predicted pose and the true pose of the i-th person; Step 4.3.2: Calculate the error between the translation value in the predicted scene and the translation value in the actual scene, where the calculation formula is expressed in terms of these two translation values for the i-th person; Step 4.3.3: Calculate the error between the predicted pelvic node and the actual pelvic node, where the calculation formula is expressed in terms of the predicted and actual pelvic nodes of the i-th person; Step 4.3.4: Obtain the total error; Step 4.3.5: Update the network parameters based on the error through backpropagation: first, call the `loss.backward()` function provided by the PyTorch package to automatically calculate the gradient of each parameter in the neural network; then, use the `optimizer.step()` function to update the model parameters based on these gradients, so that the loss decreases and the model is gradually optimized.