A method and system for measuring the ratio of long bones of human limbs to height based on images.

By combining image measurement technology with two-dimensional and three-dimensional human pose estimation models trained in a self-supervised manner, the invention addresses the high measurement difficulty, large error, and limited applicability of existing techniques that estimate height from bone size. It enables convenient, high-precision measurement of the ratio of limb long bones to height and is suitable for different groups of people.

CN117224115B · Active · Publication Date: 2026-06-30 · XIANGYA HOSPITAL CENT SOUTH UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents(China)
Current Assignee / Owner: XIANGYA HOSPITAL CENT SOUTH UNIV
Filing Date: 2023-10-12
Publication Date: 2026-06-30


Abstract

This invention discloses a method and system for measuring the ratio of long bones in human limbs to height based on images. The method includes steps such as acquiring image sequences, selecting target frames, estimating two-dimensional human pose, predicting the ratio of long bones in limbs to height, labeling potential anomalies, and iterative updates. By acquiring human experience data, the method simultaneously verifies the model detection results and the potential anomaly labeling results based on this data. If there is a significant difference between the model detection results and the human experience data, it indicates that the model accuracy is insufficient and needs updating. If the number of abnormal individuals in the potential anomaly labeling results exceeds a preset proportion threshold, it indicates that the weighting of the linear relationship between long bones in limbs and height is incorrect. This invention offers convenient measurement, high measurement accuracy, and is unaffected by factors such as age, gender, or race; it is suitable for obese individuals and has good versatility.

Description

Technical Field

[0001] This invention relates to the field of anthropometry, and in particular discloses a method and system for measuring the ratio of long bones of human limbs to height based on images.

Background Technology

[0002] In recent years, people have paid increasing attention to height, and how to measure it scientifically has become a topic of great interest. Estimating height from bone length has received considerable attention because it can yield an accurate estimate of a person's height and can also be used to study the course of human evolution through changes in bone length. Currently, there are two main approaches to estimating height from bone length: the "bone length method" and the "proportional method." The bone length method measures the lengths of bones in various parts of the body and then calculates height using a specific formula. The proportional method measures the length ratios of various parts of the body and then calculates height using a specific formula. Although the formulas used in China can estimate height from bone lengths or length ratios, both methods have a certain degree of error. The two methods are compared below.

[0003] 1. Bone length method

[0004] (1) Difficult to measure

[0005] The bone length method requires measuring the bones in various parts of the human body, which is quite difficult and requires specialized measuring instruments and techniques.

[0006] (2) Affected by factors such as age, gender, and race

[0007] The bone length method formula is based on data from specific races, ages, and genders. Therefore, there will be errors when people of different races, ages, and genders use the same formula to estimate their height.

[0008] (3) Not suitable for obese people

[0009] The bone length method requires measuring bone length, but in obese people the measurement is distorted by the fat layer, so the method is not applicable to them.

[0010] 2. Proportional method

[0011] (1) The error is relatively large

[0012] The formula for the proportional method is based on statistical data, so it has a large error and is not suitable for individual measurements.

[0013] (2) Affected by factors such as age, gender, and race

[0014] The formula for the proportional method is also based on data from specific races, ages, and genders. Therefore, there will be errors when people of different races, ages, and genders use the same formula to estimate their height.

[0015] (3) Not suitable for obese people

[0016] The proportional method requires measuring the length proportions of various parts of the human body, but in obese people the measured proportions are distorted by the fat layer, so the method is not applicable to them.

[0017] Therefore, the aforementioned defects in existing methods for estimating height from bone measurements are technical problems that urgently need to be solved.

Summary of the Invention

[0018] This invention provides a method and system for measuring the ratio of long bones of human limbs to height based on images, aiming to solve the above-mentioned defects in existing bone measurement methods for estimating height.

[0019] One aspect of the present invention relates to a method for measuring the ratio of long bones of human limbs to height based on images, comprising the following steps:

[0020] Image sequence acquisition: Acquire image sequences containing the target human body;

[0021] Target frame selection: Select the target frame of the image sequence;

[0022] Two-dimensional human pose estimation: Using a preset two-dimensional pose estimation model, predict a two-dimensional pose sequence from the image sequence obtained after target frame selection;

[0023] Limb long bone to height ratio prediction: The two-dimensional human posture information predicted by the two-dimensional human posture estimation model is used as the input of the three-dimensional human posture estimation model to construct a human skeleton ratio prediction model and calculate the ratio of human limb long bones to height.

[0024] Potential anomaly labeling: Compare the ratio of limb long bones to height obtained from the model detection results with the estimated interval at the preset target confidence level to determine whether that ratio is abnormal, and label individuals with potential anomalies.

[0025] Update and Iteration: Obtain human experience data, and verify the model detection results and potential anomaly labeling results based on the human experience data. If there is a large difference between the model detection results and the human experience data, it indicates that the model accuracy is insufficient and needs to be updated. If the number of abnormal individuals in the potential anomaly labeling results exceeds the preset proportion threshold, it indicates that the weight setting of the linear relationship between the long bones of the limbs and height is incorrect.

[0026] Furthermore, in the two-dimensional human pose estimation step, the loss function and evaluation metric of the two-dimensional pose estimation model during training are consistent with HRNet, namely Mean Squared Error (MSE) and Object Keypoint Similarity (OKS):

[0027] $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - y_i\right)^2$$

[0028] where MSE is the mean squared error, n is the number of keypoints, $Y_i$ is the true value of the i-th keypoint, and $y_i$ is the predicted value of the i-th keypoint;

[0029] $$\mathrm{OKS} = \frac{\sum_i \exp\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right)\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

[0030] where OKS represents object keypoint similarity; $d_i$ is the Euclidean distance between the i-th predicted keypoint and its corresponding annotated keypoint; s is the standard deviation parameter; $k_i$ is a scaling factor representing the scale of the predicted keypoint; $v_i$ is the visibility of the keypoint, which is 1 if the keypoint is visible in the actual annotation and 0 otherwise; and δ() is an indicator function that takes the value 1 when the condition in parentheses is true and 0 otherwise.

[0031] Furthermore, in the step of predicting the ratio of long bones of limbs to height, when the human skeleton ratio prediction model is trained to reconstruct a complete 2D pose based on a damaged 2D pose, the head structure is a Transformer decoder, and the prediction result has the same tensor shape as the model input. A self-supervised training method using mask modeling is adopted, and the first loss function is defined as follows:

[0032] $$\mathcal{L}_{2D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\| x^{2D}_{t,j} - X^{2D}_{t,j} \right\|_2$$

[0033] where $\mathcal{L}_{2D}$ represents the first loss function; T represents the number of video frames predicted in each iteration; J represents the number of human joints; x and X represent the predicted pose and the labeled pose, respectively; the superscript 2D indicates that the pose includes only two-dimensional coordinates; $x^{2D}_{t,j}$ and $X^{2D}_{t,j}$ represent the two-dimensional coordinates of the j-th joint in frame t of the prediction and of the labeled data, respectively; and $\|\cdot\|_2$ represents the L2 norm.

[0034] Furthermore, in the step of predicting the ratio of long bones of limbs to height, when the human skeleton ratio prediction model is trained to predict three-dimensional pose based on two-dimensional pose, the head structure is a multi-layer neural network, and the last dimension of the tensor is reduced to three to represent three-dimensional coordinates. The second loss function is defined as follows:

[0035] $$\mathcal{L} = \lambda_{3D}\,\mathcal{L}_{3D} + \lambda_{V}\,\mathcal{L}_{V},\qquad \mathcal{L}_{3D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\| x^{3D}_{t,j} - X^{3D}_{t,j} \right\|_2,\qquad \mathcal{L}_{V} = \sum_{t=2}^{T}\sum_{j=1}^{J}\left\| v_{t,j} - V_{t,j} \right\|_2$$

[0036] $$v_t = x^{3D}_t - x^{3D}_{t-1},\qquad V_t = X^{3D}_t - X^{3D}_{t-1}$$

[0037] where $\mathcal{L}$ represents the second loss function; $\lambda_{3D}$ and $\lambda_{V}$ are constants used to balance the losses; $\mathcal{L}_{3D}$ and $\mathcal{L}_{V}$ represent the 3D pose loss value and the pose velocity loss value, respectively; x and X represent the predicted pose and the labeled pose, respectively; T represents the number of video frames in each prediction; J represents the number of human joints; the superscript 3D indicates that the pose includes three-dimensional coordinates; $x^{3D}_{t,j}$ and $X^{3D}_{t,j}$ represent the 3D coordinates of the j-th joint in frame t of the prediction and of the labeled data, respectively; v and V represent the pose velocity values calculated from the prediction and from the labeled data, respectively; $v_{t,j}$ and $V_{t,j}$ represent the pose velocity of the j-th joint in frame t of the prediction and of the labeled data, respectively; and $x^{3D}_t$ and $X^{3D}_t$ represent the three-dimensional coordinates of all joints in frame t of the prediction and of the labeled data, respectively.

[0038] Furthermore, in the step of predicting the ratio of long bones of limbs to height, the human skeleton ratio prediction model uses global pooling and a multi-layer neural network for the head structure during training. First, global average pooling is performed in the temporal dimension, and then the tensor dimension is reduced to eight through a multi-layer neural network to obtain the predicted length ratio of eight bones. The third loss function is defined as follows:

[0039]

[0040] where the left-hand side represents the third loss function; I is eight, the number of bone lengths to be predicted; b and B represent the predicted and labeled bone length proportions, respectively; and $b_i$ and $B_i$ represent the proportion of the i-th bone length in the prediction and in the labeled data, respectively.

[0041] Another aspect of the present invention relates to a system for measuring the ratio of long bones of human limbs to height based on images, comprising:

[0042] The image sequence acquisition module is used to acquire image sequences containing the target human body;

[0043] The target frame selection module is used to select the target frame of the image sequence;

[0044] The two-dimensional human pose estimation module is used to predict a two-dimensional pose sequence from the image sequence obtained after target frame selection, using the preset two-dimensional pose estimation model.

[0045] The limb long bone to height ratio prediction module is used to take the two-dimensional posture information predicted by the two-dimensional human posture estimation model as the input of the three-dimensional human posture estimation model, construct the human skeleton ratio prediction model, and calculate the ratio of human limb long bones to height.

[0046] The potential anomaly labeling module is used to compare the ratio of limb long bones to height obtained from the model detection results with the estimated interval at the preset target confidence level, to determine whether that ratio is abnormal, and to label individuals with potential anomalies.

[0047] The update and iteration module is used to acquire human experience data. Based on the human experience data, the model detection results and potential anomaly labeling results are verified simultaneously. If there is a large difference between the model detection results and the human experience data, it indicates that the model accuracy is insufficient and needs to be updated. If the number of abnormal individuals in the potential anomaly labeling results exceeds the preset proportion threshold, it indicates that the weight setting of the linear relationship between the long bones of the limbs and height is incorrect.

[0048] Furthermore, in the 2D human pose estimation module, the loss function and evaluation metrics of the 2D pose estimation model during training are consistent with HRNet, namely Mean Squared Error (MSE) and Object Keypoint Similarity (OKS):

[0049] $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - y_i\right)^2$$

[0050] where MSE is the mean squared error, n is the number of keypoints, $Y_i$ is the true value of the i-th keypoint, and $y_i$ is the predicted value of the i-th keypoint;

[0051] $$\mathrm{OKS} = \frac{\sum_i \exp\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right)\delta(v_i > 0)}{\sum_i \delta(v_i > 0)}$$

[0052] where OKS represents object keypoint similarity; $d_i$ is the Euclidean distance between the i-th predicted keypoint and its corresponding annotated keypoint; s is the standard deviation parameter; $k_i$ is a scaling factor representing the scale of the predicted keypoint; $v_i$ is the visibility of the keypoint, which is 1 if the keypoint is visible in the actual annotation and 0 otherwise; and δ() is an indicator function that takes the value 1 when the condition in parentheses is true and 0 otherwise.

[0053] Furthermore, in the limb long bone to height ratio prediction module, when the human skeleton ratio prediction model is trained to reconstruct a complete 2D pose based on a damaged 2D pose, the head structure is a Transformer decoder, and the prediction result has the same tensor shape as the model input. A self-supervised training method using mask modeling is adopted, and the first loss function is defined as follows:

[0054] $$\mathcal{L}_{2D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\| x^{2D}_{t,j} - X^{2D}_{t,j} \right\|_2$$

[0055] where $\mathcal{L}_{2D}$ represents the first loss function; T represents the number of video frames predicted in each iteration; J represents the number of human joints; x and X represent the predicted pose and the labeled pose, respectively; the superscript 2D indicates that the pose includes only two-dimensional coordinates; $x^{2D}_{t,j}$ and $X^{2D}_{t,j}$ represent the two-dimensional coordinates of the j-th joint in frame t of the prediction and of the labeled data, respectively; and $\|\cdot\|_2$ represents the L2 norm.

[0056] Furthermore, in the limb long bone to height ratio prediction module, when the human skeleton ratio prediction model is trained to predict three-dimensional pose based on two-dimensional pose, the head structure is a multi-layer neural network, reducing the last dimension of the tensor to three to represent three-dimensional coordinates. The second loss function is defined as follows:

[0057] $$\mathcal{L} = \lambda_{3D}\,\mathcal{L}_{3D} + \lambda_{V}\,\mathcal{L}_{V},\qquad \mathcal{L}_{3D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\| x^{3D}_{t,j} - X^{3D}_{t,j} \right\|_2,\qquad \mathcal{L}_{V} = \sum_{t=2}^{T}\sum_{j=1}^{J}\left\| v_{t,j} - V_{t,j} \right\|_2$$

[0058] $$v_t = x^{3D}_t - x^{3D}_{t-1},\qquad V_t = X^{3D}_t - X^{3D}_{t-1}$$

[0059] where $\mathcal{L}$ represents the second loss function; $\lambda_{3D}$ and $\lambda_{V}$ are constants used to balance the losses; $\mathcal{L}_{3D}$ and $\mathcal{L}_{V}$ represent the 3D pose loss value and the pose velocity loss value, respectively; x and X represent the predicted pose and the labeled pose, respectively; T represents the number of video frames in each prediction; J represents the number of human joints; the superscript 3D indicates that the pose includes three-dimensional coordinates; $x^{3D}_{t,j}$ and $X^{3D}_{t,j}$ represent the 3D coordinates of the j-th joint in frame t of the prediction and of the labeled data, respectively; v and V represent the pose velocity values calculated from the prediction and from the labeled data, respectively; $v_{t,j}$ and $V_{t,j}$ represent the pose velocity of the j-th joint in frame t of the prediction and of the labeled data, respectively; and $x^{3D}_t$ and $X^{3D}_t$ represent the three-dimensional coordinates of all joints in frame t of the prediction and of the labeled data, respectively.

[0060] Furthermore, in the limb long bone to height ratio prediction module, the human skeleton ratio prediction model uses global pooling and a multi-layer neural network for the head structure during training. First, global average pooling is performed in the temporal dimension, and then the tensor dimension is reduced to eight through a multi-layer neural network to predict the length ratio of eight bones. The third loss function is defined as follows:

[0061]

[0062] where the left-hand side represents the third loss function; I is eight, the number of bone lengths to be predicted; b and B represent the predicted and labeled bone length proportions, respectively; and $b_i$ and $B_i$ represent the proportion of the i-th bone length in the prediction and in the labeled data, respectively.

[0063] The beneficial effects achieved by this invention are as follows:

[0064] This invention provides a method and system for measuring the ratio of long bones in the limbs to height based on images. The method involves acquiring image sequences, selecting target frames, estimating two-dimensional human pose, predicting the ratio of long bones in the limbs to height, labeling potential anomalies, and iterative updates. This invention offers convenient measurement, high accuracy, and is unaffected by age, gender, or race; it is suitable for obese individuals and has good versatility.

Attached Figure Description

[0065] Figure 1 is a flowchart of the method for measuring the ratio of long bones of human limbs to height based on images provided by the present invention.

Detailed Implementation

[0066] To better understand the above technical solutions, the following will provide a detailed explanation of the technical solutions in conjunction with the accompanying drawings and specific implementation methods.

[0067] As shown in Figure 1, this invention proposes a method for measuring the ratio of long bones of human limbs to height based on images, including the following steps:

[0068] Step S100: Obtain image sequence: Obtain image sequence containing the target human body.

[0069] Obtain a video image sequence containing the target human body.

[0070] Step S200: Target frame selection: Select the target frame of the image sequence.

[0071] Because the target human body moves continuously within the frame, images in which the target is too far from or too close to the camera are detrimental to keypoint recognition. Furthermore, keypoint recognition is often accompanied by keypoint jitter, so extracting a single frame for keypoint localization and calculation is unreliable. Our solution first selects image frames in which the target human body region occupies a certain proportion (e.g., 30%–80%) of the entire image. These frames ensure that the human body is clear and largely free of overlap, avoiding the loss of prediction accuracy caused by the target being too far from or too close to the camera. The proportion of the human body region is calculated by dividing the area of the human bounding box obtained by the Person Detector in the 2D human pose estimation model by the area of the entire image. Only image frames of the same human body that meet the above condition are passed to the 2D human pose estimation model to predict 2D pose information, and its output is used as input to the 3D human pose estimation model to obtain the ratio of limb long bones to height.
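The frame selection rule described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Frame` container, the function names, and the bounding-box layout `(x_min, y_min, x_max, y_max)` are assumptions, and the 30% and 80% bounds are the example values given in the text.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Frame:
    """Hypothetical container for one video frame and its person bounding box."""
    index: int
    width: int    # image width in pixels
    height: int   # image height in pixels
    bbox: Tuple[float, float, float, float]  # assumed layout: (x_min, y_min, x_max, y_max)


def body_area_ratio(frame: Frame) -> float:
    """Area of the person bounding box divided by the area of the whole image."""
    x0, y0, x1, y1 = frame.bbox
    box_area = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    return box_area / float(frame.width * frame.height)


def select_target_frames(frames: List[Frame], low: float = 0.30, high: float = 0.80) -> List[Frame]:
    """Keep only frames whose body-area ratio lies within [low, high]."""
    return [f for f in frames if low <= body_area_ratio(f) <= high]


frames = [
    Frame(0, 100, 100, (0, 0, 50, 50)),    # ratio 0.25: target too far away
    Frame(1, 100, 100, (0, 0, 60, 80)),    # ratio 0.48: kept
    Frame(2, 100, 100, (0, 0, 100, 90)),   # ratio 0.90: target too close
]
selected = select_target_frames(frames)
```

In practice the bounding boxes would come from the person detector that precedes the 2D pose model; here they are hard-coded for illustration.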

[0072] Step S300, Two-dimensional human pose estimation: Using the preset two-dimensional pose estimation model, predict a two-dimensional pose sequence from the image sequence obtained after target frame selection.

[0073] The 2D (planar image) human pose estimation model takes RGB (Red, Green, Blue) images as input and predicts the keypoints of the human body in the image. The model outputs the two-dimensional coordinates of these keypoints, from which preliminary skeletal information can be obtained. The model is based on HRNet, a state-of-the-art top-down human pose estimation algorithm built on a CNN (Convolutional Neural Network), and uses the same human detector model as HRNet (FPN-DCN from Simple Baseline), which accurately obtains the keypoint coordinates of each person in the image. Two modifications improve the overall prediction speed and accuracy. First, redundant high-resolution branches are removed from HRNet, so that more computation can be allocated to mining high-level semantic information, improving model performance without increasing computational cost. Second, the prediction method is changed: the coordinate classification method proposed by SimCC is used instead of the heatmap method to obtain keypoint locations, reducing computational cost while maintaining performance.
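To make the coordinate classification idea concrete, the sketch below decodes keypoint positions from per-axis classification scores instead of a 2D heatmap. It is an illustration in the spirit of SimCC rather than the model used in the patent: the function name, the array shapes, and the `split_ratio` parameter (number of bins per pixel) are assumptions.

```python
import numpy as np


def decode_coordinate_classification(x_scores: np.ndarray,
                                     y_scores: np.ndarray,
                                     split_ratio: float = 2.0) -> np.ndarray:
    """Decode keypoints from per-axis classification scores.

    x_scores: (K, W * split_ratio) scores over horizontal bins for K keypoints.
    y_scores: (K, H * split_ratio) scores over vertical bins for K keypoints.
    Returns a (K, 2) array of (x, y) pixel coordinates.
    """
    x_bins = np.argmax(x_scores, axis=1)   # most likely horizontal bin per keypoint
    y_bins = np.argmax(y_scores, axis=1)   # most likely vertical bin per keypoint
    coords = np.stack([x_bins, y_bins], axis=1).astype(np.float64)
    return coords / split_ratio            # convert bin index back to pixels


# Two keypoints on a 4 x 3 pixel image with 2 bins per pixel (toy values).
x_scores = np.zeros((2, 8))
y_scores = np.zeros((2, 6))
x_scores[0, 4] = 1.0
y_scores[0, 2] = 1.0
x_scores[1, 6] = 1.0
y_scores[1, 5] = 1.0
keypoints = decode_coordinate_classification(x_scores, y_scores, split_ratio=2.0)
```

Each keypoint needs only two 1D score vectors rather than a full 2D map, which is where the reduction in computation comes from.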

[0074] The model's training data comes from mainstream open-source datasets in the field of human pose estimation. The number of keypoints output by the model is consistent with the labeled dataset. Taking the COCO dataset as an example, the output skeletal points include 17 locations: nose, left and right eyes, left and right ears, left and right wrists, left and right elbows, left and right shoulders, left and right ankles, left and right knees, and left and right hip joints. During training, the model's loss function and evaluation metrics are consistent with HRNet, namely Mean Squared Error (MSE) and Object Keypoint Similarity (OKS).

[0075] $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - y_i\right)^2 \qquad (1)$$

[0076] In formula (1), MSE is the mean squared error, n is the number of keypoints, $Y_i$ is the true value of the i-th keypoint, and $y_i$ is the predicted value of the i-th keypoint.

[0077] $$\mathrm{OKS} = \frac{\sum_i \exp\left(-\dfrac{d_i^2}{2 s^2 k_i^2}\right)\delta(v_i > 0)}{\sum_i \delta(v_i > 0)} \qquad (2)$$

[0078] In formula (2), OKS represents object keypoint similarity; $d_i$ is the Euclidean distance between the i-th predicted keypoint and its corresponding annotated keypoint; s is the standard deviation parameter; $k_i$ is a scaling factor representing the scale of the predicted keypoint; $v_i$ is the visibility of the keypoint, which is 1 if the keypoint is visible in the actual annotation and 0 otherwise; and δ() is an indicator function that takes the value 1 when the condition in parentheses is true and 0 otherwise.
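The two metrics named above can be computed as in the following sketch. It assumes the commonly used COCO-style form of OKS, in which each visible keypoint contributes exp(-d² / (2 s² k²)); the function names and argument layout are illustrative and not taken from the patent.

```python
import numpy as np


def mse(y_true, y_pred) -> float:
    """Mean squared error, averaged over all supplied values."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean((y_true - y_pred) ** 2))


def oks(pred, gt, visibility, s: float, k) -> float:
    """Object Keypoint Similarity for one person (assumed COCO-style form).

    pred, gt:   (K, 2) predicted and annotated keypoint coordinates.
    visibility: (K,) flags, > 0 when the keypoint is visible in the annotation.
    s:          scale parameter of the person.
    k:          (K,) per-keypoint scaling factors.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    k = np.asarray(k, dtype=np.float64)
    visible = np.asarray(visibility) > 0
    if not visible.any():
        return 0.0                                  # no visible keypoints to score
    d2 = np.sum((pred - gt) ** 2, axis=1)           # squared Euclidean distances
    similarity = np.exp(-d2 / (2.0 * s ** 2 * k ** 2))
    return float(similarity[visible].sum() / visible.sum())


gt = np.array([[0.0, 0.0], [10.0, 10.0]])
pred = np.array([[0.0, 0.0], [50.0, 50.0]])
# The second keypoint is far off but not visible, so it is ignored.
score = oks(pred, gt, visibility=[1, 0], s=10.0, k=[0.1, 0.1])
```

Only keypoints marked visible contribute to the score, which is the role of the indicator function δ in the definition.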

[0079] Step S400, prediction of the ratio of long bones of limbs to height: The two-dimensional posture information predicted by the two-dimensional human posture estimation model is used as the input of the three-dimensional human posture estimation model to construct a human skeleton ratio prediction model and calculate the ratio of long bones of human limbs to height.

[0080] The human skeleton proportion prediction model is adapted from the 3D (three-dimensional) human pose estimation model so that it predicts human skeleton proportions directly. A 3D human pose estimation model can predict 3D pose either from images or from 2D poses; this solution uses the latter. That is, the model input is the 2D coordinates of the human skeleton points, which are predicted by the 2D human pose estimation model, and the output is the 3D coordinates of each skeleton point. Compared with 2D pose, 3D pose reduces interference from external factors such as the image capture angle, so skeleton proportions can be calculated from it directly. Estimating 3D pose can therefore be regarded as a prerequisite for calculating the ratio of bone length to height, and on this basis the solution modifies the 3D human pose estimation model to obtain the human skeleton proportion prediction model. First, incorporating the calculation into the AI model improves the accuracy of the skeleton proportion results because the model can learn to eliminate various kinds of interference. Second, using skeleton proportions as the prediction target makes effective use of subsequent measurement data, allowing the model to be continuously updated and iterated.

[0081] The model will be built using the state-of-the-art Transformer-like structure in the field of 3D human pose estimation. It will learn pose-related features by performing self-attention operations in the temporal and spatial dimensions respectively, thereby predicting a relatively accurate 3D pose result. The structure of the 3D human pose estimation model is shown.
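As a rough illustration of self-attention applied separately along the spatial and temporal dimensions, the sketch below runs attention over the joints within each frame and then over the frames for each joint. It is heavily simplified and assumption-laden: there are no learned query, key, or value projections, no multi-head split, no normalization layers, and the spatial-then-temporal order is a choice made here, not something stated in the text.

```python
import numpy as np


def softmax(z: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax along one axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x: np.ndarray) -> np.ndarray:
    """Scaled dot-product self-attention over the second-to-last axis.

    x has shape (..., N, C); the N tokens attend to each other.
    Identity projections are used for brevity (an assumption).
    """
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(x.shape[-1])   # (..., N, N)
    return softmax(scores, axis=-1) @ x                          # (..., N, C)


def spatio_temporal_block(x: np.ndarray) -> np.ndarray:
    """x: (T, J, C) features for T frames and J joints."""
    x = x + self_attention(x)        # spatial: joints attend within each frame
    xt = np.swapaxes(x, 0, 1)        # (J, T, C)
    xt = xt + self_attention(xt)     # temporal: frames attend for each joint
    return np.swapaxes(xt, 0, 1)     # back to (T, J, C)


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 17, 8))   # 5 frames, 17 joints, 8 channels (toy sizes)
out = spatio_temporal_block(features)
```

The output keeps the input shape, so blocks of this kind can be stacked before a task-specific head is attached.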

[0082] The overall model training process follows a pre-training plus fine-tuning approach. Pre-training data comes from various publicly available datasets in the field, and the pre-training tasks are reconstructing a complete 2D pose from a damaged 2D pose and predicting a 3D pose from a 2D pose. The fine-tuning data comes from a self-constructed private dataset, and the fine-tuning task is bone proportion prediction. Each sample in this dataset pairs an input sequence of 2D coordinates of the skeletal joints with an output of the length proportions of eight bones: left and right radius and ulna, left and right humerus, left and right femur, and left and right tibia. Switching between tasks is achieved by changing the head structure.

[0083] In training task one (reconstructing a complete 2D pose from a damaged 2D pose), the head structure is a Transformer Decoder, and the prediction result has the same shape as the tensor input to the model. The model employs a self-supervised training method using Masked Modeling. The first loss function is defined as follows:

[0084] $$\mathcal{L}_{2D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\| x^{2D}_{t,j} - X^{2D}_{t,j} \right\|_2 \qquad (3)$$

[0085] In formula (3), $\mathcal{L}_{2D}$ represents the first loss function; T represents the number of video frames predicted in each iteration; J represents the number of human joints; x and X represent the predicted pose and the labeled pose, respectively; the superscript 2D indicates that the pose includes only two-dimensional coordinates; $x^{2D}_{t,j}$ and $X^{2D}_{t,j}$ represent the two-dimensional coordinates of the j-th joint in frame t of the prediction and of the labeled data, respectively; and $\|\cdot\|_2$ represents the L2 norm, which is the square root of the sum of the squares of all values within the symbol, i.e., when $A = (a_1, a_2, \ldots, a_n)$, $\|A\|_2 = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$.
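The masked reconstruction task can be sketched as below: some joints of a 2D pose sequence are corrupted, and the loss compares the reconstruction with the uncorrupted sequence using the L2 norm per joint. The masking strategy (zeroing randomly chosen joints), the `mask_ratio` parameter, and the plain sum over frames and joints are assumptions made for illustration.

```python
import numpy as np


def mask_joints(pose_2d: np.ndarray, mask_ratio: float, rng: np.random.Generator):
    """Corrupt a (T, J, 2) pose sequence by zeroing randomly chosen joints.

    Returns the corrupted copy and the boolean (T, J) mask of corrupted joints.
    """
    T, J, _ = pose_2d.shape
    mask = rng.random((T, J)) < mask_ratio
    corrupted = pose_2d.copy()
    corrupted[mask] = 0.0
    return corrupted, mask


def loss_2d(pred: np.ndarray, target: np.ndarray) -> float:
    """Sum over frames and joints of the L2 distance between 2D coordinates."""
    return float(np.linalg.norm(pred - target, axis=-1).sum())


rng = np.random.default_rng(0)
target = np.zeros((2, 3, 2))        # 2 frames, 3 joints (toy sizes)
pred = target.copy()
pred[0, 1] = [3.0, 4.0]             # one joint is off by a 3-4-5 triangle
reconstruction_loss = loss_2d(pred, target)   # equals 5.0
```

In training, `pred` would be the decoder output for the corrupted input and `target` the original, uncorrupted sequence, so no manual labels are needed.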

[0086] In training task two (predicting 3D pose based on 2D pose), the head structure is a multi-layer neural network, reducing the last dimension of the tensor to three to represent 3D coordinates. The second loss function is defined as follows:

[0087] $$\mathcal{L} = \lambda_{3D}\,\mathcal{L}_{3D} + \lambda_{V}\,\mathcal{L}_{V},\qquad \mathcal{L}_{3D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\| x^{3D}_{t,j} - X^{3D}_{t,j} \right\|_2,\qquad \mathcal{L}_{V} = \sum_{t=2}^{T}\sum_{j=1}^{J}\left\| v_{t,j} - V_{t,j} \right\|_2 \qquad (4)$$

[0088] $$v_t = x^{3D}_t - x^{3D}_{t-1},\qquad V_t = X^{3D}_t - X^{3D}_{t-1} \qquad (5)$$

[0089] In formulas (4) and (5), $\mathcal{L}$ represents the second loss function; $\lambda_{3D}$ and $\lambda_{V}$ are constants used to balance the losses; $\mathcal{L}_{3D}$ and $\mathcal{L}_{V}$ represent the 3D pose loss value and the pose velocity loss value, respectively; x and X represent the predicted pose and the labeled pose, respectively; T represents the number of video frames in each prediction; J represents the number of human joints; the superscript 3D indicates that the pose includes three-dimensional coordinates; $x^{3D}_{t,j}$ and $X^{3D}_{t,j}$ represent the 3D coordinates of the j-th joint in frame t of the prediction and of the labeled data, respectively; v and V represent the pose velocity values calculated from the prediction and from the labeled data, respectively; $v_{t,j}$ and $V_{t,j}$ represent the pose velocity of the j-th joint in frame t of the prediction and of the labeled data, respectively; and $x^{3D}_t$ and $X^{3D}_t$ represent the three-dimensional coordinates of all joints in frame t of the prediction and of the labeled data, respectively.
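A possible reading of the second loss is sketched below: a 3D position term plus a velocity term, each weighted by a balancing constant. The use of frame-to-frame differences as the pose velocity, the L2 norm per joint, and the default weights are assumptions for illustration; the source does not give numeric values for the constants.

```python
import numpy as np


def velocity(pose_3d: np.ndarray) -> np.ndarray:
    """Frame-to-frame difference of a (T, J, 3) pose sequence, shape (T-1, J, 3)."""
    return pose_3d[1:] - pose_3d[:-1]


def loss_3d_with_velocity(pred: np.ndarray, target: np.ndarray,
                          lambda_3d: float = 1.0, lambda_v: float = 1.0) -> float:
    """Weighted sum of a 3D position loss and a pose velocity loss."""
    position_term = np.linalg.norm(pred - target, axis=-1).sum()
    velocity_term = np.linalg.norm(velocity(pred) - velocity(target), axis=-1).sum()
    return float(lambda_3d * position_term + lambda_v * velocity_term)


target = np.zeros((3, 1, 3))          # 3 frames, 1 joint (toy sizes)
pred = target.copy()
pred[1, 0] = [0.0, 3.0, 4.0]          # the middle frame is off by distance 5
# Position term: 5. Velocity term: 5 (frame 0 to 1) + 5 (frame 1 to 2) = 10.
total = loss_3d_with_velocity(pred, target, lambda_3d=1.0, lambda_v=0.5)
```

The velocity term penalizes jitter: a single-frame error is counted once in the position term but twice in the velocity term, because it disturbs both neighboring frame differences.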

[0090] In training task three (skeletal proportion prediction), the model's head structure consists of global pooling and a multi-layer neural network. First, global average pooling is performed along the temporal dimension, then the tensor dimension is reduced to 8 using a multi-layer neural network, thus enabling the prediction of eight bone length proportions. The third loss function is defined as follows:

[0091]

[0092] In formula (6), the left-hand side represents the third loss function; I is eight, the number of bone lengths to be predicted; b and B represent the predicted and labeled bone length proportions, respectively; and $b_i$ and $B_i$ represent the proportion of the i-th bone length in the prediction and in the labeled data, respectively.
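The head described for training task three can be sketched as global average pooling over time followed by a small multi-layer network with eight outputs. Everything beyond that outline is an assumption: the flattening of joints and channels, the single hidden ReLU layer, the weight shapes, and in particular the use of a summed absolute difference as the loss, since the exact form of formula (6) is not reproduced in the text.

```python
import numpy as np


def bone_ratio_head(features: np.ndarray, w1: np.ndarray, b1: np.ndarray,
                    w2: np.ndarray, b2: np.ndarray) -> np.ndarray:
    """Map (T, J, C) backbone features to eight predicted bone length proportions."""
    pooled = features.mean(axis=0)              # global average pooling over time: (J, C)
    flat = pooled.reshape(-1)                   # (J * C,)
    hidden = np.maximum(0.0, flat @ w1 + b1)    # hidden layer with ReLU
    return hidden @ w2 + b2                     # (8,)


def loss_bone(pred: np.ndarray, target: np.ndarray) -> float:
    """Assumed loss: sum of absolute differences over the eight proportions."""
    return float(np.abs(pred - target).sum())


rng = np.random.default_rng(0)
T, J, C, H = 4, 17, 3, 16                       # toy sizes
features = rng.normal(size=(T, J, C))
w1, b1 = rng.normal(size=(J * C, H)) * 0.1, np.zeros(H)
w2, b2 = rng.normal(size=(H, 8)) * 0.1, np.zeros(8)
ratios = bone_ratio_head(features, w1, b1, w2, b2)
```

Because the pooling removes the temporal axis, the head produces one set of eight proportions per sequence regardless of how many frames were selected.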

[0093] Step S500, Potential Anomaly Labeling: Compare the ratio of long bones of human limbs to height calculated by the model detection results with the estimated interval under the preset target confidence level to determine whether the ratio of long bones of human limbs to height calculated by the model detection results is abnormal, and label individuals with potential anomalies.

[0094] According to relevant research, the limb long bones and height exhibit a relatively clear linear relationship. Therefore, appropriate weights are assigned based on that research, namely the linear regression coefficient and constant term relating each long bone to height. Then, using standard statistical methods, the estimated interval for the limb long bones at the target confidence level (e.g., 95%) can be calculated for a given actual height, which gives the estimated interval of the ratio of limb long bones to height at that confidence level. When the linear regression parameters are sufficiently accurate, this interval can be used to judge whether the ratio of limb long bones to height predicted by the model is normal (within the interval) or abnormal (outside the interval).

[0095] The purpose of the potential anomaly test is to identify abnormal individuals. When the proportion of abnormal values in a predicted population exceeds a certain threshold (e.g., 20%), the parameters of the test itself are considered problematic. In other words, individuals judged as abnormal are treated as potentially abnormal only when the proportion of flagged individuals is below the threshold.
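The anomaly labeling logic described above can be sketched as follows. The sketch assumes a linear model bone length = slope × height + intercept with normally distributed residuals of known standard deviation, and converts the resulting interval for bone length into an interval for the ratio by dividing by height. The function names and the `residual_sd` parameter are hypothetical; the 95% confidence level and the 20% threshold are the example values from the text.

```python
from statistics import NormalDist
from typing import List, Sequence, Tuple


def ratio_interval(height: float, slope: float, intercept: float,
                   residual_sd: float, confidence: float = 0.95) -> Tuple[float, float]:
    """Interval for bone-length / height at the given confidence level."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)   # two-sided normal quantile
    center = slope * height + intercept                # expected bone length
    return ((center - z * residual_sd) / height,
            (center + z * residual_sd) / height)


def label_anomalies(ratios: Sequence[float], heights: Sequence[float],
                    slope: float, intercept: float, residual_sd: float,
                    confidence: float = 0.95) -> List[bool]:
    """True where a predicted ratio falls outside the interval for that height."""
    flags = []
    for ratio, height in zip(ratios, heights):
        low, high = ratio_interval(height, slope, intercept, residual_sd, confidence)
        flags.append(not (low <= ratio <= high))
    return flags


def parameters_suspect(flags: Sequence[bool], max_abnormal_fraction: float = 0.20) -> bool:
    """True when so many individuals are flagged that the weights are in doubt."""
    if not flags:
        return False
    return sum(flags) / len(flags) > max_abnormal_fraction


# Hypothetical weights and predictions, for illustration only.
heights = [160.0] * 6
ratios = [0.25, 0.30, 0.25, 0.25, 0.25, 0.25]
flags = label_anomalies(ratios, heights, slope=0.25, intercept=0.0, residual_sd=1.0)
suspect = parameters_suspect(flags)
```

Here one of six individuals is flagged, which is below the 20% threshold, so the flagged individual is treated as a potential anomaly rather than as evidence that the weights are wrong.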

[0096] Step S600, Update and Iteration: Obtain human experience data, and verify the model detection results and potential anomaly labeling results based on the human experience data. If there is a large difference between the model detection results and the human experience data, it indicates that the model accuracy is insufficient and needs to be updated. If the number of abnormal individuals in the potential anomaly labeling results exceeds the preset proportion threshold, it indicates that the weight setting of the linear relationship between the long bones of the limbs and height is incorrect.

[0097] The physical examination data are used to validate both the AI model's predictions and the potential anomaly labeling results. For the former, a significant discrepancy between the model's detection results and the physical examination data indicates insufficient model accuracy, and the model needs to be updated. For the latter, potential anomaly labeling is applied to the physical examination data; an anomaly rate exceeding the threshold indicates that the weights used for anomaly assessment need to be updated. This is because abnormal individuals typically constitute a small percentage of the population; when the rate exceeds a certain threshold, it usually indicates a problem with the evaluation method.

[0098] For iterative updates to the model, the ratio of limb long bones to height can be calculated based on physical examination data. The annotations of the limb long bone to height ratios in human images entered in the previous period are modified to match the ratios calculated from the current physical examination data, and the model is then retrained for prediction of the next period. When the next physical examination data is entered, the annotations of the limb long bone to height ratios in human images from the period between the two data entries are modified again, and the model is retrained once more. This iterative modification process ensures that the model can consistently maintain a high accuracy rate in predicting the current state of the population.

[0099] For iterative updates of anomaly assessment weights, based on physical examination data and the linear relationship between limb long bones and height, an estimated range of the ratio of limb long bones to height at a target confidence level (e.g., 95%) can be obtained for a given height. The estimated range is obtained based on the height from the physical examination and compared with the predicted ratio of limb long bones to height over a previous period. A ratio falling within the range is considered normal; otherwise, it is considered abnormal. If the proportion of abnormal individuals within this period exceeds a threshold (e.g., 20%), the weights of the linear relationship between limb long bones and height are considered incorrect and need to be reset; otherwise, the weights are considered correctly set. Generally, the probability of weight errors is high when using this method for the first time because the relationship between limb long bones and height varies among individuals from different regions. The iterative update of weights is based on physical examination data, obtaining limb long bone and height data for all individuals, performing a linear fit on the relationship between each bone length and height, and updating the corresponding weight relationship based on the coefficients and constant terms of the obtained linear relationship. Each time physical examination data is obtained, it is determined whether an update is needed based on the abnormality rate. The weight linearity is continuously updated in this way to ensure that the weight relationship of abnormality judgment always conforms to the population in the current state.
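The weight update described above can be sketched as an ordinary least-squares refit that is triggered only when the abnormality rate exceeds the threshold. This is a hedged illustration: the function names, the `(slope, intercept)` representation of the weights, and the choice of ordinary least squares are assumptions, since the text only specifies "a linear fit".

```python
def fit_line(heights, bone_lengths):
    """Least-squares fit of bone_length ~ slope * height + intercept."""
    n = len(heights)
    mean_h = sum(heights) / n
    mean_b = sum(bone_lengths) / n
    sxx = sum((h - mean_h) ** 2 for h in heights)
    sxy = sum((h - mean_h) * (b - mean_b)
              for h, b in zip(heights, bone_lengths))
    slope = sxy / sxx
    return slope, mean_b - slope * mean_h


def maybe_update_weights(weights, abnormal_fraction, heights, bone_lengths,
                         threshold=0.20):
    """Keep the current (slope, intercept) unless the abnormal fraction
    exceeds the threshold; otherwise refit on the physical examination data."""
    if abnormal_fraction <= threshold:
        return weights
    return fit_line(heights, bone_lengths)
```

In practice one such pair of weights would be kept per long bone, and the check would run each time a new batch of physical examination data arrives.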

[0100] The image-based method for measuring the ratio of long bones in human limbs to height provided in this embodiment, compared with existing technologies, employs steps such as acquiring image sequences, selecting target frames, estimating two-dimensional human pose, predicting the ratio of long bones in limbs to height, labeling potential anomalies, and iterative updates. This embodiment offers convenient measurement, high measurement accuracy, and is unaffected by factors such as age, gender, or race; it is suitable for obese individuals and has good versatility.

[0101] This invention also relates to a system for measuring the ratio of long bones of human limbs to height based on images, including an image sequence acquisition module, a target frame selection module, a two-dimensional human pose estimation module, a limb long bone to height ratio prediction module, a potential anomaly labeling module, and an update iteration module. The image sequence acquisition module is used to acquire an image sequence containing a target human body; the target frame selection module is used to select target frames from the image sequence; the two-dimensional human pose estimation module is used to predict a two-dimensional pose sequence from the image sequence after selecting the target frames according to a preset two-dimensional pose estimation model; the limb long bone to height ratio prediction module is used to use the two-dimensional pose information predicted by the two-dimensional human pose estimation model as input to a three-dimensional human pose estimation model to construct a human skeleton ratio prediction model. The system calculates the ratio of long bones in the limbs to height. A potential anomaly labeling module compares the ratio calculated by the model with the estimated interval at a preset target confidence level to determine if the ratio is abnormal and labels potentially abnormal individuals. An update and iteration module acquires human experience data and verifies both the model detection results and the potential anomaly labeling results based on this data. If there is a significant difference between the model detection results and the human experience data, the model accuracy is insufficient and needs updating. If the number of abnormal individuals in the potential anomaly labeling results exceeds a preset ratio threshold, the weighting of the linear relationship between long bones in the limbs and height is incorrect.

[0102] The image sequence acquisition module acquires video image sequences containing the target human body.

[0103] In the target frame selection module, the target human body is constantly moving in the frame, and frames in which it is too far from or too close to the camera are not conducive to keypoint recognition. Furthermore, the model's recognition is often accompanied by keypoint jitter, so computing keypoint positions from a single selected frame is unreliable. This solution therefore first selects image frames in which the target human body region occupies a certain proportion range (e.g., 30%–80%) of the entire image. In these frames, the human body is clear and mostly non-overlapping, which avoids the loss of prediction accuracy caused by the target human body being too far from or too close to the camera. The proportion of the human body region is calculated by dividing the area of the human body bounding box obtained by the Person Detector in the 2D human pose estimation model by the area of the entire image. Only image frames of the same human body that meet the above conditions are passed to the 2D human pose estimation model to predict 2D pose information, and its output is used as input to the 3D human pose estimation model to obtain the ratio of limb long bones to height.
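The area-ratio filter described above can be sketched as follows. The bounding-box format `(x_min, y_min, x_max, y_max)` and the function name are assumptions made for illustration; the 30% and 80% defaults are the example values given in the text.

```python
def select_target_frames(boxes, image_width, image_height,
                         min_ratio=0.30, max_ratio=0.80):
    """Return indices of frames whose person box covers an acceptable
    share of the image.

    boxes: one entry per frame, either (x_min, y_min, x_max, y_max) in
    pixels or None when no person was detected in that frame.
    """
    image_area = image_width * image_height
    kept = []
    for index, box in enumerate(boxes):
        if box is None:
            continue
        x_min, y_min, x_max, y_max = box
        box_area = max(0, x_max - x_min) * max(0, y_max - y_min)
        if min_ratio <= box_area / image_area <= max_ratio:
            kept.append(index)
    return kept
```

Frames rejected here are simply skipped; the retained frames form the sequence passed on to 2D pose estimation.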

[0104] In the 2D human pose estimation module, the 2D (planar graphic) human pose estimation model uses RGB (Red, Green, Blue) images as input to predict keypoints of the human body in the image. The model outputs the 2D coordinates of these keypoints, from which preliminary skeletal information can be obtained. The model uses the state-of-the-art top-down human pose estimation algorithm HRNet based on CNN (Convolutional Neural Network), employing the same human detector model as HRNet (FPN-DCN from Simple Baseline), which accurately obtains the keypoint coordinates of each person in the image. By removing redundant high-resolution branches from HRNet and changing the model's prediction method, the overall prediction speed and accuracy are improved. Removing redundant high-resolution branches allows more computation to be allocated to mining high-level semantic information, improving model performance while maintaining computational resources. Changing the prediction method involves using the coordinate classification method proposed by SimCC instead of the heatmap method to obtain keypoint locations, reducing computational resources while maintaining performance.
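As a rough illustration of coordinate classification in the style of SimCC, each keypoint is represented by two one-dimensional score vectors (one over horizontal bins, one over vertical bins), and the coordinate is read off as the index of the highest-scoring bin divided by the bin split ratio. The sketch below covers decoding only; the function name, list-based inputs, and default split ratio are assumptions, not the patented model's interface.

```python
def decode_simcc(x_scores_per_keypoint, y_scores_per_keypoint, split_ratio=2.0):
    """Decode (x, y) pixel coordinates from per-axis bin scores.

    Each axis is divided into (image_size * split_ratio) bins, so the
    winning bin index is divided by split_ratio to return to pixel units.
    """
    coords = []
    for x_scores, y_scores in zip(x_scores_per_keypoint, y_scores_per_keypoint):
        x_bin = max(range(len(x_scores)), key=x_scores.__getitem__)
        y_bin = max(range(len(y_scores)), key=y_scores.__getitem__)
        coords.append((x_bin / split_ratio, y_bin / split_ratio))
    return coords
```

Compared with a heatmap, only two vectors per keypoint are produced instead of a full 2D map, which is where the saving in computation comes from.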

[0105] The model's training data comes from mainstream open-source datasets in the field of human pose estimation. The number of keypoints output by the model is consistent with the labeled dataset. Taking the COCO dataset as an example, the output skeletal points include 17 locations: nose, left and right eyes, left and right ears, left and right wrists, left and right elbows, left and right shoulders, left and right ankles, left and right knees, and left and right hip joints. During training, the model's loss function and evaluation metrics are consistent with HRNet, namely Mean Squared Error (MSE) and Object Keypoint Similarity (OKS).

[0106] $$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2 \tag{7}$$

[0107] In formula (7), MSE is the mean squared error, n is the number of keypoints, $Y_i$ is the true value of the i-th keypoint, and $\hat{Y}_i$ is the predicted value of the i-th keypoint.

[0108] $$\mathrm{OKS} = \frac{\sum_{i}\exp\left(-\dfrac{d_i^{2}}{2 s^{2} k_i^{2}}\right)\delta(v_i > 0)}{\sum_{i}\delta(v_i > 0)} \tag{8}$$

[0109] In formula (8), OKS represents the object keypoint similarity; $d_i$ is the Euclidean distance between the i-th predicted keypoint and its nearest-neighbor keypoint; s is the standard deviation parameter; $k_i$ is a scaling factor representing the scale of the predicted keypoint; $v_i$ is the visibility of the keypoint, equal to 1 if the keypoint is visible in the actual annotation and 0 otherwise; and δ(·) is an indicator function that takes the value 1 when the condition in parentheses is true and 0 otherwise.
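The two quantities can be computed as in the sketch below, which follows the commonly used definitions of mean squared error and object keypoint similarity. The function names and the list-based inputs are illustrative assumptions; `distances` is taken to be precomputed per keypoint.

```python
import math


def mse(true_values, predicted_values):
    """Mean of squared differences over n keypoint values."""
    n = len(true_values)
    return sum((t - p) ** 2 for t, p in zip(true_values, predicted_values)) / n


def oks(distances, s, k, visibility):
    """Object keypoint similarity averaged over visible keypoints.

    distances[i]: Euclidean distance d_i for keypoint i
    s: scalar parameter shared by all keypoints
    k[i]: per-keypoint scaling factor k_i
    visibility[i]: 1 if keypoint i is visible in the annotation, else 0
    """
    numerator = 0.0
    visible = 0
    for d_i, k_i, v_i in zip(distances, k, visibility):
        if v_i > 0:
            numerator += math.exp(-d_i ** 2 / (2 * s ** 2 * k_i ** 2))
            visible += 1
    return numerator / visible if visible else 0.0
```

A perfect prediction gives an OKS of 1, and keypoints marked invisible do not contribute to either sum.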

[0110] In the limb long bone to height ratio prediction module, the human skeleton ratio prediction model is obtained by modifying the 3D (three-dimensional) human pose estimation model so that it directly predicts human skeleton ratios. A 3D human pose estimation model can predict 3D pose either from images or from 2D poses; here, the latter is used. That is, the model input is the 2D coordinates of the human skeleton points, as predicted by the 2D human pose estimation model, and the output is the 3D coordinates of each skeleton point. Compared to 2D pose, 3D pose reduces interference from external factors such as the image capture angle, thus allowing direct calculation of skeleton ratios. Estimating 3D pose can be considered a prerequisite for calculating the bone length to height ratio; on this basis, this solution modifies the 3D human pose estimation model to obtain the human skeleton ratio prediction model. First, incorporating the ratio calculation into the AI model improves the accuracy of the skeleton ratio results, because the model can learn to eliminate various sources of interference. Second, using skeleton ratios as the prediction target allows subsequent measurement data to be used effectively, which supports the model's updates and iterations.

[0111] The model will be built using the state-of-the-art Transformer-like structure in the field of 3D human pose estimation. It will learn pose-related features by performing self-attention operations in the temporal and spatial dimensions respectively, thereby predicting a relatively accurate 3D pose result. The structure of the 3D human pose estimation model is shown.

[0112] The overall model training process employs a pre-training + fine-tuning approach. Pre-training data comes from various publicly available datasets in the field, and the pre-training tasks include reconstructing a complete 2D pose from a damaged 2D pose and predicting a 3D pose based on a 2D pose. The fine-tuning dataset comes from a self-constructed private dataset, and the training task is bone proportion prediction. The dataset takes as input a sequence of 2D coordinates of bone joints and outputs the length proportions of eight bones: left and right radius and ulna, left and right humerus, left and right femur, and left and right tibia. Switching between different tasks is achieved by changing the head structure.

[0113] In training task one (reconstructing a complete 2D pose from a damaged 2D pose), the head structure is a Transformer Decoder, and the prediction result has the same shape as the tensor input to the model. The model employs a self-supervised training method using Masked Modeling. The first loss function is defined as follows:

[0114] $$\mathcal{L}_{2D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\lVert x_{t,j}^{2D} - X_{t,j}^{2D}\right\rVert_2 \tag{9}$$

[0115] In formula (9), $\mathcal{L}_{2D}$ represents the first loss function; T represents the number of video frames predicted in each iteration; J represents the number of human joints; x and X represent the predicted pose and the labeled pose, respectively; the superscript 2D indicates that the pose only includes two-dimensional coordinates; and $x_{t,j}^{2D}$ and $X_{t,j}^{2D}$ represent the two-dimensional coordinates of the j-th keypoint in the t-th frame of the prediction result and of the labeled data, respectively. $\lVert\cdot\rVert_2$ represents the L2 norm, which is the square root of the sum of the squares of all values within the symbol, i.e., for $A = (a_1, a_2, \ldots, a_n)$, $\lVert A\rVert_2 = \sqrt{a_1^2 + a_2^2 + \cdots + a_n^2}$.
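A minimal sketch of the masked reconstruction objective follows, assuming poses are stored as `pose[t][j] = (x, y)` and that a "damaged" pose is produced by overwriting selected joints with a fill value. The masking scheme, fill value, and function names are assumptions for illustration; the loss sums the Euclidean distance between predicted and labeled keypoints over all frames and joints.

```python
import math


def mask_joints(pose, masked, fill=(0.0, 0.0)):
    """Return a damaged copy of pose with the (t, j) pairs in `masked`
    replaced by `fill`."""
    return [[fill if (t, j) in masked else joint
             for j, joint in enumerate(frame)]
            for t, frame in enumerate(pose)]


def pose_reconstruction_loss(predicted, labelled):
    """Sum over frames t and joints j of the L2 distance between the
    predicted and labelled coordinates."""
    total = 0.0
    for predicted_frame, labelled_frame in zip(predicted, labelled):
        for p, q in zip(predicted_frame, labelled_frame):
            total += math.dist(p, q)
    return total
```

During self-supervised training the model would receive the damaged pose and be penalised by this loss against the undamaged original, so no manual labels are needed.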

[0116] In training task two (predicting 3D pose based on 2D pose), the head structure is a multi-layer neural network, reducing the last dimension of the tensor to three to represent 3D coordinates. The second loss function is defined as follows:

[0117] $$\mathcal{L} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{V}\mathcal{L}_{V} = \lambda_{3D}\sum_{t=1}^{T}\sum_{j=1}^{J}\left\lVert x_{t,j}^{3D} - X_{t,j}^{3D}\right\rVert_2 + \lambda_{V}\sum_{t=2}^{T}\sum_{j=1}^{J}\left\lVert v_{t,j} - V_{t,j}\right\rVert_2 \tag{10}$$

[0118] $$v_t = x_t^{3D} - x_{t-1}^{3D}, \qquad V_t = X_t^{3D} - X_{t-1}^{3D} \tag{11}$$

[0119] In formulas (10) and (11), $\mathcal{L}$ represents the second loss function; $\lambda_{3D}$ and $\lambda_{V}$ are constants used to balance the losses; $\mathcal{L}_{3D}$ and $\mathcal{L}_{V}$ represent the 3D pose loss value and the pose velocity loss value, respectively; x and X represent the predicted pose and the labeled pose, respectively; T represents the number of video frames for each prediction; J represents the number of human joints; and the superscript 3D indicates that the pose includes three-dimensional coordinates. $x_{t,j}^{3D}$ and $X_{t,j}^{3D}$ represent the 3D coordinates of the j-th joint in the t-th frame of the prediction result and of the labeled data, respectively; v and V represent the pose velocity values calculated from the prediction result and from the labeled data, respectively; $v_{t,j}$ and $V_{t,j}$ represent the pose velocity values of the j-th joint in the t-th frame of the prediction result and of the labeled data, respectively; and $x_t^{3D}$ and $X_t^{3D}$ represent the 3D coordinates of all joints in the t-th frame of the prediction result and of the labeled data, respectively.
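One plausible reading of the second loss is sketched below: a position term that sums per-joint Euclidean distances, plus a velocity term computed on frame-to-frame differences, each scaled by its balancing constant. Defining velocity as the difference between consecutive frames, as well as all names and default values, is an assumption made for this illustration.

```python
import math


def velocities(pose):
    """Frame-to-frame differences; entry [t][j] is pose[t + 1][j] - pose[t][j]."""
    return [[tuple(b - a for a, b in zip(previous_joint, joint))
             for previous_joint, joint in zip(previous_frame, frame)]
            for previous_frame, frame in zip(pose, pose[1:])]


def summed_distance(a, b):
    """Sum of L2 distances over all frames and joints."""
    return sum(math.dist(p, q)
               for frame_a, frame_b in zip(a, b)
               for p, q in zip(frame_a, frame_b))


def second_loss(predicted, labelled, lambda_3d=1.0, lambda_v=1.0):
    """Weighted sum of the 3D pose loss and the pose velocity loss."""
    loss_3d = summed_distance(predicted, labelled)
    loss_v = summed_distance(velocities(predicted), velocities(labelled))
    return lambda_3d * loss_3d + lambda_v * loss_v
```

The velocity term penalises jitter between frames even when individual frame errors are small, which is why it is weighted separately.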

[0120] In training task three (skeletal proportion prediction), the model's head structure consists of global pooling and a multi-layer neural network. First, global average pooling is performed along the temporal dimension, then the tensor dimension is reduced to 8 using a multi-layer neural network, thus enabling the prediction of eight bone length proportions. The third loss function is defined as follows:

[0121] $$\mathcal{L}_{B} = \sum_{i=1}^{I}\left\lVert b_i - B_i\right\rVert_2 \tag{12}$$

[0122] In formula (12), $\mathcal{L}_{B}$ represents the third loss function; I is eight, indicating the eight bone lengths to be predicted; b and B represent the predicted bone length proportions and the labeled bone length proportions, respectively; and $b_i$ and $B_i$ represent the proportion of the i-th bone length in the prediction result and in the labeled data, respectively.
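The head for the third task can be sketched as temporal average pooling followed by a projection to eight outputs. For brevity the "multi-layer neural network" is reduced here to a single linear layer, per-frame features are treated as flat vectors, and the loss uses the absolute difference per bone; these simplifications and all names are assumptions, not the patented architecture.

```python
def temporal_average(features):
    """features[t] is the feature vector of frame t; average over t."""
    frames = len(features)
    width = len(features[0])
    return [sum(frame[c] for frame in features) / frames for c in range(width)]


def linear(vector, weights, bias):
    """One dense layer: output[i] = weights[i] . vector + bias[i]."""
    return [sum(w * x for w, x in zip(row, vector)) + b
            for row, b in zip(weights, bias)]


def bone_ratio_loss(predicted, labelled):
    """Sum over bones of |b_i - B_i| (assumed distance for scalar ratios)."""
    return sum(abs(b - big_b) for b, big_b in zip(predicted, labelled))
```

With a weight matrix of eight rows, `linear(temporal_average(features), weights, bias)` yields the eight bone length proportions that are compared against the labeled proportions.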

[0123] In the potential anomaly labeling module, relevant research indicates a relatively clear linear relationship between height and the lengths of the long bones of the limbs. Therefore, appropriate weights are assigned based on the results of that research, namely the linear regression coefficient and constant term relating each long bone to height. Then, using standard statistical methods, the estimated interval of each long bone length at the target confidence level (e.g., 95%) can be calculated for the actual height, which yields the estimated interval of the ratio of the long bones of the limbs to height at that confidence level. When the linear regression parameters are sufficiently accurate, this interval can be used to determine whether the ratio of the long bones of the limbs to height predicted by the model is normal (within the confidence interval) or abnormal (outside the confidence interval).

[0124] The purpose of the potential anomaly test is to identify abnormal individuals. When the proportion of abnormal values in a predicted population is greater than a certain threshold (e.g., 20%), the parameters of the test themselves are considered to be problematic. That is, individuals judged as abnormal are treated as potentially abnormal only when the proportion of abnormal values in the population is less than the threshold.

[0125] In the update and iteration module, the AI model's predictions and the potential anomaly labeling results are validated simultaneously against the physical examination data. For the former, a significant difference between the model's detection results and the physical examination data indicates insufficient model accuracy, and the model needs to be updated. For the latter, potential anomaly labeling is performed on the physical examination data; when the anomaly ratio exceeds a threshold, it indicates that the weights for potential anomaly assessment need updating. This is because abnormal individuals are usually a small minority; when the ratio exceeds a certain level, it typically indicates a problem with the evaluation method.

[0126] For iterative updates to the model, the ratio of limb long bones to height can be calculated based on physical examination data. The annotations of the limb long bone to height ratios in human images entered in the previous period are modified to match the ratios calculated from the current physical examination data, and the model is then retrained for prediction of the next period. When the next physical examination data is entered, the annotations of the limb long bone to height ratios in human images from the period between the two data entries are modified again, and the model is retrained once more. This iterative modification process ensures that the model can consistently maintain a high accuracy rate in predicting the current state of the population.

[0127] For iterative updates of anomaly assessment weights, based on physical examination data and the linear relationship between limb long bones and height, an estimated range of the ratio of limb long bones to height at a target confidence level (e.g., 95%) can be obtained for a given height. The estimated range is obtained based on the height from the physical examination and compared with the predicted ratio of limb long bones to height over a previous period. A ratio falling within the range is considered normal; otherwise, it is considered abnormal. If the proportion of abnormal individuals within this period exceeds a threshold (e.g., 20%), the weights of the linear relationship between limb long bones and height are considered incorrect and need to be reset; otherwise, the weights are considered correctly set. Generally, the probability of weight errors is high when using this method for the first time because the relationship between limb long bones and height varies among individuals from different regions. The iterative update of weights is based on physical examination data, obtaining limb long bone and height data for all individuals, performing a linear fit on the relationship between each bone length and height, and updating the corresponding weight relationship based on the coefficients and constant terms of the obtained linear relationship. Each time physical examination data is obtained, it is determined whether an update is needed based on the abnormality rate. The weight linearity is continuously updated in this way to ensure that the weight relationship of abnormality judgment always conforms to the population in the current state.

[0128] The image-based system for measuring the ratio of long bones in human limbs to height provided in this embodiment, compared with existing technologies, employs an image sequence acquisition module, a target frame selection module, a two-dimensional human pose estimation module, a limb long bone to height ratio prediction module, a potential anomaly labeling module, and an update iteration module. The image-based system for measuring the ratio of long bones in human limbs to height provided in this embodiment is convenient to use, has high measurement accuracy, and is unaffected by factors such as age, gender, or race; it is suitable for obese individuals and has good versatility.

[0129] Although preferred embodiments of the invention have been described, those skilled in the art, upon learning the basic inventive concept, can make other changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including both the preferred embodiments and all changes and modifications falling within the scope of the invention. Clearly, those skilled in the art can make various alterations and variations to the invention without departing from its spirit and scope. Thus, if these modifications and variations of the invention fall within the scope of the claims and their equivalents, the invention is also intended to include them.

Claims

1. A method for measuring the ratio of long bones of human limbs to height based on images, characterized in that it includes the following steps:

Image sequence acquisition: acquire an image sequence containing the target human body;

Target frame selection: select target frames from the image sequence;

Two-dimensional human pose estimation: based on a preset two-dimensional human pose estimation model, predict a two-dimensional pose sequence from the image sequence after target frame selection;

Limb long bone to height ratio prediction: use the two-dimensional pose information predicted by the two-dimensional human pose estimation model as the input of a three-dimensional human pose estimation model, construct a human skeleton ratio prediction model, and calculate the ratio of human limb long bones to height;

Potential anomaly labeling: compare the ratio of long bones of human limbs to height calculated from the model detection results with the estimated interval at the preset target confidence level to determine whether that ratio is abnormal, and label individuals with potential anomalies;

Update and iteration: obtain human experience data, and verify the model detection results and the potential anomaly labeling results simultaneously based on the human experience data; if there is a large difference between the model detection results and the human experience data, it indicates that the model accuracy is insufficient and the model needs to be updated; if the number of abnormal individuals in the potential anomaly labeling results exceeds the preset proportion threshold, it indicates that the weighting of the linear relationship between the long bones of the limbs and height is incorrect;

In the step of predicting the ratio of long bones of limbs to height, when the human skeleton ratio prediction model is trained to reconstruct a complete two-dimensional pose from a damaged two-dimensional pose, the head structure is a Transformer decoder, and the prediction result has the same shape as the tensor input to the model; a self-supervised training method using mask modeling is adopted, and the first loss function is defined as follows:

$$\mathcal{L}_{2D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\lVert x_{t,j}^{2D} - X_{t,j}^{2D}\right\rVert_2$$

where $\mathcal{L}_{2D}$ represents the first loss function; T represents the number of video frames predicted in each iteration; J represents the number of human joints; x and X represent the predicted pose and the labeled pose, respectively; the superscript 2D indicates that the pose only includes two-dimensional coordinates; $x_{t,j}^{2D}$ and $X_{t,j}^{2D}$ represent the two-dimensional coordinates of the j-th keypoint in the t-th frame of the prediction result and of the labeled data, respectively; and $\lVert\cdot\rVert_2$ represents the L2 norm;

In the step of predicting the ratio of long bones of limbs to height, when the human skeleton ratio prediction model is trained to predict three-dimensional pose from two-dimensional pose, the head structure is a multi-layer neural network, and the last dimension of the tensor is reduced to three to represent three-dimensional coordinates; the second loss function is defined as follows:

$$\mathcal{L} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{V}\mathcal{L}_{V} = \lambda_{3D}\sum_{t=1}^{T}\sum_{j=1}^{J}\left\lVert x_{t,j}^{3D} - X_{t,j}^{3D}\right\rVert_2 + \lambda_{V}\sum_{t=2}^{T}\sum_{j=1}^{J}\left\lVert v_{t,j} - V_{t,j}\right\rVert_2$$

$$v_t = x_t^{3D} - x_{t-1}^{3D}, \qquad V_t = X_t^{3D} - X_{t-1}^{3D}$$

where $\mathcal{L}$ represents the second loss function; $\lambda_{3D}$ and $\lambda_{V}$ are constants used to balance the losses; $\mathcal{L}_{3D}$ and $\mathcal{L}_{V}$ represent the 3D pose loss value and the pose velocity loss value, respectively; x and X represent the predicted pose and the labeled pose, respectively; T represents the number of video frames for each prediction; J represents the number of human joints; the superscript 3D indicates that the pose includes three-dimensional coordinates; $x_{t,j}^{3D}$ and $X_{t,j}^{3D}$ represent the 3D coordinates of the j-th joint in the t-th frame of the prediction result and of the labeled data, respectively; v and V represent the pose velocity values calculated from the prediction result and from the labeled data, respectively; $v_{t,j}$ and $V_{t,j}$ represent the pose velocity values of the j-th joint in the t-th frame of the prediction result and of the labeled data, respectively; and $x_t^{3D}$ and $X_t^{3D}$ represent the 3D coordinates of all joints in the t-th frame of the prediction result and of the labeled data, respectively;

In the step of predicting the ratio of long bones of limbs to height, when the human skeleton ratio prediction model is trained for skeleton ratio prediction, the head structure consists of global pooling and a multi-layer neural network; global average pooling is first performed in the temporal dimension, and the tensor dimension is then reduced to eight through the multi-layer neural network to predict the length ratios of eight bones; the third loss function is defined as follows:

$$\mathcal{L}_{B} = \sum_{i=1}^{I}\left\lVert b_i - B_i\right\rVert_2$$

where $\mathcal{L}_{B}$ represents the third loss function; I is eight, indicating the eight bone lengths to be predicted; b and B represent the predicted bone length proportions and the labeled bone length proportions, respectively; and $b_i$ and $B_i$ represent the proportion of the i-th bone length in the prediction result and in the labeled data, respectively.

2. The method for measuring the ratio of long bones of human limbs to height based on images as described in claim 1, characterized in that, in the step of two-dimensional human pose estimation, the loss function and evaluation metric of the two-dimensional human pose estimation model during training are consistent with HRNet, namely, mean squared error (MSE) and object keypoint similarity (OKS):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

where MSE is the mean squared error, n is the number of keypoints, $Y_i$ is the true value of the i-th keypoint, and $\hat{Y}_i$ is the predicted value of the i-th keypoint;

$$\mathrm{OKS} = \frac{\sum_{i}\exp\left(-\dfrac{d_i^{2}}{2 s^{2} k_i^{2}}\right)\delta(v_i > 0)}{\sum_{i}\delta(v_i > 0)}$$

where OKS represents the object keypoint similarity; $d_i$ is the Euclidean distance between the i-th predicted keypoint and its nearest-neighbor keypoint; s is the standard deviation parameter; $k_i$ is a scaling factor representing the scale of the predicted keypoint; $v_i$ is the visibility of the keypoint, equal to 1 if the keypoint is visible in the actual annotation and 0 otherwise; and δ(·) is an indicator function that takes the value 1 when the condition inside the parentheses is true and 0 otherwise.

3. A system for measuring the ratio of long bones of human limbs to height based on images, characterized in that it includes:

An image sequence acquisition module, used to acquire an image sequence containing the target human body;

A target frame selection module, used to select target frames from the image sequence;

A two-dimensional human pose estimation module, used to predict a two-dimensional pose sequence from the image sequence after target frame selection according to a preset two-dimensional human pose estimation model;

A limb long bone to height ratio prediction module, used to take the two-dimensional pose information predicted by the two-dimensional human pose estimation model as the input of a three-dimensional human pose estimation model, construct a human skeleton ratio prediction model, and calculate the ratio of human limb long bones to height;

A potential anomaly labeling module, used to compare the ratio of long bones of human limbs to height calculated from the model detection results with the estimated interval at the preset target confidence level, to determine whether that ratio is abnormal, and to label individuals with potential anomalies;

An update and iteration module, used to acquire human experience data and to verify the model detection results and the potential anomaly labeling results simultaneously based on the human experience data; if there is a large difference between the model detection results and the human experience data, it indicates that the model accuracy is insufficient and the model needs to be updated; if the number of abnormal individuals in the potential anomaly labeling results exceeds the preset proportion threshold, it indicates that the weighting of the linear relationship between the long bones of the limbs and height is incorrect;

In the limb long bone to height ratio prediction module, when the human skeleton ratio prediction model is trained to reconstruct a complete two-dimensional pose from a damaged two-dimensional pose, the head structure is a Transformer decoder, and the prediction result has the same shape as the tensor input to the model; a self-supervised training method using mask modeling is adopted, and the first loss function is defined as follows:

$$\mathcal{L}_{2D} = \sum_{t=1}^{T}\sum_{j=1}^{J}\left\lVert x_{t,j}^{2D} - X_{t,j}^{2D}\right\rVert_2$$

where $\mathcal{L}_{2D}$ represents the first loss function; T represents the number of video frames predicted in each iteration; J represents the number of human joints; x and X represent the predicted pose and the labeled pose, respectively; the superscript 2D indicates that the pose only includes two-dimensional coordinates; $x_{t,j}^{2D}$ and $X_{t,j}^{2D}$ represent the two-dimensional coordinates of the j-th keypoint in the t-th frame of the prediction result and of the labeled data, respectively; and $\lVert\cdot\rVert_2$ represents the L2 norm;

In the limb long bone to height ratio prediction module, when the human skeleton ratio prediction model is trained to predict three-dimensional pose from two-dimensional pose, the head structure is a multi-layer neural network, and the last dimension of the tensor is reduced to three to represent three-dimensional coordinates; the second loss function is defined as follows:

$$\mathcal{L} = \lambda_{3D}\mathcal{L}_{3D} + \lambda_{V}\mathcal{L}_{V} = \lambda_{3D}\sum_{t=1}^{T}\sum_{j=1}^{J}\left\lVert x_{t,j}^{3D} - X_{t,j}^{3D}\right\rVert_2 + \lambda_{V}\sum_{t=2}^{T}\sum_{j=1}^{J}\left\lVert v_{t,j} - V_{t,j}\right\rVert_2$$

$$v_t = x_t^{3D} - x_{t-1}^{3D}, \qquad V_t = X_t^{3D} - X_{t-1}^{3D}$$

where $\mathcal{L}$ represents the second loss function; $\lambda_{3D}$ and $\lambda_{V}$ are constants used to balance the losses; $\mathcal{L}_{3D}$ and $\mathcal{L}_{V}$ represent the 3D pose loss value and the pose velocity loss value, respectively; x and X represent the predicted pose and the labeled pose, respectively; T represents the number of video frames for each prediction; J represents the number of human joints; the superscript 3D indicates that the pose includes three-dimensional coordinates; $x_{t,j}^{3D}$ and $X_{t,j}^{3D}$ represent the 3D coordinates of the j-th joint in the t-th frame of the prediction result and of the labeled data, respectively; v and V represent the pose velocity values calculated from the prediction result and from the labeled data, respectively; $v_{t,j}$ and $V_{t,j}$ represent the pose velocity values of the j-th joint in the t-th frame of the prediction result and of the labeled data, respectively; and $x_t^{3D}$ and $X_t^{3D}$ represent the 3D coordinates of all joints in the t-th frame of the prediction result and of the labeled data, respectively;

In the limb long bone to height ratio prediction module, when the human skeleton ratio prediction model is trained for skeleton ratio prediction, the head structure consists of global pooling and a multi-layer neural network; global average pooling is first performed in the temporal dimension, and the tensor dimension is then reduced to eight through the multi-layer neural network to predict eight bone length ratios; the third loss function is defined as follows:

$$\mathcal{L}_{B} = \sum_{i=1}^{I}\left\lVert b_i - B_i\right\rVert_2$$

where $\mathcal{L}_{B}$ represents the third loss function; I is eight, indicating the eight bone lengths to be predicted; b and B represent the predicted bone length proportions and the labeled bone length proportions, respectively; and $b_i$ and $B_i$ represent the proportion of the i-th bone length in the prediction result and in the labeled data, respectively.

4. The system for measuring the ratio of long bones of human limbs to height based on images as described in claim 3, characterized in that, in the two-dimensional human pose estimation module, the loss function and evaluation metric of the two-dimensional human pose estimation model during training are consistent with HRNet, namely, mean squared error (MSE) and object keypoint similarity (OKS):

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(Y_i - \hat{Y}_i\right)^2$$

where MSE is the mean squared error, n is the number of keypoints, $Y_i$ is the true value of the i-th keypoint, and $\hat{Y}_i$ is the predicted value of the i-th keypoint;

$$\mathrm{OKS} = \frac{\sum_{i}\exp\left(-\dfrac{d_i^{2}}{2 s^{2} k_i^{2}}\right)\delta(v_i > 0)}{\sum_{i}\delta(v_i > 0)}$$

where OKS represents the object keypoint similarity; $d_i$ is the Euclidean distance between the i-th predicted keypoint and its nearest-neighbor keypoint; s is the standard deviation parameter; $k_i$ is a scaling factor representing the scale of the predicted keypoint; $v_i$ is the visibility of the keypoint, equal to 1 if the keypoint is visible in the actual annotation and 0 otherwise; and δ(·) is an indicator function that takes the value 1 when the condition inside the parentheses is true and 0 otherwise.