Systems and methods for body modeling
By adjusting a 3D human body model generated by a pre-trained artificial neural network based on image information, the approach addresses the inaccurate reflection of posture and body shape in existing technologies, yielding a more accurate, personalized human body model and improving the effectiveness of medical applications.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- SHANGHAI UNITED IMAGING INTELLIGENCE CO LTD
- Filing Date
- 2022-08-01
- Publication Date
- 2026-06-19
AI Technical Summary
Existing 3D human body model generation technology cannot accurately reflect the actual posture and body shape of individual patients, leading to deviations in positioning and treatment during medical applications.
A 3D human model is generated using a pre-trained artificial neural network. By capturing image information of the patient, especially the location of key body points and body shape parameters, the model is adjusted to refine the posture and body shape parameters. Combined with iterative and alternating optimization techniques, the accuracy of the model is improved.
It improves the accuracy of 3D human body models, enabling them to better reflect the actual posture and body shape of individual patients, thereby enhancing positioning and treatment effects in medical applications.
Figure CN115272582B_ABST
Description
Technical Field
[0001] This application relates to the field of body modeling.
Background Technology
[0002] A realistic 3D model (e.g., a mesh) of the patient's body, reflecting their shape and posture, can be used in a variety of medical applications, including patient localization, surgical navigation, and unified medical record analysis. For example, in radiotherapy and medical imaging, the success of a procedure often depends on the ability to position and maintain the patient in the desired location so that the procedure can be performed with precision and accuracy. Real-time knowledge of an individual patient's physical characteristics (such as their shape and posture) in these situations offers numerous benefits, including, for example, faster and more accurate patient localization based on scans or treatment protocols, and more consistent results. In other example cases, such as during surgical procedures, information about an individual patient's shape can provide insights and guidance for both treatment planning and execution. This information can, for example, be used for localization and navigation around the patient's treatment site. When presented visually in real time, this information can also provide a means of monitoring the patient's condition during the procedure.
Summary of the Invention
[0003] 3D human models of patients can be constructed using pre-trained artificial neural networks and based on images of the patient. However, these human models may not accurately represent the true pose and / or body shape of the patient's body depicted in the images. Described herein are systems, methods, and apparatuses for generating individualized (e.g., personalized) human models based on one or more images of a person (e.g., two-dimensional (2D) images). The systems, methods, and / or apparatuses may utilize one or more processors configured to obtain a 3D model of a person, such as a skinned multi-person linear (SMPL) model of the person, wherein the 3D model can be generated based on one or more images of the person using one or more neural networks, and wherein the one or more neural networks can be pre-trained (e.g., using a benchmark training dataset) to generate the 3D model. The one or more processors described herein may also be configured to obtain the one or more images of the person used for generating the 3D model and determine at least one of a first set of body keypoint locations (e.g., anatomical keypoints such as joint locations) or a first set of body shape parameters of the person based on the one or more images of the person. The one or more processors described herein may then adjust the 3D model of the person based on at least one of the first set of body keypoint locations or the first set of body shape parameters of the person. For example, the one or more processors can determine at least one of a second set of body keypoint locations or a second set of body shape parameters based on the 3D model of the person, and adjust the 3D model of the person by minimizing at least one of the difference between the first set of body keypoint locations and the second set of body keypoint locations or the difference between the first set of body shape parameters and the second set of body shape parameters. The first set of body keypoint locations and the first set of body shape parameters can be determined independently of the second set of body keypoint locations or the second set of body shape parameters.
[0004] In examples, the difference between the first set of body keypoint locations and the second set of body keypoint locations may include a first Euclidean distance, and the difference between the first set of body shape parameters and the second set of body shape parameters may include a second Euclidean distance. In examples, a system or device including the one or more processors may also include at least one visual sensor configured to capture the one or more images of the person described herein. The visual sensor may include, for example, a color sensor, a depth sensor, or an infrared sensor.
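The two Euclidean distances mentioned above can be illustrated with a minimal sketch. The function names `keypoint_distance` and `shape_distance`, the choice to sum per-keypoint distances, and the sample values are assumptions made for illustration; the text only states that each difference may include a Euclidean distance.

```python
import numpy as np

def keypoint_distance(first, second):
    """Sum of per-keypoint Euclidean distances between two keypoint sets.

    Summing over keypoints is an illustrative assumption.
    """
    first = np.asarray(first, dtype=float)
    second = np.asarray(second, dtype=float)
    return float(np.linalg.norm(first - second, axis=1).sum())

def shape_distance(first, second):
    """Euclidean distance between two body shape parameter vectors."""
    diff = np.asarray(first, dtype=float) - np.asarray(second, dtype=float)
    return float(np.linalg.norm(diff))

# First set: determined from the images. Second set: determined from the 3D model.
first_keypoints = [[0.0, 0.0], [3.0, 4.0]]
second_keypoints = [[0.0, 0.0], [0.0, 0.0]]
first_shape = [1.0, 2.0, 2.0]
second_shape = [0.0, 0.0, 0.0]

d_keypoints = keypoint_distance(first_keypoints, second_keypoints)
d_shape = shape_distance(first_shape, second_shape)
```

Either distance, or a weighted sum of the two, could serve as the quantity that the model adjustment tries to minimize.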
[0005] In examples, the one or more processors described herein may also be configured to adjust (e.g., refine) the parameters (e.g., weights) of the one or more neural networks based on at least one of the first set of body keypoint locations or the first set of body shape parameters of the person. For example, the one or more processors may be configured to adjust (e.g., refine) the parameters of the one or more neural networks and the 3D model of the person in an iterative and / or alternating manner. In examples, the one or more processors described herein may also be configured to output a representation of the adjusted (e.g., refined) 3D model of the person to a receiving device.
Attached Figure Description
[0006] The examples disclosed herein can be understood in more detail from the following description, which is given by way of example in conjunction with the accompanying drawings.
[0007] Figure 1 is a diagram illustrating an example environment in which the systems, methods, and apparatuses disclosed herein can be applied.
[0008] Figure 2 is a simplified block diagram illustrating an example of a neural network used for image-based reconstruction of 3D human body models.
[0009] Figure 3A is a diagram illustrating example techniques for refining 3D human models predicted by pre-trained neural networks and / or the neural networks themselves.
[0010] Figure 3B is a diagram illustrating an example of jointly optimizing a 3D human body model and a neural network used to generate the 3D human body model.
[0011] Figure 3C is a diagram illustrating progressive improvements that can be made to a 3D human body model using the techniques described herein.
[0012] Figure 4 is a simplified flowchart illustrating example operations associated with refining a 3D human body model based on an image.
[0013] Figure 5 is a simplified flowchart illustrating an example method for training a neural network to perform one or more tasks described herein.
[0014] Figure 6 is a simplified block diagram illustrating an example system or device for performing one or more tasks described herein.
Detailed Implementation
[0015] The present disclosure is illustrated by way of example rather than limitation in the figures.
[0016] Figure 1 is a diagram illustrating an example environment 100 in which an estimated 3D human body model can be adjusted using the methods and apparatus disclosed herein. As shown, environment 100 may be a scanning room configured to provide a medical scanning or imaging procedure using a medical scanner 102 (e.g., a computed tomography (CT) scanner, a magnetic resonance imaging (MRI) machine, a positron emission tomography (PET) scanner, an X-ray machine, etc.), but environment 100 may also be associated with the performance of other types of medical procedures, including, for example, radiotherapy, surgery, etc. (e.g., environment 100 may be an operating room, a treatment room, etc.).
[0017] Environment 100 may include at least one sensing device 104 (e.g., an image capture device) configured to capture images (e.g., 2D or 3D images) of a patient 106, such as the patient standing in front of medical scanner 102, lying on a scanning bed or treatment bed, etc. Sensing device 104 may include one or more sensors, including one or more cameras (e.g., digital cameras, vision sensors, etc.), one or more red, green, and blue (RGB) sensors (or other types of vision sensors), one or more depth sensors, one or more RGB plus depth (RGB-D) sensors, one or more thermal sensors, such as far-infrared (FIR) or near-infrared (NIR) sensors, etc. Depending on the type of sensor used, the images captured by sensing device 104 may include, for example, one or more 2D photographs of patient 106, one or more 2D RGB images of patient 106, etc. In example embodiments, sensing device 104 may be mounted or placed at various locations within environment 100.
[0018] Sensing device 104 may include one or more processors configured to process images of patient 106 captured by the sensors described herein. Alternatively or additionally, the images of patient 106 captured by sensing device 104 may be processed by processing device 108, which is communicatively coupled to sensing device 104 and configured to receive the images of patient 106 captured by sensing device 104. Processing device 108 may be coupled to sensing device 104 (e.g., to the sensors included in sensing device 104), for example, via a communication network 110, which may be a wired or wireless communication network. Thus, even though processing device 108 is shown in Figure 1 as being located in the same environment 100 as sensing device 104 and medical scanner 102, those skilled in the art will understand that processing device 108 may also be located away from environment 100, for example, in a separate room or a different facility.
[0019] In response to acquiring (e.g., capturing or receiving) an image of patient 106, sensing device 104 and / or processing device 108 may utilize a neural network to analyze the image (e.g., at the pixel level) and generate a 3D human model of patient 106 based on the acquired image, wherein the neural network may be pre-trained to generate the 3D human model (e.g., based on a model learned by the neural network through a training process). The 3D human model may include parametric models, such as a skinned multi-person linear (SMPL) model that may indicate the patient 106's body shape (e.g., via multiple body shape parameters β), pose (e.g., via multiple pose parameters θ), and / or other anatomical features. The 3D human model may be presented as, for example, a 3D mesh.
[0020] Sensing device 104 and / or processing device 108 can be configured to refine the 3D human model generated by a pre-trained neural network based on additional information about the patient 106 available to sensing device 104 and / or processing device 108. For example, independent of the human model construction process described above, sensing device 104 and / or processing device 108 can be configured to extract information about the patient 106's body features (e.g., body keypoint locations and / or body shape) from one or more images of the patient 106 captured by sensing device 104, and use the extracted information to adjust the 3D human model of the patient 106 generated by the neural network. For example, body shape and / or pose parameters (β, θ) included in the 3D human model can be adjusted. The images used to perform the adjustment can be, for example, the same images used by the neural network to generate the 3D human model.
[0021] In examples, sensing device 104 and / or processing device 108 may also be configured to refine the parameters of the neural network based on the additional information used to adjust the 3D human body model. For example, sensing device 104 and / or processing device 108 may be configured to refine (e.g., optimize) the parameters of the neural network and the body shape and / or pose parameters (β, θ) of the 3D human body model generated by the neural network in an alternating manner based on the additional information. Refinement (e.g., for one or both of the neural network and the 3D human body model generated by the neural network) may be performed online (e.g., at inference time), for example, based on live images of the patient 106 captured by sensing device 104.
[0022] Sensing device 104 and / or processing device 108 may be configured to display a 3D human model of patient 106 (e.g., a raw 3D model and / or a refined 3D model) on display device 112. Sensing device 104 and / or processing device 108 may also be configured to provide (e.g., via display device 112) a user interface for adjusting information (e.g., body keypoint locations, body contours, etc.) that can be used to refine the 3D human model and / or neural network. For example, the user interface may be configured to receive user adjustments to body keypoint locations, body contours, etc., for refining the 3D human model and / or neural network. In this way, by providing a person (e.g., a clinician) with the ability to adjust / correct values associated with automatically determined anatomical features of patient 106, sensing device 104 and / or processing device 108 can be protected against obvious errors. As described herein, the adjusted / corrected values can then be used to refine / optimize the 3D human model and / or neural network.
[0023] The 3D human body model generated by sensing device 104 and / or processing device 108 can be used to facilitate multiple downstream medical applications and services, including, for example, patient positioning, medical protocol design, unified or related diagnosis and treatment, patient monitoring, surgical navigation, etc. For example, processing device 108 can determine, based on the 3D human body model, whether the position and / or pose of patient 106 meets the requirements of a predetermined protocol (e.g., when patient 106 is standing in front of medical scanner 102 or lying on the scanning bed), and provide (e.g., via display device 112) real-time confirmation or adjustment instructions to help patient 106 enter the desired position and / or pose. Processing device 108 can also control (e.g., adjust) one or more operating parameters of medical scanner 102, such as the height of the scanning bed, based on the body shape of patient 106 indicated by the 3D human body model. As another example, sensing device 104 and / or processing device 108 may be coupled to a medical record library 114 configured to store patient medical records, including scanned images of patient 106 obtained through other imaging modalities (e.g., CT, MR, X-ray, SPECT, PET, etc.). Processing device 108 may use the 3D human body model as a reference to analyze the medical records of patient 106 stored in library 114 to gain a comprehensive understanding of the patient's physical condition. For example, processing device 108 may align scanned images of patient 106 from library 114 with the 3D human body model to allow for presentation (e.g., via display device 112) and analysis of the scanned images with reference to the anatomical features (e.g., body shape and / or posture) of patient 106 as indicated by the 3D human body model.
[0024] Figure 2 illustrates an example of a neural network 200 for reconstructing (e.g., constructing) a 3D human model based on an image 202 (e.g., a 2D image) of a patient. As shown, given an input image 202 of a patient (e.g., patient 106 of Figure 1), the neural network can extract features 206 from the image through a series of convolution operations 204, and infer parameters for reconstructing / estimating a 3D human body model by performing a regression operation 208 based on the extracted features. The inferred parameters may include pose parameters θ and / or body shape parameters β, which may indicate the pose and body shape, respectively, of the patient's body as shown in image 202.
[0025] Neural network 200 may be a convolutional neural network (CNN) comprising multiple layers, including, for example, an input layer, one or more convolutional layers, one or more pooling layers, one or more fully connected layers, and / or an output layer. Each convolutional layer may include multiple filters (e.g., kernels) designed to detect (e.g., extract) features 206 from the input image 202. The filters may be associated with corresponding weights, which, when applied to the input, produce an output indicating whether a particular feature has been detected. The features 206 extracted through the convolutional operations may indicate the locations of multiple body keypoints of the patient (e.g., anatomical keypoints, such as joint locations). For example, features 206 may indicate 23 joint locations of the patient's skeletal rig as well as a root joint of the patient, which neural network 200 can use to infer 72 pose-related parameters θ (e.g., 3 parameters for each of the 23 joints and 3 parameters for the root joint). The neural network 200 can also be configured to determine the body shape parameters β, for example, by performing principal component analysis (PCA) on the input image 202 and providing one or more PCA coefficients (e.g., the first 10 coefficients in the PCA space) determined during the process as the body shape parameters β.
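The parameter counts given above (23 joints with 3 parameters each, plus 3 for the root joint, for 72 pose parameters, and 10 shape coefficients) can be made concrete with a short sketch. The helper name `split_pose` and the assumption that the root joint's parameters come first in the flat vector are illustrative choices, not details stated in the text.

```python
import numpy as np

NUM_BODY_JOINTS = 23    # joints of the skeletal rig, per the text
PARAMS_PER_JOINT = 3    # 3 pose parameters per joint, per the text
NUM_SHAPE_COEFFS = 10   # first 10 PCA coefficients, per the text
NUM_POSE_PARAMS = (NUM_BODY_JOINTS + 1) * PARAMS_PER_JOINT  # 23 * 3 + 3

def split_pose(theta):
    """Split a flat pose vector into root (3,) and body-joint (23, 3) parts.

    Assumption: the root joint occupies the first 3 entries.
    """
    theta = np.asarray(theta, dtype=float)
    if theta.shape != (NUM_POSE_PARAMS,):
        raise ValueError(f"expected {NUM_POSE_PARAMS} pose parameters, got {theta.shape}")
    root = theta[:PARAMS_PER_JOINT]
    body = theta[PARAMS_PER_JOINT:].reshape(NUM_BODY_JOINTS, PARAMS_PER_JOINT)
    return root, body

theta = np.zeros(NUM_POSE_PARAMS)   # placeholder pose parameters
beta = np.zeros(NUM_SHAPE_COEFFS)   # placeholder body shape parameters
root, body = split_pose(theta)
```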
[0026] Using the pose parameters θ and body shape parameters β determined by the neural network 200, a 3D human model of the patient can be constructed, for example, by factoring the parameters into a body shape vector β and a pose vector θ, and deriving multiple vertices (e.g., 6890 vertices) for constructing a representation (e.g., a 3D mesh) of the 3D human model from the shape and pose vectors. Each of these vertices may include corresponding position, normal, texture, and / or shading information, and the 3D mesh may be generated, for example, by connecting multiple vertices with edges to form polygons (e.g., triangles), connecting multiple polygons to form surfaces, using multiple surfaces to determine the 3D shape, and applying textures and / or shading to the surfaces and / or shapes.
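As a small illustration of a mesh built from vertices and triangles, the sketch below computes per-vertex normals for a triangle mesh. The area-weighted averaging scheme, the function name `vertex_normals`, and the tetrahedron used as test data are assumptions for illustration; the text does not specify how normals are obtained.

```python
import numpy as np

def vertex_normals(vertices, faces):
    """Area-weighted unit normals per vertex of a triangle mesh."""
    v = np.asarray(vertices, dtype=float)
    f = np.asarray(faces, dtype=int)
    # Cross product of two triangle edges: its direction is the face normal
    # and its length is twice the triangle area.
    face_n = np.cross(v[f[:, 1]] - v[f[:, 0]], v[f[:, 2]] - v[f[:, 0]])
    normals = np.zeros_like(v)
    for corner in range(3):
        # Accumulate each face normal onto the three vertices of that face.
        np.add.at(normals, f[:, corner], face_n)
    lengths = np.linalg.norm(normals, axis=1, keepdims=True)
    return normals / np.where(lengths > 0, lengths, 1.0)

# Tiny stand-in mesh: a tetrahedron with outward-facing triangles.
vertices = np.array([[0, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 1]], dtype=float)
faces = np.array([[0, 2, 1], [0, 1, 3], [0, 3, 2], [1, 2, 3]])
normals = vertex_normals(vertices, faces)
```

A full body mesh with 6890 vertices would be handled the same way, with `vertices` and `faces` supplied by the parametric model.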
[0027] The weights of the neural network 200 can be learned through a training process that may include: inputting a large number of images from a training dataset into the neural network (e.g., an instance of the neural network), causing the neural network to make predictions about a desired 3D human model (e.g., pose and / or body shape parameters associated with the 3D human model), calculating the difference or loss between the prediction and the gold standard (e.g., based on a loss function, such as a loss function based on mean squared error (MSE)), and updating the weights of the neural network to minimize the difference or loss (e.g., by backpropagating a stochastic gradient descent of the loss through the neural network).
[0028] Once trained and given an image 202 of the patient (e.g., at inference time), the neural network 200 is able to estimate the 3D human body model described herein. However, this estimated 3D human body model may reflect the distribution of body shapes included in the training dataset (e.g., a benchmark dataset). Therefore, if the patient's body shape does not conform to the distribution of the training dataset, the estimated 3D human body model may be a biased representation of the patient. For example, the distribution of body shapes in the benchmark dataset may reflect the body shape of people with average weight. Thus, if the patient is overweight (e.g., has a larger body size than average), the 3D human body model estimated by the neural network 200 may not accurately represent the patient's body shape. This phenomenon may be referred to herein as estimation bias. Additionally, the neural network 200 may encounter other types of prediction errors or defects during the inference process. For example, if a patient's joint is occluded in the input image 202 (e.g., occluded by another object) or blends into the background of the input image 202 (e.g., due to similarity in color and / or brightness), the neural network 200 may miss that joint during the modeling process and produce incorrect results regarding either or both of the patient's pose and body shape. Therefore, it may be necessary to refine the 3D human model generated by the neural network 200 and / or the neural network 200 itself after training.
[0029] Figure 3A illustrates example techniques for refining a 3D human body model 302 (e.g., a 3D mesh) predicted by a neural network 300 (e.g., neural network 200 of Figure 2) and / or the neural network 300 itself. As discussed herein, the 3D human body model 302 can be estimated by the neural network 300 based on an image 304 of a person. However, due to issues related to estimation bias and / or depth ambiguity, the 3D human body model 302 may not accurately reflect the body shape and / or posture of the person shown in the image 304. For example, color similarity between the person's left arm and the tree trunk behind the person may cause the 3D human body model 302 to incorrectly show the person's left arm pointing downwards instead of upwards, and estimation bias caused by the training of the neural network 300 may cause the 3D human body model 302 to show a more slender body shape than the actual body shape of the person.
[0030] Defects in the 3D human body model 302 can be corrected by obtaining additional information about the person's pose and / or body shape and using that additional information to adjust the pose and / or body shape parameters of the 3D human body model 302 (e.g., θ and / or β of Figure 2), so that a refined 3D human body model 308 can be constructed. In examples, the refinement can be accomplished through an iterative process during which the original 3D human body model 302 can be progressively adjusted (e.g., through one or more intermediate models 306a, 306b, etc.) before the refined 3D human body model 308 is obtained. In examples, the additional information for refining the 3D human body model 302 may include body keypoint locations 310 (e.g., anatomical keypoints such as joint locations) determined from the input image 304 and / or body shape information 312 (e.g., body contours or body contour lines) determined based on a depth image or depth map 314.
[0031] Body keypoint locations 310 can be determined independently of the construction of the 3D human model 302. For example, body keypoint locations 310 can be determined using a different neural network (e.g., a 2D keypoint estimation neural network) than the neural network (e.g., neural network 300) used to generate the original 3D human model 302. For example, since 2D keypoint annotations may be richer and / or easier to obtain than 3D annotations, such a 2D keypoint estimation neural network can be trained using a larger dataset than the dataset used to train neural network 300. Therefore, the independently determined body keypoint locations 310 can more accurately represent the anatomical keypoints of the person depicted in image 304. Body shape information 312 can also be determined independently of the construction of the 3D human model 302. For example, body shape information 312 may include body shape contours or body shape outlines, and the depth map 314 used for determining the body shape contours or outlines can be obtained while the person has the pose and / or body shape shown in image 304 (e.g., the depth map 314 can be acquired simultaneously with image 304 by the corresponding sensing device 104 shown in Figure 1). The depth map 314 may include information indicating the corresponding depth values of the pixels in the image 304. Thus, by identifying those pixels that have the same depth values as the pixels on the surface of the human body, the depth map 314 can be used to obtain the outline or contour line of the human body even if parts of the human body are occluded by or blend into a background object of the image 304 (e.g., because the occlusion or blending of some pixels may not affect the depth values of those pixels).
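The idea of recovering a body outline from depth values can be sketched as follows. This is a minimal illustration under assumptions: the body is taken to occupy a known depth range (`near` to `far`), and the outline is defined as body pixels with at least one 4-connected neighbor outside the body. The function names and the toy depth map are hypothetical.

```python
import numpy as np

def body_mask(depth, near, far):
    """Mark pixels whose depth falls in the range assumed to contain the body."""
    return (depth >= near) & (depth <= far)

def contour(mask):
    """Body pixels that have at least one 4-connected neighbor outside the mask."""
    padded = np.pad(mask, 1, constant_values=False)
    interior = (
        padded[:-2, 1:-1]    # neighbor above
        & padded[2:, 1:-1]   # neighbor below
        & padded[1:-1, :-2]  # neighbor to the left
        & padded[1:-1, 2:]   # neighbor to the right
    )
    return mask & ~interior

# Toy depth map in meters: background at 3.0, a 3x3 "body" region at 1.5.
depth = np.full((5, 5), 3.0)
depth[1:4, 1:4] = 1.5
mask = body_mask(depth, near=1.0, far=2.0)
edge = contour(mask)
```

Because the mask depends only on depth, a body region whose color matches the background would still be separated in this sketch.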
[0032] Body keypoint locations 310 and / or body shape information 312 can be used to guide the adjustment (e.g., optimization) of pose parameters θ and / or body shape parameters β of the 3D human body model 302. For example, in response to obtaining the 3D human body model 302, a set of body keypoint locations (e.g., 2D keypoints or body keypoint locations corresponding to body keypoint locations 310) and / or body shape contours (or outlines) can be determined based on the 3D human body model 302. This set of keypoint locations can be determined, for example, based on the vertices included in the 3D human body model 302 and the mapping relationship between the vertices and the 3D body keypoint locations (e.g., the 3D human body model 302 may include information indicating which vertices are 3D body keypoint locations). Using the mapping relationship, multiple 3D body keypoint locations can be determined based on the vertices of the 3D human body model 302, and the 3D body keypoint locations can be projected into a 2D image frame (e.g., using a predetermined camera and / or projection parameters) to obtain the set of keypoint locations. Similarly, given the vertices of a 3D human body model 302, the outline of the human body can also be obtained, for example, using a predefined camera and / or projection parameters.
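The vertex-to-keypoint mapping and 2D projection described above can be sketched as follows. The linear mapping matrix (`joint_map`), the pinhole camera model, and all numeric values are assumptions for illustration; the text only states that a mapping relationship and predetermined camera and / or projection parameters are used.

```python
import numpy as np

def keypoints_from_vertices(vertices, joint_map):
    """Map mesh vertices (V, 3) to 3D keypoints (J, 3) with a (J, V) matrix.

    A one-hot row selects a single vertex as a keypoint; other rows may
    average several vertices.
    """
    return np.asarray(joint_map, dtype=float) @ np.asarray(vertices, dtype=float)

def project_pinhole(points_3d, focal, center):
    """Project 3D points onto the image plane with a simple pinhole camera."""
    points_3d = np.asarray(points_3d, dtype=float)
    uv = focal * points_3d[:, :2] / points_3d[:, 2:3]
    return uv + np.asarray(center, dtype=float)

# Toy mesh with 4 vertices and 2 keypoints (hypothetical values).
vertices = np.array([[0.0, 0.0, 2.0], [1.0, 0.0, 2.0],
                     [0.5, 1.0, 4.0], [-0.5, 1.0, 4.0]])
joint_map = np.array([[1.0, 0.0, 0.0, 0.0],    # keypoint 0 is vertex 0
                      [0.0, 0.0, 0.5, 0.5]])   # keypoint 1 averages vertices 2 and 3
keypoints_3d = keypoints_from_vertices(vertices, joint_map)
keypoints_2d = project_pinhole(keypoints_3d, focal=500.0, center=(320.0, 240.0))
```

The resulting `keypoints_2d` plays the role of the set of keypoint locations determined from the 3D model, which can then be compared with the independently determined locations.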
[0033] Then, the set of body keypoint locations and / or shape contours determined based on the 3D human body model 302 can be compared with the independently determined body keypoint locations 310 and / or body shape information 312 to determine the difference or loss (e.g., Euclidean distance) between the two sets of body keypoint locations and / or the two body shape contours. If a loss (e.g., Euclidean distance) exists (e.g., the loss is greater than a predetermined threshold), the 3D human body model 302 can be adjusted, for example, based on gradient descent of the loss (e.g., with respect to body shape parameters β and / or pose parameters θ) to obtain model 306a. Then, another set of body keypoint locations and / or body shape contours can be determined based on the adjusted model 306a (e.g., using the techniques described herein) and compared with the body keypoint locations 310 and / or body shape information 312 to determine another difference or loss (e.g., another Euclidean distance) between the two sets of body keypoint locations or the two body shape contours. If a loss exists (e.g., the Euclidean distance is greater than a predetermined threshold), model 306a can be further adjusted to obtain another intermediate model 306b, and the above operations can be repeated until the body keypoint locations and / or body shape contours determined from the adjusted model (e.g., 3D human body model 308) are aligned (e.g., approximately aligned) with the body keypoint locations 310 and / or body shape information 312. For example, if the difference (e.g., the Euclidean distance) between the two sets of body keypoint locations and / or the two body shape contours is less than a predetermined threshold, it can be determined that alignment has been achieved.
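The compare, adjust, and repeat loop described above can be sketched with a deliberately small stand-in model. Here the "model" is a 2D keypoint template controlled by a scale and a translation rather than the parameters β and θ, and the loss is a mean squared Euclidean distance so that the gradient has a simple closed form. These simplifications, along with all names and values, are assumptions for illustration.

```python
import numpy as np

def model_keypoints(q, template):
    """Keypoints produced by the toy model: q = [scale, tx, ty]."""
    return q[0] * template + q[1:]

def loss_and_grad(q, template, target):
    """Mean squared distance to the target keypoints and its gradient in q."""
    residual = model_keypoints(q, template) - target
    k = len(template)
    loss = float(np.sum(residual ** 2) / k)
    grad_scale = 2.0 * np.sum(residual * template) / k
    grad_shift = 2.0 * residual.sum(axis=0) / k
    return loss, np.concatenate([[grad_scale], grad_shift])

def refine(q0, template, target, lr=0.1, threshold=1e-6, max_iters=1000):
    """Gradient descent until the loss drops below the threshold."""
    q = np.array(q0, dtype=float)
    history = []
    for _ in range(max_iters):
        loss, grad = loss_and_grad(q, template, target)
        history.append(loss)
        if loss < threshold:   # alignment reached
            break
        q -= lr * grad         # next intermediate model
    return q, history

template = np.array([[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, -1.0]])
target = 1.5 * template + np.array([0.2, -0.1])   # independently determined keypoints
q_refined, history = refine([1.0, 0.0, 0.0], template, target)
```

Each entry of `history` corresponds to one model in the progression from the original estimate, through the intermediate models, to the refined model.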
[0034] In addition to adjusting the 3D human model predicted by the pre-trained neural network 300, the neural network 300 itself can also be adjusted (e.g., optimized) based on the additional information (e.g., body keypoint locations and / or body shape contours) obtained from the input image 304 and / or depth map 314. Figure 3B illustrates an example of jointly optimizing the parameters Q of a 3D human body model (e.g., 3D human body model 302 of Figure 3A) and the parameters P of a neural network (e.g., neural network 300 of Figure 3A). Parameters Q may include the body shape parameters β and / or pose parameters θ of the 3D human model being optimized, while parameters P may include the weights of the neural network being optimized. In examples, the optimization of parameters P and Q can be performed in a multi-step, alternating manner, as shown in Figure 3B. For example, denoting the parameters of the 3D human body model as $\Theta = [\beta, \theta, s, t]$ and the parameters of the neural network as $\Phi$, where β and θ can represent the body shape and pose parameters described herein, s can represent one or more scaling parameters, and t can represent one or more translation parameters, the neural network parameters can be updated based on the following equation (e.g., as shown at step P in Figure 3B):
[0035] $$\alpha^{*} = \arg\min_{\alpha} L_{2D}\big(\pi f(\Phi([I, \alpha])),\; x\big) \tag{1}$$
[0036] where $\alpha^{*}$ can represent a vector containing the updated network parameters $\Phi^{*}$, I can represent the input image 304, x can represent the body keypoint locations (e.g., joints) predicted based on image I, f can represent a composition of functions used to map mesh parameters Θ to vertices V and vertices V to 3D body keypoint locations (e.g., joints) X, π can represent a camera model used to project the 3D body keypoint locations (e.g., joints) onto 2D points, and $\min L_{2D}$ can represent minimizing the loss function $L_{2D}$, which represents the deviation of the predicted body keypoint locations from the gold standard.
[0037] Given $\Phi^{*}$, the neural network can predict updated values of the mesh parameters Θ as $\Theta^{*}_{0} = \Phi^{*}([I, \alpha^{*}])$. This $\Theta^{*}_{0}$ can then be used as the initial parameters for optimizing the mesh parameters Θ to $\Theta^{*}_{1}$ (e.g., at step Q shown in Figure 3B), as follows:
[0038] $$\Theta^{*}_{1} = \arg\min_{\Theta} L_{2D}(\pi M \Theta,\; x) + L_{\theta}(\theta) + L_{shape} \tag{2}$$
[0039] where M can represent the SMPL mapping, $L_{shape}$ and $L_{\theta}(\theta)$ can represent the respective loss functions associated with the estimation of body shape and / or pose (e.g., based on part-based segmentation labels, such as a six-part segmentation strategy including head, torso, left / right arm, and left / right leg), and π, x, and $\min L_{2D}$ can have the same meanings as described above.
[0040] The $\Theta^{*}_{1}$ of equation (2) can then be used as an explicit regularization term to further optimize the neural network parameters, for example, by modifying equation (1) as follows (e.g., as shown at step P in Figure 3B):
[0041] $$\alpha^{*}_{1} = \arg\min_{\alpha} L_{2D}\big(\pi f(\Phi([I, \alpha])),\; x\big) + \big\lVert \Phi([I, \alpha]) - \Theta^{*}_{1} \big\rVert_{2}^{2} \tag{3}$$
[0042] The various symbols may have the same meanings as described herein. Given the further adjusted network parameters $\Phi^{*}_{1}$ (e.g., contained in vector $\alpha^{*}_{1}$), the mesh parameters Θ can be further optimized to $\Theta^{*}_{2}$ (e.g., at step Q shown in Figure 3B), and the operations described above can be repeated, resulting in an iterative, alternating optimization of Θ and α.
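The alternating P and Q steps can be illustrated with a toy sketch. Everything concrete here is an assumption for illustration: the "network" is a linear map `W` applied to a fixed feature vector `z` standing in for the image, the matrix `A` stands in for the mapping from mesh parameters to projected 2D keypoints, a small quadratic prior stands in for the pose and shape losses, and plain gradient descent is used for both steps.

```python
import numpy as np

# Stand-in for the mapping from mesh parameters to projected 2D keypoints.
A = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 0.0]])
z = np.array([1.0, 0.5])             # stand-in for features of the input image
x = A @ np.array([0.5, -0.3, 0.8])   # independently determined keypoints

def data_loss(theta):
    """Squared deviation between model keypoints and observed keypoints."""
    r = A @ theta - x
    return float(r @ r)

def p_step(W, theta_ref, lam=1.0, lr=0.05, steps=50):
    """Update network weights W; optionally regularize toward theta_ref."""
    for _ in range(steps):
        theta = W @ z
        grad_theta = 2.0 * A.T @ (A @ theta - x)
        if theta_ref is not None:
            grad_theta += 2.0 * lam * (theta - theta_ref)
        W = W - lr * np.outer(grad_theta, z)   # chain rule through theta = W z
    return W

def q_step(theta0, mu=0.01, lr=0.05, steps=50):
    """Optimize mesh parameters directly, starting from the network output."""
    theta = theta0.copy()
    for _ in range(steps):
        theta -= lr * (2.0 * A.T @ (A @ theta - x) + 2.0 * mu * theta)
    return theta

W = np.zeros((3, 2))                 # "pre-trained" weights (toy values)
theta_ref = None                     # no regularizer before the first Q step
losses = [data_loss(W @ z)]
for _ in range(3):
    W = p_step(W, theta_ref)         # step P: refine the network
    theta_ref = q_step(W @ z)        # step Q: refine the mesh parameters
    losses.append(data_loss(theta_ref))
```

In this sketch the first P step uses only the keypoint term, and later P steps add the pull toward the most recent Q-step result, mirroring the role of the regularization term described above.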
[0043] The optimization techniques described herein can be used as a drop-in addition to improve the performance of a pre-trained 3D body estimation neural network (e.g., neural network 200 of Figure 2 and / or neural network 300 of Figure 3A). This can address problems associated with overfitting, estimation bias, etc., allowing the results generated by the neural network and / or the neural network itself to be improved so as to provide accurate fits for different body sizes. As shown in Figure 3B, for example, the optimization techniques can be applied by alternating between the P and Q steps, resulting in improvements to both the human model parameters and the network parameters.
[0044] Figure 3C illustrates examples of progressive improvements that can be made to a 3D human body model using the techniques described with respect to Figure 3A and Figure 3B. As shown in Figure 3C, the 3D human body model can become more personalized (e.g., a better fit) to the posture and body shape of the individual depicted in the input image 404.
[0045] Figure 4 illustrates example operations associated with adjusting a 3D human body model based on images of a person. At 402, a system or device configured to perform the operations can obtain a 3D model of a person, wherein the 3D model can be generated based on one or more images of the person (e.g., 2D images) using one or more neural networks, and wherein the one or more neural networks can be pre-trained to generate the 3D model. At 404, the system or device can obtain the one or more images of the person (e.g., image 202 of Figure 2), which depict one or more features of the person (e.g., pose, body shape, etc.). At 406, the system or device can analyze the one or more images of the person to determine at least one of a first set of body keypoint locations or a first set of body shape parameters based on the images. At 408, the system or device can adjust the 3D model of the person based on at least one of the first set of body keypoint locations or the first set of body shape parameters determined at 406. For example, the system or device can compare the first set of body keypoint locations or the first set of body shape parameters with a second set of body keypoint locations or a second set of body shape parameters determined based on the 3D model, and adjust the 3D model to minimize the difference between the two sets of body keypoint locations or the two sets of body shape parameters.
[0046] For the sake of simplicity, the operations are depicted and described in a specific order herein. However, it should be understood that these operations can occur in various orders, simultaneously, and/or with other operations not presented or described herein. Furthermore, it should be noted that Figure 4 does not depict all operations that the system or device is capable of performing. It should also be noted that not all illustrated operations are required to be performed by the system or device.
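The comparison and adjustment at 406 and 408 can be sketched as follows. This is a minimal Python sketch under stated assumptions, not the method of this disclosure: the four-point TEMPLATE, the three model parameters (a scale standing in for body shape and a 2D translation standing in for pose), the hard-coded detected keypoints, and the step size are all hypothetical. It shows only the idea of minimizing the Euclidean distance between a first set of keypoints (from the image) and a second set (from the model).

```python
import math

# Hypothetical template keypoints of the model in a neutral state (2D for brevity).
TEMPLATE = [(-1.0, 0.0), (1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]


def model_keypoints(scale, tx, ty):
    """Second set of keypoints: derived from the current model parameters."""
    return [(scale * bx + tx, scale * by + ty) for bx, by in TEMPLATE]


def mean_distance(first, second):
    """Mean Euclidean distance between corresponding keypoints of two sets."""
    return sum(math.hypot(ax - bx, ay - by)
               for (ax, ay), (bx, by) in zip(first, second)) / len(first)


def adjust_model(image_keypoints, scale=1.0, tx=0.0, ty=0.0, steps=50, lr=0.05):
    """Gradient descent on the summed squared distance between the two sets."""
    for _ in range(steps):
        g_scale = g_tx = g_ty = 0.0
        predicted = model_keypoints(scale, tx, ty)
        for (bx, by), (px, py), (ix, iy) in zip(TEMPLATE, predicted, image_keypoints):
            rx, ry = px - ix, py - iy  # residual for this keypoint
            g_scale += 2.0 * (rx * bx + ry * by)
            g_tx += 2.0 * rx
            g_ty += 2.0 * ry
        scale, tx, ty = scale - lr * g_scale, tx - lr * g_tx, ty - lr * g_ty
    return scale, tx, ty


# First set of keypoints: stands in for the output of a keypoint detector at 406.
detected = [(-1.2, -0.2), (1.8, -0.2), (0.3, 1.3), (0.3, -1.7)]
before = mean_distance(detected, model_keypoints(1.0, 0.0, 0.0))
scale, tx, ty = adjust_model(detected)
after = mean_distance(detected, model_keypoints(scale, tx, ty))
print(before, after)
```

The squared distance is used as the objective because it is differentiable everywhere, while the plain Euclidean distance is reported before and after the adjustment as a readable measure of fit.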
[0047] Figure 5 illustrates example operations associated with training a neural network (e.g., neural network 200 of Figure 2 or neural network 300 of Figure 3A) according to one or more embodiments described herein. For example, at 502, the parameters of the neural network (e.g., weights associated with various filters or kernels of the neural network) can be initialized. The parameters can be initialized, for example, based on samples collected from one or more probability distributions or parameter values from another neural network with a similar architecture. At 504, the neural network can receive training images of a person (e.g., 2D images of a person). At 506, the neural network can predict a 3D model based on the training images. At 508, the neural network can compare the predicted model with a gold standard model and determine the loss based on this comparison. The loss can be determined, for example, based on the mean squared error between the predicted model and the gold standard model, the L1 norm, the L2 norm, etc. At 510, the neural network can determine whether one or more training termination criteria have been met. For example, if the aforementioned loss is below a predetermined threshold, or if the change in loss between two training iterations (e.g., between consecutive training iterations) is below a predetermined threshold, the training termination criteria can be considered met. If it is determined at 510 that the training termination criteria have been met, training can end. Otherwise, before training returns to 506, the neural network can adjust its parameters at 512 by backpropagating the loss through the neural network (e.g., gradient descent based on the loss).
[0048] For the sake of simplicity, the training steps are depicted and described in a specific order herein. However, it should be understood that training operations can occur in various orders, simultaneously, and/or with other operations not presented or described herein. Furthermore, it should be noted that not all operations that may be included in the training process are depicted and described herein, and not all illustrated operations need to be performed.
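The control flow of operations 502 through 512 can be sketched with a deliberately small model. The Python sketch below is an assumption-laden illustration rather than the training procedure of this disclosure: a two-parameter linear model replaces the neural network, four number pairs replace the training images and gold standard models, and the thresholds, learning rate, and iteration cap are hypothetical values.

```python
import random

# Toy training set: inputs and gold-standard outputs (here y = 2x + 1).
XS = [0.0, 1.0, 2.0, 3.0]
GOLD = [1.0, 3.0, 5.0, 7.0]


def train(loss_threshold=1e-8, change_threshold=1e-12, lr=0.1, max_iters=1000, seed=0):
    rng = random.Random(seed)
    # 502: initialize parameters with samples from a probability distribution.
    w, b = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
    previous_loss = None
    for iteration in range(max_iters):
        # 506: predict from the training inputs.
        predictions = [w * x + b for x in XS]
        # 508: compare with the gold standard (mean squared error).
        errors = [p - g for p, g in zip(predictions, GOLD)]
        loss = sum(e * e for e in errors) / len(errors)
        # 510: check the termination criteria.
        if loss < loss_threshold:
            break
        if previous_loss is not None and abs(previous_loss - loss) < change_threshold:
            break
        previous_loss = loss
        # 512: gradient descent step on the loss, then return to 506.
        grad_w = 2.0 * sum(e * x for e, x in zip(errors, XS)) / len(errors)
        grad_b = 2.0 * sum(errors) / len(errors)
        w, b = w - lr * grad_w, b - lr * grad_b
    return w, b, loss, iteration


w, b, final_loss, iterations = train()
print(w, b, final_loss, iterations)
```

The iteration cap is an extra safeguard not mentioned in the text; it simply prevents an endless loop if neither termination criterion is ever met.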
[0049] The systems, methods, and/or apparatuses described herein may be implemented using one or more processors, one or more storage devices, and/or other suitable auxiliary devices (such as display devices, communication devices, input/output devices, etc.). Figure 6 is a block diagram illustrating an example device 600 that can be configured to perform the model and neural network optimization tasks described herein. As shown, device 600 may include a processor (e.g., one or more processors) 602, which may be a central processing unit (CPU), graphics processing unit (GPU), microcontroller, reduced instruction set computer (RISC) processor, application-specific integrated circuit (ASIC), application-specific instruction set processor (ASIP), physics processing unit (PPU), digital signal processor (DSP), field-programmable gate array (FPGA), or any other circuitry or processor capable of performing the functions described herein. Device 600 may also include communication circuitry 604, memory 606, mass storage device 608, input device 610, and/or communication link 612 (e.g., a communication bus) through which one or more of the components shown in the figure can exchange information.
[0050] Communication circuitry 604 can be configured to send and receive information using one or more communication protocols (e.g., TCP/IP) and one or more communication networks, including local area networks (LANs), wide area networks (WANs), the Internet, and wireless data networks (e.g., Wi-Fi, 3G, 4G/LTE, or 5G networks). Memory 606 may include a storage medium (e.g., a non-transitory storage medium) configured to store machine-readable instructions that, when executed, cause processor 602 to perform one or more functions described herein. Examples of machine-readable media may include volatile or non-volatile memory, including but not limited to semiconductor memory (e.g., electrically programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM)), flash memory, etc. Mass storage device 608 may include one or more disks, such as one or more internal hard disks, one or more removable disks, one or more magneto-optical disks, one or more CD-ROMs or DVD-ROMs, etc., on which instructions and/or data may be stored for operation of processor 602. Input device 610 may include a keyboard, mouse, voice-controlled input device, touch-sensitive input device (e.g., touch screen), etc., for receiving user input to device 600.
[0051] It should be noted that device 600 can operate as a standalone device or can be connected to other computing devices (e.g., networked or grouped) to perform the functions described herein. Even though only one instance of each component is shown in Figure 6, those skilled in the art will understand that device 600 may include multiple instances of one or more of the components shown in the figure.
[0052] Although this disclosure has been described according to certain embodiments and generally associated methods, changes and variations of the embodiments and methods will be apparent to those skilled in the art. Therefore, the above description of exemplary embodiments does not limit this disclosure. Other changes, substitutions, and modifications are possible without departing from the spirit and scope of this disclosure. Furthermore, unless specifically stated otherwise, discussions using terms such as “analyze,” “determine,” “enable,” “identify,” and “modify” refer to the actions and processes of a computer system or similar electronic computing device that manipulate and transform data representing physical (e.g., electronic) quantities within the registers and memories of the computer system into other data representing physical quantities within the computer system's memory or other such information storage, transmission, or display devices.
[0053] It should be understood that the above description is intended to be illustrative and not restrictive. Many other embodiments will become apparent to those skilled in the art after reading and understanding the above description.
Claims
1. An apparatus for obtaining a human model, comprising: one or more processors configured to: obtain a three-dimensional (3D) model of a person, wherein the 3D model is generated based on one or more images of the person using one or more neural networks, and wherein the one or more neural networks are pre-trained to generate the 3D model; obtain the one or more images of the person; and determine a first set of body key point locations or a first set of body shape parameters of the person based on the one or more images of the person, wherein the first set of body key point locations or the first set of body shape parameters is determined independently of the 3D model generated by the one or more neural networks; wherein the one or more processors are further configured to: determine, based on the 3D model of the person, at least one of a second set of body key point locations or a second set of body shape parameters of the person; and adjust the 3D model of the person by minimizing at least one of a difference between the first set of body key point locations and the second set of body key point locations or a difference between the first set of body shape parameters and the second set of body shape parameters of the person; wherein the difference between the first set of body key point locations and the second set of body key point locations of the person includes a first Euclidean distance, or the difference between the first set of body shape parameters and the second set of body shape parameters of the person includes a second Euclidean distance.
2. The apparatus of claim 1, wherein the first set of body key point locations and the first set of body shape parameters of the person are determined independently of the second set of body key point locations and the second set of body shape parameters of the person.
3. The apparatus of claim 1, further comprising at least one visual sensor, color sensor, depth sensor, or infrared sensor configured to capture the one or more images of the person.
4. The apparatus of claim 1, wherein the one or more processors are configured to adjust the parameters of the one or more neural networks based on at least one of the first set of body key point locations of the person or the first set of body shape parameters of the person.
5. The apparatus of claim 4, wherein the one or more processors are configured to alternately adjust the parameters of the one or more neural networks and the 3D model of the person.
6. The apparatus of claim 1, wherein the 3D model of the person includes a skinned multi-person linear (SMPL) model.
7. The apparatus of claim 1, wherein the one or more processors are further configured to output a representation of the 3D model of the person to a receiving device after adjusting the 3D model of the person based on at least one of the first set of body key point locations or the first set of body shape parameters of the person.
8. A method for obtaining a human body model, the method comprising: obtaining a three-dimensional (3D) model of a person, wherein the 3D model is generated based on one or more images of the person using one or more neural networks, and wherein the one or more neural networks are pre-trained to generate the 3D model; obtaining the one or more images of the person; determining, based on the one or more images of the person, at least one of a first set of body key point locations or a first set of body shape parameters of the person, wherein the first set of body key point locations or the first set of body shape parameters is determined independently of the 3D model generated by the one or more neural networks; determining, based on the 3D model of the person, at least one of a second set of body key point locations or a second set of body shape parameters of the person; and adjusting the 3D model of the person by minimizing at least one of a difference between the first set of body key point locations and the second set of body key point locations or a difference between the first set of body shape parameters and the second set of body shape parameters of the person; wherein the difference between the first set of body key point locations and the second set of body key point locations of the person includes a first Euclidean distance, or the difference between the first set of body shape parameters and the second set of body shape parameters of the person includes a second Euclidean distance.
9. The method of claim 8, wherein the first set of body key point locations and the first set of body shape parameters of the person are determined independently of the second set of body key point locations and the second set of body shape parameters of the person.
10. The method of claim 8, further comprising capturing the one or more images of the person using at least one visual sensor, color sensor, depth sensor, or infrared sensor.
Citation Information
Patent Citations
System and method for human posture and shape estimation
CN112419419A
Machine learning systems and methods of estimating body shape from images
US10679046B1