A semi-supervised 3D skeleton estimation method, apparatus, and device
A semi-supervised method estimates 3D skeletons from 2D skeleton data, addressing the high cost of acquiring 3D skeleton data and the poor generalization of existing models, and enabling efficient 3D skeleton estimation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- XIAMEN MEITUZHIJIA TECH
- Filing Date
- 2022-06-24
- Publication Date
- 2026-06-26
AI Technical Summary
Existing technologies for 3D skeleton data acquisition and annotation are costly, and the algorithm models have poor generalization ability, making it difficult to meet user needs.
A semi-supervised approach estimates 3D skeletons from 2D skeleton data. A neural network is trained using feature extraction and a preset loss function, and its outputs are constrained by human skeleton joint angles and learned latent variables, reducing the dependence on 3D label data.
It greatly reduces the cost of collecting and annotating 3D skeleton data, improves the generalization ability of the algorithm, and meets user needs.
Figure CN115170730B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of deep learning technology, and in particular to a semi-supervised 3D skeleton estimation method, apparatus, and device.
Background Technology
[0002] With the widespread use of artificial intelligence technologies, the number of industrial products related to human skeletons is increasing. Human-computer interaction is the gateway to the metaverse, and body, hand, and eye interaction are all candidate interaction methods with large potential market value that also attract many academic researchers. Human skeleton estimation and recognition technologies have been researched and deployed by many companies and institutions; in particular, after OpenPose was released as open source in 2017, various domestic manufacturers followed suit. At the same time, with the continuous advancement of hardware computing power, some companies have developed their own 3D human skeleton tracking technologies to meet market demand. However, scholars and companies developing 3D skeleton technology must first solve the problem of data acquisition. Some institutions or companies collect and label data themselves, developing their own algorithms and standards for data capture and labeling; this is time-consuming and labor-intensive, and the environmental conditions that can be simulated are limited. Another approach is to use computers to synthesize a large number of 3D human models and randomly generate large amounts of simulated data. Such data differs from data captured in real life, which results in poor generalization of the algorithm model. The cost and time required to collect and annotate 3D data far exceed those for 2D data; 3D data is more expensive and of lower quality, and data problems have long been a pain point when deploying deep learning projects.
Summary of the Invention
[0003] In view of this, the purpose of this invention is to propose a semi-supervised 3D skeleton estimation method, apparatus, and device to at least solve the problem that the collection of training data is difficult and cannot meet user needs in related technologies.
[0004] To achieve the above objectives, this invention provides a semi-supervised 3D skeleton estimation method, the method comprising:
[0005] Acquire an image to be processed, wherein the image to be processed includes a human skeleton or a hand skeleton;
[0006] The image to be processed is input into a skeleton estimation model to obtain the target three-dimensional skeleton coordinates corresponding to the human skeleton or the hand skeleton. The skeleton estimation model is a neural network trained by extracting features from training image data and applying a preset loss function to the resulting skeleton point UVD coordinates, skeleton point spatial coordinates, and latent-variable camera mapping image parameters. The skeleton point UVD coordinates comprise the image coordinates of a skeleton point and the depth value corresponding to those image coordinates.
[0007] Preferably, training the skeleton estimation model by extracting features from training image data and applying the preset loss function to the resulting skeleton point UVD coordinates, skeleton point spatial coordinates, and latent-variable camera mapping image parameters includes:
[0008] Calculating, according to the preset loss function, the individual loss values from the skeleton point UVD coordinates, the skeleton point spatial coordinates, the latent-variable camera mapping image parameters, and preset label data; then weighting and summing all the obtained loss values for backpropagation to train the skeleton estimation model.
[0009] Preferably, the label data includes three-dimensional label data; the step of calculating the loss values according to the preset loss function includes:
[0010] Based on the preset loss function, calculating the first loss value between the skeleton point spatial coordinates and the three-dimensional label data.
[0011] Preferably, the label data includes two-dimensional label data; the step of calculating the loss values according to the preset loss function includes:
[0012] Based on the preset loss function, calculating the second loss value between the skeleton point UVD coordinates and the two-dimensional label data.
[0013] Preferably, the label data includes two-dimensional label data; the step of calculating the loss values according to the preset loss function includes:
[0014] Performing a weak perspective transformation on the skeleton point spatial coordinates using the latent-variable camera mapping image parameters to obtain image spatial coordinates.
[0015] Based on the preset loss function, calculating the third loss value between the image spatial coordinates and the two-dimensional label data.
[0016] Preferably, the label data includes two-dimensional label data; the step of calculating the loss values according to the preset loss function includes:
[0017] Based on the preset loss function, calculating the fourth loss value between the image spatial coordinates and the skeleton point UVD coordinates.
[0018] Preferably, the step of calculating the loss values according to the preset loss function includes:
[0019] Calculating the skeleton parent-child joint angle from the skeleton point spatial coordinates and comparing it with a preset angle; if it is greater than the preset angle, the skeleton parent-child joint angle is used as the fifth loss value.
[0020] To achieve the above objectives, the present invention also provides a semi-supervised 3D skeleton estimation device, the device comprising:
[0021] An acquisition unit is used to acquire an image to be processed, wherein the image to be processed includes a human skeleton or a hand skeleton;
[0022] The processing unit is used to input the image to be processed into a skeleton estimation model to obtain the target three-dimensional skeleton coordinates corresponding to the human skeleton or the hand skeleton. The skeleton estimation model is a neural network trained by extracting features from training image data and applying a preset loss function to the resulting skeleton point UVD coordinates, skeleton point spatial coordinates, and latent-variable camera mapping image parameters. The skeleton point UVD coordinates comprise the image coordinates of a skeleton point and the depth value corresponding to those image coordinates.
[0023] To achieve the above objectives, the present invention also proposes an apparatus comprising a processor, a memory, and a computer program stored in the memory, the computer program being executed by the processor to implement the steps of a semi-supervised 3D skeleton estimation method as described in the above embodiments.
[0024] To achieve the above objectives, the present invention also proposes a computer-readable storage medium storing a computer program that is executed by a processor to implement the steps of a semi-supervised 3D skeleton estimation method as described in the above embodiments.
[0025] Beneficial effects:
[0026] The above solution achieves the task of 3D skeleton estimation by making full use of 2D data, which greatly saves the cost of collecting and labeling 3D skeleton data while meeting user needs.
[0027] The above solution uses 2D skeleton data for 3D skeleton estimation. By using a large amount of 2D skeleton data with little or no 3D label data, incorporating manually defined prior knowledge specific to the 3D human body, and learning latent variables to control the error between the output and the auxiliary labels, a deep learning framework can be trained to output 3D human skeleton coordinates, thus better meeting user requirements.
Attached Figure Description
[0028] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0029] Figure 1 is a flowchart illustrating a semi-supervised 3D skeleton estimation method provided in an embodiment of the present invention.
[0030] Figure 2 is a schematic diagram of the network structure of a skeleton estimation model provided in an embodiment of the present invention.
[0031] Figure 3 is a schematic diagram of a semi-supervised 3D skeleton estimation device provided in an embodiment of the present invention.
[0032] The realization of the invention's objective, its functional characteristics, and advantages will be further explained in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed Implementation
[0033] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to represent selected embodiments of the invention.
[0034] In the description of this invention, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.
[0035] When training the skeleton estimation model in this scheme, 2D skeleton data without 3D human skeleton labels is used for auxiliary supervised training, and common-sense human skeleton angles are used as constraints, for example constraining the angle between the thigh and the calf to be no greater than 180°. In addition, manually defined prior knowledge specific to the 3D human body is incorporated, and latent variables are learned to control the error between the output results and the auxiliary labels. In this way, the model is trained to output 3D human skeleton coordinates using a large amount of 2D skeleton data with little or no 3D label data.
[0036] The present invention will be described in detail below with reference to the embodiments.
[0037] Refer to Figure 1, which is a flowchart of a semi-supervised 3D skeleton estimation method provided in an embodiment of the present invention.
[0038] In this embodiment, the method includes:
[0039] S11, Obtain the image to be processed, wherein the image to be processed includes a human skeleton or a hand skeleton.
[0040] S12, Input the image to be processed into the skeleton estimation model to obtain the target three-dimensional skeleton coordinates corresponding to the human skeleton or the hand skeleton. The skeleton estimation model is a neural network trained by extracting features from training image data and applying a preset loss function to the resulting skeleton point UVD coordinates, skeleton point spatial coordinates, and latent-variable camera mapping image parameters. The skeleton point UVD coordinates comprise the image coordinates of a skeleton point and the depth value corresponding to those image coordinates.
[0041] In this embodiment, the image to be processed refers to the image for which 3D skeleton estimation is to be performed. Specifically, the image to be processed includes regions of a human skeleton or a hand skeleton. The content included in the image to be processed needs to be determined according to the application scenario. For example, if the application scenario is the estimation of a finger skeleton, then the image to be processed needs to include at least one image of a hand; if the application scenario is the estimation of a human skeleton, then the image to be processed needs to include at least one image of a human body. The specific content included in the image to be processed needs to be flexibly adjusted according to the application scenario and is not limited to the content given in the above embodiment.
[0042] Refer to Figure 2, which shows the network structure of the skeleton estimation model. Training image data (using a human skeleton as an example) is input into the skeleton estimation model. The model's feature extraction module (backbone) extracts features; from these, the model outputs a heatmap (UVD map) and, through fully connected layers, the x, y, z spatial coordinates of the skeleton points. It also outputs the latent-variable camera mapping image parameters (s, tx, ty), which are used to project the x, y, z spatial coordinates of the skeleton points onto the image to obtain pixel coordinates. The individual losses between the outputs and the data labels are then calculated, and finally the weighted sum of these losses is used for backpropagation to train the skeleton estimation model.
[0043] Training the skeleton estimation model by extracting features from training image data and applying the preset loss function to the resulting skeleton point UVD coordinates, skeleton point spatial coordinates, and latent-variable camera mapping image parameters includes:
[0044] Calculating, according to the preset loss function, the individual loss values from the skeleton point UVD coordinates, the skeleton point spatial coordinates, the latent-variable camera mapping image parameters, and preset label data; then weighting and summing all the obtained loss values for backpropagation to train the skeleton estimation model.
[0045] Furthermore, the label data includes 3D label data; the step of calculating the loss values according to the preset loss function includes:
[0046] Based on the preset loss function, calculate the first loss value between the spatial coordinates of the skeleton points and the three-dimensional label data.
[0047] To further explain, the skeleton estimation model outputs the xyz spatial coordinates of the skeleton points (which are spatial coordinates relative to the root node of the human skeleton), and the 3d-xyz-loss is only calculated when training with 3D labeled data.
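As an illustration only (not part of the original disclosure), the 3d-xyz-loss described above can be sketched as a term that is skipped for samples without 3D labels. The function name `xyz_loss_3d`, the use of a mean L1 distance, and the sample values are assumptions; the source leaves the concrete loss function open.

```python
from typing import Optional, Sequence, Tuple

Point3D = Tuple[float, float, float]


def xyz_loss_3d(pred_xyz: Sequence[Point3D],
                label_xyz: Optional[Sequence[Point3D]]) -> float:
    """Mean L1 distance between predicted and labelled root-relative xyz.

    Returns 0.0 when the sample carries no 3D label, so the term drops
    out of the weighted sum for 2D-only samples (assumed behaviour).
    """
    if label_xyz is None:
        return 0.0
    total = 0.0
    for p, q in zip(pred_xyz, label_xyz):
        total += sum(abs(a - b) for a, b in zip(p, q))
    return total / (3 * len(pred_xyz))


# Hypothetical two-joint skeleton; the root joint sits at the origin.
pred = [(0.0, 0.0, 0.0), (0.1, -0.4, 0.05)]
label = [(0.0, 0.0, 0.0), (0.1, -0.5, 0.05)]
loss_with_label = xyz_loss_3d(pred, label)
loss_without_label = xyz_loss_3d(pred, None)
```

Returning zero for unlabelled samples lets 2D-only and 3D-labelled images share one training loop.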
[0048] The label data includes two-dimensional label data; the step of calculating the loss values according to the preset loss function includes:
[0049] Based on the preset loss function, calculate the second loss value between the skeleton point uvd coordinates and the two-dimensional label data.
[0050] In this embodiment, the skeleton estimation model outputs a 2D heatmap for each skeleton point. An argmax operation finds the location of the maximum value in that feature map, and mapping this location proportionally to the corresponding location in the original image gives uv, the image coordinates of the skeleton point in the original image. At the same time, the depth value d (in meters) at the location of the maximum is read from the d-map output by the skeleton estimation model. Combining these values gives the UVD coordinates of each skeleton point, where uv are the coordinates of the skeleton point in the image and d is the distance of the point from the camera in real space. The 2D loss is then calculated between these UVD coordinates and the 2D skeleton label data.
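The decoding step described above can be sketched as follows, purely for illustration. The names `decode_uvd` and `uv_loss`, the nested-list representation of the maps, and the choice of a mean L1 distance on the uv components are assumptions, not details given in the source.

```python
from typing import List, Tuple

Grid = List[List[float]]


def decode_uvd(heatmap: Grid, depth_map: Grid,
               image_w: int, image_h: int) -> Tuple[float, float, float]:
    """Decode one skeleton point from its heatmap and the matching d-map."""
    rows, cols = len(heatmap), len(heatmap[0])
    # argmax: location (row, col) of the heatmap maximum
    best_r, best_c = max(
        ((r, c) for r in range(rows) for c in range(cols)),
        key=lambda rc: heatmap[rc[0]][rc[1]],
    )
    # Map the feature-map location proportionally to original image coordinates.
    u = best_c * image_w / cols
    v = best_r * image_h / rows
    # Depth (in meters) read at the same location of the d-map.
    d = depth_map[best_r][best_c]
    return u, v, d


def uv_loss(pred_uvd: Tuple[float, float, float],
            label_uv: Tuple[float, float]) -> float:
    """Mean L1 distance on the uv components (assumed form of the 2D loss)."""
    return (abs(pred_uvd[0] - label_uv[0]) + abs(pred_uvd[1] - label_uv[1])) / 2


# Hypothetical 4x4 maps for a 64x64 input image.
heatmap = [[0.0, 0.1, 0.0, 0.0],
           [0.0, 0.2, 0.9, 0.1],
           [0.0, 0.0, 0.3, 0.0],
           [0.0, 0.0, 0.0, 0.0]]
depth_map = [[2.0] * 4 for _ in range(4)]
depth_map[1][2] = 2.5
uvd = decode_uvd(heatmap, depth_map, image_w=64, image_h=64)
loss_2d = uv_loss(uvd, (30.0, 16.0))
```

A hard argmax is not differentiable, so a practical training setup would need a differentiable substitute or a loss applied to the heatmap itself; the source does not say which is used.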
[0051] The label data includes two-dimensional label data; the step of calculating the loss values according to the preset loss function includes:
[0052] Performing a weak perspective transformation on the skeleton point spatial coordinates using the latent-variable camera mapping image parameters to obtain image spatial coordinates.
[0053] Based on the preset loss function, calculating the third loss value between the image spatial coordinates and the two-dimensional label data.
[0054] In this embodiment, the skeleton points obtained by the skeleton estimation model are transformed into image space by weak perspective transformation using the output latent variable camera mapping image parameters s, tx, ty, i.e. u' = sx + tx, v' = sy + ty. Then, 2d-uv-loss is calculated by combining it with 2D skeleton label data.
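The weak perspective transformation u' = sx + tx, v' = sy + ty can be sketched as below for illustration. The function names, the mean L1 form of the 2d-uv-loss, and the numeric values of s, tx, ty are assumptions chosen for the example.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]
Point2D = Tuple[float, float]


def weak_perspective(xyz: Sequence[Point3D],
                     s: float, tx: float, ty: float) -> List[Point2D]:
    """Project xyz points to image space: u' = s*x + tx, v' = s*y + ty."""
    return [(s * x + tx, s * y + ty) for x, y, _z in xyz]


def uv_loss_2d(pred_uv: Sequence[Point2D],
               label_uv: Sequence[Point2D]) -> float:
    """Mean L1 distance between projected and labelled 2D points (assumed form)."""
    total = 0.0
    for (pu, pv), (lu, lv) in zip(pred_uv, label_uv):
        total += abs(pu - lu) + abs(pv - lv)
    return total / (2 * len(pred_uv))


# Hypothetical model outputs: two skeleton points and camera parameters.
xyz = [(0.0, 0.0, 0.0), (0.5, -1.0, 0.2)]
projected = weak_perspective(xyz, s=100.0, tx=160.0, ty=120.0)
label_uv = [(160.0, 120.0), (212.0, 20.0)]
loss_2d_uv = uv_loss_2d(projected, label_uv)
```

Because s, tx, and ty are predicted by the network rather than measured, this loss lets 2D labels supervise the xyz branch without camera calibration.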
[0055] The label data includes two-dimensional label data; the step of calculating the loss values according to the preset loss function includes:
[0056] Based on the preset loss function, calculate the fourth loss value between the image spatial coordinates and the uvd coordinates of the skeleton point.
[0057] In this embodiment, the xyz spatial coordinates are transformed into image space coordinates (u', v', z) as described above, and the diff-loss is then calculated between them and the skeleton point UVD coordinates output by the skeleton estimation model.
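A minimal sketch of this consistency term follows, for illustration only. It compares (u', v', z) with (u, v, d) component by component using a mean L1 distance, which is an assumed choice; the source does not state how the root-relative z and the camera-space depth d are aligned, so the example simply uses values on the same scale.

```python
from typing import Sequence, Tuple

Point3D = Tuple[float, float, float]


def diff_loss(xyz: Sequence[Point3D], uvd: Sequence[Point3D],
              s: float, tx: float, ty: float) -> float:
    """Mean L1 distance between projected (u', v', z) and predicted (u, v, d)."""
    total = 0.0
    for (x, y, z), (u, v, d) in zip(xyz, uvd):
        u_proj = s * x + tx  # weak perspective, as in u' = sx + tx
        v_proj = s * y + ty  # weak perspective, as in v' = sy + ty
        # Assumption: z and d are compared directly, on the same scale.
        total += abs(u_proj - u) + abs(v_proj - v) + abs(z - d)
    return total / (3 * len(xyz))


# Hypothetical outputs of the two branches for a single skeleton point.
xyz = [(0.5, -1.0, 2.0)]
uvd = [(210.0, 23.0, 2.0)]
loss_diff = diff_loss(xyz, uvd, s=100.0, tx=160.0, ty=120.0)
```

This term needs no labels at all: it only asks the heatmap branch and the xyz branch to agree with each other.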
[0058] The step of calculating the loss values according to the preset loss function includes:
[0059] Calculating the skeleton parent-child joint angle from the skeleton point spatial coordinates and comparing it with a preset angle; if it is greater than the preset angle, the skeleton parent-child joint angle is used as the fifth loss value.
[0060] In this embodiment, reasonable ranges for the parent-child joint angles of the 3D human skeleton can be obtained from forward and inverse human kinematics. For example, to find the angle between the lower leg and the thigh, the direction vector of the lower leg is the vector from the knee coordinates (x2, y2, z2) to the ankle coordinates (x1, y1, z1): (x1-x2, y1-y2, z1-z2); the direction vector of the thigh is the vector from the knee (x2, y2, z2) to the skeleton point at the root of the thigh (x3, y3, z3): (x3-x2, y3-y2, z3-z2). The angle θ between these two direction vectors is then calculated. Based on common sense, the angle between the lower leg and the thigh does not exceed 180°, and the lower leg does not swing forward past the knee. This angle range is used as the preset constraint angle. After the skeleton estimation model outputs the xyz spatial coordinates, the parent-child joint angles are calculated and compared with the corresponding angle ranges, and angles outside the range are used as the angle-loss for backpropagation.
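The angle computation and the comparison with a preset angle can be sketched as follows, for illustration only. The function names and the 150° threshold are hypothetical. Note that an angle obtained from a dot product always lies in [0°, 180°], so a 180° limit can never be exceeded by this unsigned angle; how the source detects the lower leg bending past the knee is not specified, and the sketch therefore uses a tighter illustrative limit.

```python
import math
from typing import Tuple

Point3D = Tuple[float, float, float]


def joint_angle_deg(child_end: Point3D, joint: Point3D,
                    parent_end: Point3D) -> float:
    """Angle at `joint` between joint->child_end and joint->parent_end.

    Assumes the three points are distinct, so neither vector has zero length.
    """
    a = tuple(c - j for c, j in zip(child_end, joint))
    b = tuple(p - j for p, j in zip(parent_end, joint))
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    # Clamp to guard against rounding slightly outside [-1, 1].
    cos_theta = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_theta))


def angle_loss(angle_deg: float, preset_max_deg: float) -> float:
    """Use the angle itself as the loss when it exceeds the preset angle."""
    return angle_deg if angle_deg > preset_max_deg else 0.0


PRESET_MAX_DEG = 150.0  # hypothetical limit, not a value from the source

joint = (0.0, 0.0, 0.0)
parent_end = (0.0, 1.0, 0.0)
straight = joint_angle_deg((0.0, -1.0, 0.0), joint, parent_end)
bent = joint_angle_deg((1.0, 0.0, 0.0), joint, parent_end)
straight_loss = angle_loss(straight, PRESET_MAX_DEG)
bent_loss = angle_loss(bent, PRESET_MAX_DEG)
```

Both vectors start at the shared joint, matching the direction vectors (x1-x2, y1-y2, z1-z2) and (x3-x2, y3-y2, z3-z2) in the example above.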
[0061] Finally, all the above losses are weighted and summed to obtain the final loss, which is then used for backpropagation to train the skeleton estimation model. Because the preset loss function can be set flexibly (for example, the 2d-loss can be an L1 loss, L2 loss, smooth-L1 loss, wing loss, etc.), this embodiment does not prescribe a specific loss calculation.
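The weighted summation can be sketched as below, for illustration only. The loss names follow the terms used in this description; the weight values and the per-term loss values are hypothetical, and `smooth_l1` is included only as one example of the interchangeable loss functions mentioned above.

```python
from typing import Dict


def smooth_l1(error: float, beta: float = 1.0) -> float:
    """Smooth-L1 of a scalar error: quadratic near zero, linear beyond beta."""
    e = abs(error)
    return 0.5 * e * e / beta if e < beta else e - 0.5 * beta


def weighted_total(losses: Dict[str, float],
                   weights: Dict[str, float]) -> float:
    """Weighted sum of the individual loss values."""
    return sum(weights[name] * value for name, value in losses.items())


# Hypothetical per-term values for one training sample without 3D labels.
losses = {
    "3d-xyz-loss": 0.0,
    "2d-loss": 0.5,
    "2d-uv-loss": 0.5,
    "diff-loss": 1.0,
    "angle-loss": 0.0,
}
# Hypothetical weights; the source does not give concrete values.
weights = {
    "3d-xyz-loss": 1.0,
    "2d-loss": 1.0,
    "2d-uv-loss": 1.0,
    "diff-loss": 0.1,
    "angle-loss": 1.0,
}
total_loss = weighted_total(losses, weights)
small_error = smooth_l1(0.5)
large_error = smooth_l1(2.0)
```

In an actual training framework the total would be a differentiable tensor on which backpropagation is run; plain floats are used here only to show the arithmetic.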
[0062] Refer to Figure 3, which is a schematic diagram of a semi-supervised 3D skeleton estimation device provided in an embodiment of the present invention.
[0063] In this embodiment, the device 30 includes:
[0064] The acquisition unit 31 is used to acquire an image to be processed, wherein the image to be processed includes a human skeleton or a hand skeleton;
[0065] The processing unit 32 is used to input the image to be processed into the skeleton estimation model to obtain the target three-dimensional skeleton coordinates corresponding to the human skeleton or the hand skeleton. The skeleton estimation model is a neural network trained by extracting features from training image data and applying a preset loss function to the resulting skeleton point UVD coordinates, skeleton point spatial coordinates, and latent-variable camera mapping image parameters. The skeleton point UVD coordinates comprise the image coordinates of a skeleton point and the depth value corresponding to those image coordinates.
[0066] Furthermore, the processing unit 32 includes:
[0067] The loss calculation unit is used to calculate, according to the preset loss function, the individual loss values from the skeleton point UVD coordinates, the skeleton point spatial coordinates, the latent-variable camera mapping image parameters, and the preset label data, and to weight and sum all the obtained loss values for backpropagation to train the skeleton estimation model.
[0068] Furthermore, the label data includes three-dimensional label data; the loss calculation unit is used for:
[0069] Based on the preset loss function, calculate the first loss value between the spatial coordinates of the skeleton points and the three-dimensional label data.
[0070] Furthermore, the label data includes two-dimensional label data; the loss calculation unit is used for:
[0071] Based on the preset loss function, calculate the second loss value between the skeleton point uvd coordinates and the two-dimensional label data.
[0072] Furthermore, the label data includes two-dimensional label data; the loss calculation unit is used for:
[0073] Performing a weak perspective transformation on the skeleton point spatial coordinates using the latent-variable camera mapping image parameters to obtain image spatial coordinates.
[0074] Based on the preset loss function, calculating the third loss value between the image spatial coordinates and the two-dimensional label data.
[0075] Furthermore, the label data includes two-dimensional label data; the loss calculation unit is used for:
[0076] Based on the preset loss function, calculate the fourth loss value between the image spatial coordinates and the uvd coordinates of the skeleton point.
[0077] Furthermore, the loss calculation unit is used for:
[0078] Calculating the skeleton parent-child joint angle from the skeleton point spatial coordinates and comparing it with a preset angle; if it is greater than the preset angle, the skeleton parent-child joint angle is used as the fifth loss value.
[0079] Each unit module of the device 30 can execute the corresponding steps in the above method embodiment, so the details of each unit module will not be elaborated here. Please refer to the description of the corresponding steps above for details.
[0080] This invention also provides a device comprising the semi-supervised 3D skeleton estimation apparatus described above, wherein the apparatus can adopt the structure of the embodiment shown in Figure 3 and, correspondingly, can execute the technical solution of the method embodiment shown in Figure 1. The implementation principle and technical effect are similar; for details, please refer to the relevant descriptions in the above embodiments, which will not be repeated here.
[0081] The device includes: a mobile phone, digital camera, or tablet computer, or other device with a camera function; or a device with an image processing function; or a device with an image display function. The device may include components such as a memory, processor, input unit, display unit, and power supply.
[0082] The memory can be used to store software programs and modules. The processor executes various functional applications and data processing by running the software programs and modules stored in the memory. The memory can mainly include a program storage area and a data storage area. The program storage area can store the operating system, application programs required for at least one function (such as image playback function), etc.; the data storage area can store data created according to the use of the device. In addition, the memory can include high-speed random access memory, and can also include non-volatile memory, such as at least one disk storage device, flash memory device, or other volatile solid-state storage device. Accordingly, the memory can also include a memory controller to provide access to the memory for the processor and input units.
[0083] The input unit can be used to receive input numerical, character, or image information, and to generate keyboard, mouse, joystick, optical, or trackball signal inputs related to user settings and function control. Specifically, in addition to a camera, the input unit of this embodiment may also include a touch-sensitive surface (e.g., a touch screen) and other input devices.
[0084] The display unit can be used to display information input by the user or information provided to the user, as well as various graphical user interfaces of the device. These graphical user interfaces can be composed of graphics, text, icons, video, and any combination thereof. The display unit may include a display panel, optionally configured as an LCD (Liquid Crystal Display), OLED (Organic Light-Emitting Diode), or other similar display panel. Furthermore, a touch-sensitive surface may cover the display panel. When the touch-sensitive surface detects a touch operation on or near it, it transmits the information to the processor to determine the type of touch event. Subsequently, the processor provides corresponding visual output on the display panel based on the type of touch event.
[0085] This invention also provides a computer-readable storage medium, which may be a computer-readable storage medium included in the memory described in the above embodiments, or a standalone computer-readable storage medium not assembled into a device. The computer-readable storage medium stores at least one instruction, which is loaded and executed by a processor to implement the semi-supervised 3D skeleton estimation method shown in Figure 1. The computer-readable storage medium can be a read-only memory, a hard disk, an optical disk, etc.
[0086] It should be noted that the various embodiments in this specification are described in a progressive manner, with each embodiment focusing on the differences from other embodiments. Similar or identical parts between embodiments can be referred to interchangeably. For the device embodiments, equipment embodiments, and storage medium embodiments, since they are basically similar to the method embodiments, the descriptions are relatively simple, and relevant parts can be referred to the descriptions of the method embodiments.
[0087] Furthermore, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.
[0088] The foregoing description illustrates and describes preferred embodiments of the present invention. It should be understood that the present invention is not limited to the forms disclosed herein and should not be construed as excluding other embodiments. It can be used in various other combinations, modifications, and environments, and can be altered within the scope of the inventive concept by means of the foregoing teachings or techniques or knowledge in related fields. Any modifications and variations made by those skilled in the art that do not depart from the spirit and scope of the present invention should be within the protection scope of the appended claims.
Claims
1. A semi-supervised 3D skeleton estimation method, characterized in that, The method includes: Acquire an image to be processed, wherein the image to be processed includes a human skeleton or a hand skeleton; The image to be processed is input into the skeleton estimation model to obtain the target three-dimensional skeleton coordinates corresponding to the human skeleton or the hand skeleton. The skeleton estimation model is obtained by extracting features from the training image data and training a neural network with the obtained skeleton point uvd coordinates, skeleton point spatial coordinates and latent variable camera mapping image parameters and a preset loss function. The skeleton point uvd coordinates include the image coordinates of the skeleton point and the depth value corresponding to the image coordinates of the skeleton point. The skeleton estimation model is obtained by extracting features from training image data, training a neural network with the obtained skeleton point UVD coordinates, skeleton point spatial coordinates, and latent variable camera mapping image parameters, and a preset loss function. The model includes: The skeleton point UVD coordinates, the skeleton point spatial coordinates, and the latent variable camera mapping image parameters, along with preset label data, are calculated according to the preset loss function. The preset label data includes two-dimensional label data. All the obtained loss values are weighted, summed, and backpropagated to train the skeleton estimation model. 
Specifically, a first loss value is calculated between the skeleton point UVD coordinates and the two-dimensional label data according to the preset loss function; a weak perspective transformation is performed on the skeleton point spatial coordinates using the latent variable camera mapping image parameters to obtain image spatial coordinates; a second loss value is calculated between the image spatial coordinates and the two-dimensional label data according to the preset loss function; and the skeleton parent-child joint angle is calculated from the skeleton point spatial coordinates and compared with a preset angle, and if it is greater than the preset angle, the skeleton parent-child joint angle is used as a third loss value.
2. The semi-supervised 3D skeleton estimation method according to claim 1, characterized in that, The preset label data includes three-dimensional label data; the calculation, according to the preset loss function, of the loss values between the preset label data and the skeleton point UVD coordinates, the skeleton point spatial coordinates, and the latent variable camera mapping image parameters includes: Calculating, according to the preset loss function, a fourth loss value between the skeleton point spatial coordinates and the three-dimensional label data.
3. The semi-supervised 3D skeleton estimation method according to claim 1, characterized in that, The calculation, according to the preset loss function, of the loss values between the preset label data and the skeleton point UVD coordinates, the skeleton point spatial coordinates, and the latent variable camera mapping image parameters includes: Calculating, according to the preset loss function, a fifth loss value between the image spatial coordinates and the skeleton point UVD coordinates.
4. A semi-supervised 3D skeleton estimation device, characterized in that, The device, which uses the semi-supervised 3D skeleton estimation method according to any one of claims 1 to 3, includes: An acquisition unit, used to acquire an image to be processed, wherein the image to be processed includes a human skeleton or a hand skeleton; A processing unit, used to input the image to be processed into a skeleton estimation model to obtain the target three-dimensional skeleton coordinates corresponding to the human skeleton or the hand skeleton. The skeleton estimation model is obtained by extracting features from training image data and training a neural network with the obtained skeleton point UVD coordinates, skeleton point spatial coordinates, and latent variable camera mapping image parameters, together with a preset loss function. The skeleton point UVD coordinates include the image coordinates of the skeleton point and the depth value corresponding to those image coordinates.
5. A device, characterized in that, The device includes a processor, a memory, and a computer program stored in the memory, the computer program being executed by the processor to implement the steps of a semi-supervised 3D skeleton estimation method as described in any one of claims 1 to 3.
6. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that is executed by a processor to implement the steps of a semi-supervised 3D skeleton estimation method as described in any one of claims 1 to 3.
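As an illustration only, the loss terms recited in claims 1 to 3 can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: the claims do not fix the form of the preset loss function, so mean squared error is assumed here; the latent variable camera mapping image parameters are assumed to be a scale and a 2D translation; the first and fifth loss values are assumed to compare only the image-coordinate part of the UVD coordinates; and every function and parameter name is hypothetical.

```python
import numpy as np


def mse(a, b):
    # Assumed stand-in for the "preset loss function"; the claims do not name one.
    return float(np.mean((np.asarray(a, float) - np.asarray(b, float)) ** 2))


def weak_perspective(xyz, cam):
    # Assumed camera parameters: cam = (scale, tx, ty).
    # Maps (J, 3) spatial coordinates to (J, 2) image spatial coordinates.
    scale, trans = cam[0], np.asarray(cam[1:3], float)
    return scale * np.asarray(xyz, float)[:, :2] + trans


def joint_angles(xyz, chains):
    # Angle in radians at each joint of a (parent, joint, child) index triple.
    xyz = np.asarray(xyz, float)
    angles = []
    for p, j, c in chains:
        a, b = xyz[p] - xyz[j], xyz[c] - xyz[j]
        cos = a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8)
        angles.append(np.arccos(np.clip(cos, -1.0, 1.0)))
    return np.array(angles)


def total_loss(uvd, xyz, cam, label_2d, chains, max_angle, weights, label_3d=None):
    uvd = np.asarray(uvd, float)
    img_xy = weak_perspective(xyz, cam)
    ang = joint_angles(xyz, chains)
    losses = np.array([
        mse(uvd[:, :2], label_2d),                         # first loss value
        mse(img_xy, label_2d),                             # second loss value
        float(np.sum(np.where(ang > max_angle, ang, 0))),  # third: angles over the preset angle
        mse(xyz, label_3d) if label_3d is not None else 0.0,  # fourth: only with 3D labels
        mse(img_xy, uvd[:, :2]),                           # fifth: projection vs. UVD consistency
    ])
    # Weighted sum of all loss values; in training this scalar would be backpropagated.
    return float(np.asarray(weights, float) @ losses), losses
```

In this sketch, samples without three-dimensional label data simply contribute zero to the fourth term, which is one way to read the semi-supervised setting; in an actual training loop the same computation would be written in an automatic-differentiation framework so that the weighted sum can be backpropagated.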