Method, apparatus, electronic device, medium, and product for determining hand pose

By using multi-camera shooting and image processing technology, a hand mesh is constructed and vertex data is corrected, which solves the problem of accuracy in determining hand shape and improves gesture control and interaction effects.

CN118486045B · Active · Publication Date: 2026-06-26 · BEIJING UNICORN TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING UNICORN TECH CO LTD
Filing Date
2023-03-31
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

In some scenarios, it is difficult to accurately determine the hand shape, which affects gesture control and interaction effects.

Method used

By using multiple cameras to capture images of the hand from different angles, key point data of the hand are determined through image feature extraction, a hand mesh is constructed, and vertex data is corrected using virtual point cloud data to achieve an accurate definition of the hand shape.

Benefits of technology

It improves the accuracy of hand shapes, ensuring the effectiveness of gesture control and interaction.

✦ Generated by Eureka AI based on patent content.

Abstract

The present disclosure provides a method, device, electronic device, medium and product for determining hand posture. The specific implementation scheme is: determining hand key point data by using image features of multiple hand images taken at the same time; determining each vertex data in the hand mesh based on the hand key point data; determining virtual point cloud data based on parameters of each camera and image features of each of the multiple hand images; and correcting each vertex data in the hand mesh by using the virtual point cloud data to determine the hand posture of the target hand.

Description

Technical Field

[0001] This disclosure relates to the field of machine vision technology, and more particularly to a method, apparatus, electronic device, medium, and product for determining hand shape.

Background Technology

[0002] In some scenarios, there is a need to determine and apply hand shapes, such as the need to determine the shape of a user's hand.

Summary of the Invention

[0003] Embodiments of this disclosure provide a method, apparatus, electronic device, medium, and product for determining hand shape.

[0004] According to one aspect of the present disclosure, a method for determining hand shape is provided, comprising: determining hand key point data using image features of multiple hand images captured at the same time; wherein the multiple hand images are captured by multiple cameras positioned at different angles of the target hand; the hand key points include at least one point for characterizing the location of fingertips, knuckles, wrist, and palm; determining vertex data in a hand mesh based on the hand key point data; wherein the hand mesh includes a mesh structure distributed on the surface of the hand, the mesh structure including multiple vertices; determining virtual point cloud data based on parameters of the multiple cameras and image features of the multiple hand images; and correcting the vertex data in the hand mesh using the virtual point cloud data to determine the hand shape of the target hand.

[0005] According to another aspect of the present disclosure, an apparatus for determining hand shape is provided, comprising: a first determining module, configured to determine key point data of the hand using image features of multiple hand images captured at the same time; wherein the multiple hand images are captured by multiple cameras positioned at different angles of the target hand; the key points of the hand include at least one point for characterizing the location of the fingertips, knuckles, wrist, and palm; a second determining module, configured to determine vertex data of a hand mesh based on the key point data of the hand; wherein the hand mesh includes a mesh structure distributed on the surface of the hand, and the mesh structure includes multiple vertices; a third determining module, configured to determine virtual point cloud data based on the parameters of the multiple cameras and the image features of the multiple hand images; and a fourth determining module, configured to correct the vertex data of the hand mesh using the virtual point cloud data to determine the hand shape of the target hand.

[0006] According to another aspect of this disclosure, a computer-readable storage medium is provided that stores computer program instructions, which, when executed by a processor, implement the method described above for determining hand shape.

[0007] According to another aspect of this disclosure, an electronic device is provided, comprising: a memory for storing a computer program; and a processor for executing the computer program stored in the memory, wherein when the computer program is executed, it implements the method described above for determining hand shape.

[0008] According to another aspect of this disclosure, a computer program product is provided, including computer program instructions that, when executed by a processor, implement the above-described method for determining hand shape.

[0009] The technical solutions of this disclosure will be further described in detail below with reference to the accompanying drawings and embodiments.

Attached Figure Description

[0010] The above and other objects, features, and advantages of this disclosure will become more apparent from the more detailed description of the embodiments thereof in conjunction with the accompanying drawings. The drawings are provided to further illustrate the embodiments of this disclosure and form part of the specification. They are used together with the embodiments of this disclosure to explain the disclosure and do not constitute a limitation thereof. In the drawings, the same reference numerals generally represent the same components or steps.

[0011] Figure 1 is a flowchart illustrating a method for determining hand shape provided in an exemplary embodiment of this disclosure.

[0012] Figure 2-1 is a schematic diagram of key points of the hand in an exemplary embodiment of this disclosure.

[0013] Figure 2-2 is a schematic diagram of a hand mesh in an exemplary embodiment of this disclosure.

[0014] Figure 3 is a flowchart illustrating a method for determining hand shape provided in another exemplary embodiment of this disclosure.

[0015] Figure 4 is a flowchart illustrating the method of determining individual vertex data in a hand mesh in an exemplary embodiment of this disclosure.

[0016] Figure 5 is a schematic diagram of the hand state corresponding to the hand template in an exemplary embodiment of this disclosure.

[0017] Figure 6 is a flowchart illustrating a method for correcting the initial vertex features and initial vertex coordinates of a vertex in an exemplary embodiment of this disclosure.

[0018] Figure 7-1 is a schematic diagram illustrating the principle of determining the hand shape in an exemplary embodiment of this disclosure.

[0019] Figure 7-2 is a schematic diagram illustrating the iterative optimization of the position coordinates and vertex features of vertices in a hand mesh according to an exemplary embodiment of this disclosure.

[0020] Figure 8 is a schematic diagram of the process of obtaining virtual point cloud data by location embedding in an exemplary embodiment of this disclosure.

[0021] Figure 9 is a schematic diagram of the space faced by the camera during shooting in an exemplary embodiment of this disclosure.

[0022] Figure 10 is a schematic diagram illustrating the principle of obtaining virtual point cloud data through location embedding in an exemplary embodiment of this disclosure.

[0023] Figure 11 is a flowchart illustrating a method for determining point features in a virtual point cloud in an exemplary embodiment of this disclosure.

[0024] Figure 12 is a schematic diagram illustrating the principle of obtaining virtual point cloud data through projection aggregation in an exemplary embodiment of this disclosure.

[0025] Figure 13 is a flowchart illustrating the feature fusion algorithm in an exemplary embodiment of this disclosure.

[0026] Figure 14 is a schematic diagram of the structure of a device for determining hand shape provided in an exemplary embodiment of the present disclosure.

[0027] Figure 15 is a structural diagram of an electronic device provided in an exemplary embodiment of this disclosure.

Detailed Implementation

[0028] Hereinafter, exemplary embodiments according to the present disclosure will be described in detail with reference to the accompanying drawings. Obviously, the described embodiments are merely some embodiments of the present disclosure, and not all embodiments of the present disclosure, and it should be understood that the present disclosure is not limited to the exemplary embodiments described herein.

[0029] It should be noted that, unless otherwise specifically stated, the relative arrangement, numerical expressions, and values of the components and steps set forth in these embodiments do not limit the scope of this disclosure.

[0030] Those skilled in the art will understand that the terms "first," "second," etc., in the embodiments of this disclosure are only used to distinguish different steps, devices, or modules, and do not represent any specific technical meaning, nor do they indicate a necessary logical order between them.

[0031] It should also be understood that in the embodiments disclosed herein, "a plurality of" may refer to two or more, and "at least one" may refer to one, two or more.

[0032] It should also be understood that any component, data or structure mentioned in the embodiments of this disclosure can generally be understood as one or more unless expressly defined or given to the contrary in the context.

[0033] Furthermore, the term "and / or" in this disclosure is merely a description of the relationship between related objects, indicating that three relationships can exist. For example, A and / or B can represent: A existing alone, A and B existing simultaneously, or B existing alone. Additionally, the character " / " in this disclosure generally indicates that the preceding and following related objects have an "or" relationship.

[0034] It should also be understood that the description of the various embodiments in this disclosure emphasizes the differences between them; for parts that are the same or similar, the embodiments may be cross-referenced. For the sake of brevity, those parts will not be described again in detail.

[0035] At the same time, it should be understood that, for ease of description, the dimensions of the various parts shown in the accompanying drawings are not drawn according to actual scale.

[0036] The following description of at least one exemplary embodiment is merely illustrative and is in no way intended to limit this disclosure or its application or use.

[0037] Techniques, methods, and equipment known to those skilled in the art may not be discussed in detail, but where appropriate, such techniques, methods, and equipment should be considered part of the specification.

[0038] It should be noted that similar labels and letters in the following figures indicate similar items; therefore, once an item is defined in one figure, it does not need to be discussed further in subsequent figures.

[0039] Exemplary Overview

[0040] In certain scenarios (hereinafter referred to as predetermined scenarios for ease of explanation), there is a need to determine and apply hand gestures. For example, in the context of head-mounted display devices, if the device provides gesture control functionality, the user's hand gesture can be determined, and corresponding control icons can be rendered in the virtual space of the head-mounted display device based on the determined hand gesture for interaction with the displayed content. Alternatively, in the context of terminal devices (such as mobile phones, PCs, etc.), if the device provides gesture control functionality, the hand gestures appearing within the visible area of the terminal device (the area captured by multiple cameras on the terminal device) can be determined for interaction with the displayed content in the terminal device's virtual space.

[0041] The hand shape can be used to determine the corresponding gesture, and the hand shape can be represented by a hand mesh. The hand mesh can be represented by multiple vertices distributed across the hand surface. With the coordinates of each vertex determined, a unique hand mesh can be obtained, and the hand shape can be determined from this mesh. The more accurate the coordinates of each vertex, the more accurate the gesture reflected by the resulting hand mesh. Therefore, the key to determining the hand shape is determining the coordinates of each vertex in the hand mesh.

[0042] It should be noted that head-mounted display devices can also be called head-mounted displays (HMDs). HMDs can be used to achieve augmented reality (AR), virtual reality (VR), and mixed reality (MR) effects. Because HMDs can create a unique sense of immersion, users subjectively feel as if they are in a space isolated from reality; this space can be considered the virtual space of the HMD. Optionally, the HMD can be AR glasses, VR glasses, MR glasses, etc.

[0043] Of course, the intended scenarios are not limited to the use cases of head-mounted display devices. Intended scenarios can also include robot grasping scenarios, sign language recognition scenarios, etc., which will not be listed here.

[0044] It should be noted that the accuracy of the hand shape determined in the intended scenario greatly affects the application effect of the hand shape. Therefore, it is necessary to take certain measures to ensure the accuracy of the determined hand shape.

[0045] Exemplary methods

[0046] Figure 1 is a flowchart illustrating a method for determining hand shape provided in an exemplary embodiment of this disclosure. The method shown in Figure 1 may include steps 110, 120 and 130, which are described below.

[0047] Step 110: Use multiple hand images captured at the same time to determine hand key point data.

[0048] In one optional embodiment of this disclosure, multiple hand images are obtained by capturing images of the target hand using multiple cameras positioned at different angles. Optionally, the relative poses of the multiple cameras are fixed.

[0049] In one optional embodiment of this disclosure, a multi-camera system can be set up in the predetermined scenario. The multi-camera system may include multiple cameras with different angles and fixed relative poses. In the scenario of using a head-mounted display device, the user's hand can be used as the target hand. The multiple cameras configured on the head-mounted display device can simultaneously capture images of the user's hand, thereby obtaining multiple hand images.

[0050] In another optional embodiment of this disclosure, the predetermined scenario can be a robot grasping scenario, in which a camera array can be set up. The camera array can include multiple cameras with different angles and fixed relative poses. The robot's robotic hand can be used as the target hand, and the multiple cameras can capture images of the robotic hand at the same time, thereby obtaining multiple images of the hand.

[0051] For ease of understanding, the embodiments disclosed herein are illustrated using the case of N cameras and N hand images. The N hand images correspond one-to-one with the N cameras. Each of the N hand images can be a three-channel image, such as an RGB image, where R represents red, G represents green, and B represents blue. The dimensions of each of the N hand images can be represented as H1*W1*3, where H1 represents the image height, W1 represents the image width, and 3 represents the number of channels.

[0052] By processing and analyzing the N hand images, hand key point data can be determined. Hand key points can include, but are not limited to, points representing the major bone joints and finger joints of the hand. Assuming there are M hand key points, the hand key point data can include M 3D coordinates, one for each of the M key points. For example, as shown in Figure 2-1, M can be 21; of course, M can also take other values.

[0053] Step 120: Based on the hand key point data, determine the vertex data in the hand mesh.

[0054] In one alternative embodiment of this disclosure, the hand mesh includes a mesh structure distributed on the surface of the hand, the mesh structure including multiple vertices.

[0055] As shown in Figure 2-2, the hand mesh can include a mesh structure distributed on the surface of the hand. This mesh structure can include multiple vertices and may further include edges connecting adjacent vertices. Assuming there are W vertices, step 120 can determine the vertex data for each of the W vertices, thus obtaining W pieces of vertex data. W can be 778 or another value. Vertex data may include, but is not limited to, vertex coordinates and vertex features.

[0056] In one optional embodiment of this disclosure, a hand mesh can be constructed from scratch based on hand key point data, or the vertex data of a preset template hand mesh can be adjusted one by one based on the hand key point data to obtain the hand mesh.

[0057] It should be noted that vertex features, as well as the other features mentioned below (such as image features and point features), are part of the information flow and implicitly encode certain semantic information. The dimensions of the various features can be the same or different, and can be set according to performance requirements and the available time budget. Generally, a higher dimension gives better performance at the cost of longer processing time, while a lower dimension gives worse performance but shorter processing time.

[0058] Step 130: Correct the vertex data in the hand mesh to determine the hand shape of the target hand.

[0059] In one optional embodiment of this disclosure, the W vertex data determined in step 120 can be corrected using methods based on traditional machine learning, deep learning, or other types of correction algorithms. The corrected W vertex data effectively define the hand shape presented by the hand mesh, which can then be used as the true hand shape of the target hand.

[0060] In the embodiments of this disclosure, multi-view images (i.e., multiple hand images) of the target hand can be obtained through multi-view image capture for determining hand key point data. The multi-view images contain information about the target hand from different perspectives, which helps ensure the accuracy and reliability of the hand key point data. Referring to the hand key point data, the data of each vertex in the hand mesh can be determined efficiently and reliably. Combined with vertex data correction operations, the corrected hand mesh can effectively reflect the true shape of the target hand. Therefore, the embodiments of this disclosure can better guarantee the accuracy of the hand shape determined in a predetermined scenario.

[0061] Figure 3 is a flowchart illustrating a method for determining hand shape provided in another exemplary embodiment of this disclosure. The method shown in Figure 3 may include steps 310, 320, 330 and 340, each of which will be explained below.

[0062] Step 310: Use the image features of multiple hand images taken at the same time to determine the key point data of the hand.

[0063] In one optional embodiment of this disclosure, multiple hand images are obtained by capturing the target hand using multiple cameras positioned at different angles. Key hand points may include at least one point representing the location of the fingertips, knuckles, wrist, and palm. Optionally, the relative poses of the multiple cameras are fixed.

[0064] In one optional embodiment of this disclosure, before step 310, N hand images corresponding to N cameras can be acquired. The size of each of the N hand images can be represented as H1*W1*3. The specific acquisition method can be referred to the relevant description of step 110 above, and will not be repeated here.

[0065] In an optional embodiment of this disclosure, prior to step 310, features can be extracted from the N hand images using an image feature extraction network, such as a convolutional neural network (CNN) for image feature extraction, to obtain N image features.

[0066] In an optional embodiment of this disclosure, the N image features can correspond one-to-one with the N hand images. Each of the N image features can take the form of a feature map whose dimensions can be represented as H2*W2*C, where H2 represents the feature height, W2 represents the feature width, and C represents the number of channels. To reduce computational and storage requirements, H2 and W2 can be the values obtained by downsampling H1 and W1 by a factor of $2^n$, for example by a factor of 8, 16, or 32. C can usually be expressed as $2^m$ (m is a positive integer); for example, C can be 128, 256, etc. Generally speaking, the larger the value of C, the better the effect, but the greater the computational load and the lower the efficiency. Therefore, C can be set to an appropriate value according to the actual situation.
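As an illustrative, non-limiting sketch of the size relationship described above, the following Python snippet computes H2*W2*C from H1, W1, the downsampling exponent n, and the channel count C. The function name `feature_map_shape` and the use of ceiling division (which assumes the feature extraction network pads partial windows) are assumptions made for illustration and are not taken from this disclosure.

```python
def feature_map_shape(h1: int, w1: int, n: int, c: int) -> tuple:
    """Return (H2, W2, C) for an H1*W1*3 image downsampled by a factor of 2**n."""
    if n < 0 or c <= 0:
        raise ValueError("n must be non-negative and c must be positive")
    stride = 2 ** n
    # Ceiling division; assumes the network pads so that partial windows
    # at the image border still produce one output cell (an assumption).
    h2 = -(-h1 // stride)
    w2 = -(-w1 // stride)
    return (h2, w2, c)


# A 256*256*3 image, downsampled by 8 (n = 3), with C = 128 channels.
print(feature_map_shape(256, 256, 3, 128))  # (32, 32, 128)
```

Under these assumptions, doubling the downsampling factor quarters the number of spatial cells, which is why larger factors reduce computation and storage.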

[0067] In one optional embodiment of this disclosure, the number of hand key points involved in step 310 can be M. Some of the M hand key points can be used to represent the location of the fingertips, another part of the M hand key points can be used to represent the location of the finger joints, yet another part of the M hand key points can be used to represent the location of the wrist, and yet another part of the M hand key points can be used to represent the location of the palm.

[0068] In one optional embodiment of this disclosure, for each of the N image features, the 2D coordinates corresponding to each of the M hand keypoints can be obtained by calculation using methods such as heatmap extremum methods, probabilistic graph methods, and direct regression methods. Optionally, when using the heatmap extremum method for calculation, for each of the N image features, heatmaps corresponding to the M hand keypoints can be calculated. Taken together, the heatmaps for one image feature can be represented with size H3*W3*M, where H3 represents the height of the heatmap, W3 represents the width of the heatmap, and each of the M channels is the heatmap of one hand keypoint. Each heatmap can be used to determine the 2D coordinates corresponding to one of the M hand keypoints. H3 and H2 can be the same, and W3 and W2 can be the same.
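The heatmap extremum method can be sketched as follows, purely as an illustration. Each heatmap is given as a list of rows, and the location of its maximum is mapped back to input-image pixels. The function names, the `stride` parameter (the ratio between image size and heatmap size), and the half-cell centre convention are all assumptions for illustration; the disclosure does not specify them.

```python
def heatmap_argmax(heatmap):
    """Return (x, y) of the largest value in a 2D heatmap given as a list of rows."""
    best_xy = (0, 0)
    best_val = heatmap[0][0]
    for y, row in enumerate(heatmap):
        for x, val in enumerate(row):
            if val > best_val:
                best_val = val
                best_xy = (x, y)
    return best_xy


def keypoints_from_heatmaps(heatmaps, stride):
    """Map M heatmaps (each H3 x W3) to M (u, v) pixel coordinates.

    `stride` is the assumed ratio between image size and heatmap size.
    """
    coords = []
    for heatmap in heatmaps:
        x, y = heatmap_argmax(heatmap)
        # Use the centre of the winning cell; the +0.5 offset is an assumed convention.
        coords.append(((x + 0.5) * stride, (y + 0.5) * stride))
    return coords
```

Applying `keypoints_from_heatmaps` to the M heatmaps of each of the N views yields the N sets of 2D coordinates that the subsequent 3D step consumes.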

[0069] In one optional embodiment of this disclosure, the parameters of N cameras can be obtained in advance through camera calibration. The parameters of any camera may include its intrinsic and extrinsic parameters. The intrinsic parameters are the inherent parameters of the camera, such as focal length, optical center, magnification, etc. The extrinsic parameters may include at least one of the camera's pose in the reference coordinate system and the relative pose of the camera with other cameras.
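To make the role of the intrinsic and extrinsic parameters concrete, the following is a minimal sketch of the standard pinhole camera model, which maps a 3D point in the reference coordinate system to a pixel. The disclosure does not state which camera model is used, so the pinhole model without lens distortion, the function name `project_point`, and the parameter names `fx`, `fy`, `cx`, `cy`, `rotation`, and `translation` are assumptions for illustration.

```python
def project_point(point, fx, fy, cx, cy, rotation, translation):
    """Project a 3D point to pixel coordinates with a pinhole model.

    fx, fy (focal lengths) and cx, cy (optical centre) are intrinsic parameters.
    rotation (3x3, row-major) and translation (length 3) are extrinsic parameters
    mapping reference coordinates to camera coordinates.
    """
    cam = [
        sum(rotation[i][k] * point[k] for k in range(3)) + translation[i]
        for i in range(3)
    ]
    if cam[2] <= 0:
        raise ValueError("point is behind the camera")
    u = fx * cam[0] / cam[2] + cx
    v = fy * cam[1] / cam[2] + cy
    return (u, v)


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
u, v = project_point((0.1, -0.2, 1.0), 500.0, 500.0, 320.0, 240.0,
                     identity, [0.0, 0.0, 0.0])
```

With a fixed multi-camera rig, each camera has its own intrinsic parameters, and the extrinsic parameters of the cameras differ only by their fixed relative poses.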

[0070] In one optional embodiment of this disclosure, by combining the parameters of each of the N cameras and the 2D coordinate sets corresponding to each of the N image features, the 3D coordinates corresponding to each of the M hand key points can be roughly obtained by using the Direct Linear Transformation (DLT) method or a deep learning-based method to form hand key point data.
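As a hedged illustration of how 2D coordinates from several calibrated views can be lifted to a 3D coordinate, the sketch below performs linear triangulation in the spirit of DLT. It uses the inhomogeneous least-squares form (fixing the homogeneous coordinate to 1 and solving 3x3 normal equations) rather than the more common SVD-based homogeneous form, so that it runs with the Python standard library alone. The 3x4 projection matrices, the function names, and this choice of solver are assumptions for illustration and are not the disclosure's implementation.

```python
def solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with partial pivoting."""
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate camera configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def triangulate_point(proj_mats, pixels):
    """Estimate one 3D point from its (u, v) observations in two or more views.

    proj_mats: list of 3x4 projection matrices (intrinsics times extrinsics).
    pixels:    list of (u, v) observations, one per view.
    """
    rows, rhs = [], []
    for p, (u, v) in zip(proj_mats, pixels):
        for coord, r in ((u, 0), (v, 1)):
            # Each observation gives a linear constraint: (coord * P[2] - P[r]) . X = 0
            a = [coord * p[2][k] - p[r][k] for k in range(4)]
            rows.append(a[:3])
            rhs.append(-a[3])
    # Normal equations (A^T A) x = A^T b of the over-determined system.
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    atb = [sum(r[i] * b for r, b in zip(rows, rhs)) for i in range(3)]
    return solve3(ata, atb)
```

Calling `triangulate_point` once per hand keypoint, with that keypoint's 2D coordinates from each of the N views, would give the M rough 3D coordinates that make up the hand key point data.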

[0071] Step 320: Based on the hand key point data, determine the vertex data of each vertex in the hand mesh.

[0072] In one alternative embodiment of this disclosure, the hand mesh includes a mesh structure distributed on the surface of the hand, the mesh structure including multiple vertices.

[0073] It should be noted that the representation and acquisition method of the hand mesh, as well as the composition of the vertex data, can be referred to the relevant introduction of step 120 above, and will not be repeated here.

[0074] Step 330: Based on the parameters of each of the multiple cameras and the image features of each of the multiple hand images, determine the virtual point cloud data.

[0075] In an optional embodiment of this disclosure, in step 330, based on the parameters of each of the N cameras, N image features can be aggregated to obtain virtual point cloud data in 3D space. The virtual point cloud data can be point cloud data used to represent the target hand and the surrounding environment of the target hand.

[0076] In one optional embodiment of this disclosure, the virtual point cloud can be a sparse point cloud. The virtual point cloud may include multiple virtual point cloud points. The virtual point cloud data may include the point coordinates and point features of each of the multiple virtual point cloud points. Optionally, the number of virtual point cloud points may be the same as or different from W mentioned above.

[0077] Step 340: Using virtual point cloud data, the vertex data in the hand mesh are corrected to determine the hand shape of the target hand.

[0078] In one optional embodiment of this disclosure, virtual point cloud data can be used to correct the W vertex data in the hand mesh once, and the hand shape defined by the W vertex data after correction can be used as the real hand shape of the target hand.

[0079] In another optional embodiment of this disclosure, virtual point cloud data can be used to iteratively correct the W vertex data in the hand mesh multiple times. The hand shape defined by the W vertex data after multiple corrections can be used as the real hand shape of the target hand.
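The difference between a single correction and multiple iterative corrections can be sketched as a loop around a correction step. In the sketch below, the correction step is passed in as a function. The `toy_correction_step` shown here, which moves each vertex part of the way toward its nearest virtual point cloud point and ignores vertex features and point features entirely, is only a placeholder chosen so the example runs; it is not the correction described in this disclosure, and all names are hypothetical.

```python
import math


def toy_correction_step(vertices, cloud_points, step=0.5):
    """Placeholder update: move each vertex a fraction `step` toward its nearest point."""
    corrected = []
    for v in vertices:
        target = min(cloud_points, key=lambda p: math.dist(v, p))
        corrected.append(tuple(a + step * (b - a) for a, b in zip(v, target)))
    return corrected


def correct_vertices(vertices, cloud_points, correction_step, num_iterations=1):
    """Apply `correction_step` once (num_iterations=1) or repeatedly (num_iterations>1)."""
    for _ in range(num_iterations):
        vertices = correction_step(vertices, cloud_points)
    return vertices


once = correct_vertices([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], toy_correction_step)
twice = correct_vertices([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0)], toy_correction_step, 2)
```

In this toy setting, each additional iteration halves the remaining distance to the target, which illustrates why repeated correction can refine the result of a single pass.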

[0080] In the embodiments of this disclosure, multi-view images (i.e., multiple hand images) of the target hand can be obtained through multi-view image capture. The image features of the multi-view images can be used to determine the key point data of the hand. The image features of the multi-view images carry the feature information of the target hand from different perspectives, which helps to ensure the accuracy and reliability of the key point data of the hand. Referring to the key point data of the hand, the vertex data of each vertex in the hand mesh can be determined efficiently and reliably. In addition, by combining the camera parameters and the image features of the multi-view images, virtual point cloud data can be determined for the correction of vertex data. In this way, the corrected hand mesh can effectively reflect the real hand shape of the target hand. Therefore, the embodiments of this disclosure can better ensure the accuracy of the hand shape determined in the predetermined scene.

[0081] In an alternative embodiment of this disclosure, in step 320, steps 3201 and 3203 shown in Figure 4 can be performed for each vertex in the hand mesh.

[0082] Step 3201: Based on the hand key point data, determine the initial anchor point position coordinates corresponding to the vertex.

[0083] In one alternative embodiment of this disclosure, the initial anchor point position coordinates are the position coordinates of the hand key point closest to the vertex.

[0084] In one optional embodiment of this disclosure, for W vertices and M hand keypoints, the relative positional relationships between each vertex and each hand keypoint can be predefined. In step 3201, for each of the W vertices, the hand keypoint closest to that vertex can be determined quickly among the M hand keypoints based on the predefined relative positional relationships. The position coordinates of the hand keypoint found in this way can be used as the initial anchor point position coordinates corresponding to that vertex.
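One way to realize such a predefined relationship, shown here only as an assumption-laden sketch, is to precompute for each vertex the index of its nearest keypoint on a template hand, and then to look up that index at run time. The table-based representation, the use of Euclidean distance on a template, and the function names are illustrative assumptions and are not specified by this disclosure.

```python
import math


def build_anchor_table(template_vertices, template_keypoints):
    """Offline: for each of the W template vertices, store the index of its nearest keypoint."""
    table = []
    for v in template_vertices:
        dists = [math.dist(v, k) for k in template_keypoints]
        table.append(dists.index(min(dists)))
    return table


def initial_anchors(anchor_table, keypoints_3d):
    """Online: look up the anchor coordinates for every vertex from the M estimated keypoints."""
    return [keypoints_3d[j] for j in anchor_table]


table = build_anchor_table(
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.9, 0.1, 0.0)],
    [(0.0, 0.0, 0.1), (1.0, 0.0, 0.1)],
)
anchors = initial_anchors(table, [(5.0, 5.0, 5.0), (6.0, 6.0, 6.0)])
```

Because the table is built once, the run-time cost per vertex is a single list lookup, which is consistent with the fast determination described above.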

[0085] Step 3203: Determine the offset of the vertex relative to the initial anchor point position coordinates, and apply the offset to the initial anchor point position coordinates corresponding to the vertex to obtain the initial vertex coordinates.

[0086] In one alternative embodiment of this disclosure, the initial vertex features of the vertex are random initialization values.

[0087] In one alternative embodiment of this disclosure, a first mapping function may be pre-learned for obtaining the offset from the initial anchor point position coordinates.

[0088] In an optional embodiment of this disclosure, in step 3203, the initial anchor point position coordinates determined in step 3201 are used as input to the first mapping function, which then calculates and outputs the corresponding offset. Assume that the initial anchor point position coordinates determined in step 3201 are represented as $x_j$, the first mapping function is denoted as $f_m$, the offset calculated by the first mapping function is represented as $f_m(x)_i$, and the initial vertex coordinates of this vertex are represented as $v_{init}$. The initial vertex coordinates of this vertex can then be calculated using the following formula:

[0089] $v_{init} = x_j + f_m(x)_i$

[0090] In one optional embodiment of this disclosure, the first mapping function may output the offset for each vertex individually. Alternatively, the hand keypoint data may be input together into the first mapping function, and then the first mapping function outputs the offset of each vertex. For example, the first mapping function may input M-dimensional hand keypoint data and output W-dimensional offsets, where M is the number of hand keypoints and W is the number of vertices.
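The batched variant, in which all hand keypoint data is input at once and one offset per vertex is output, can be sketched as follows. The first mapping function is a learned function in this disclosure; here it is represented by an arbitrary callable `f_m`, and the `constant_offsets` stand-in, the `anchor_table` argument (index of the anchor keypoint for each vertex), and the function names are hypothetical and used only so that the example runs.

```python
def initial_vertex_coords(keypoints, anchor_table, f_m):
    """Compute v_init = x_j + f_m(x)_i for every vertex i with anchor keypoint j."""
    offsets = f_m(keypoints)  # expected: one 3D offset per vertex (W offsets)
    if len(offsets) != len(anchor_table):
        raise ValueError("f_m must return one offset per vertex")
    return [
        tuple(a + o for a, o in zip(keypoints[j], offset))
        for j, offset in zip(anchor_table, offsets)
    ]


def constant_offsets(keypoints):
    """Stand-in for the learned first mapping function (ignores its input)."""
    return [(0.1, 0.0, 0.0), (0.0, 0.2, 0.0), (0.0, 0.0, -0.5)]


coords = initial_vertex_coords(
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],  # M = 2 keypoints
    [0, 1, 1],                           # W = 3 vertices and their anchors
    constant_offsets,
)
```

In a real system, `f_m` would be replaced by the pre-learned mapping, whose form is not fixed by this sketch.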

[0091] In one alternative embodiment of this disclosure, the vertex features can be a set of parameters that can be learned. Optionally, the features can be initialized using a random Gaussian distribution, and the initialized features can be used as the initial vertex features of the vertex.
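Random Gaussian initialization of the vertex features can be sketched as below. The feature dimension, the zero mean, the standard deviation of 0.02, and the optional seed are assumed values chosen for illustration; the disclosure does not specify them.

```python
import random


def init_vertex_features(num_vertices, dim, std=0.02, seed=None):
    """Draw one `dim`-dimensional feature per vertex from a zero-mean Gaussian."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(num_vertices)]


# W = 778 vertices with an assumed 8-dimensional feature each.
features = init_vertex_features(778, 8, seed=0)
```

Since these features are learnable parameters, the random values serve only as a starting point that training would subsequently adjust.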

[0092] In the embodiments of this disclosure, by utilizing the relative positional relationship between each vertex and each key point, and combining the offset application operation and the initialization operation, the determination of the initial vertex coordinates and initial vertex features can be achieved efficiently and reliably, so as to obtain vertex data including the initial vertex coordinates and initial vertex features. Therefore, the vertex data determination operation is very convenient and fast to implement.

[0093] Of course, the implementation of step 320 is not limited to this. For example, a second mapping function can be pre-learned to obtain vertex coordinates from the initial anchor point position coordinates. In this way, after determining the initial anchor point position coordinates corresponding to the vertex based on the hand keypoint data, the initial anchor point position coordinates are substituted into the second mapping function, and the second mapping function can calculate and output the corresponding initial vertex coordinates.

[0094] In one optional embodiment of this disclosure, the second mapping function may output the initial vertex coordinates for each vertex individually. Alternatively, the hand keypoint data may be input together into the second mapping function, and then the second mapping function may output the initial vertex coordinates for each vertex. For example, the second mapping function may input M-dimensional hand keypoint data and output W-dimensional initial vertex coordinates.

[0095] In one optional embodiment of this disclosure, a vertex feature can be preset and the preset vertex feature can be directly used as the initial vertex feature of the vertex.

[0096] In one optional example, each vertex data includes: the initial vertex coordinates and initial vertex features of that vertex. The virtual point cloud data includes: the point coordinates and point features of each of the multiple virtual point cloud points.

[0097] In one optional embodiment of this disclosure, step 340 includes:

[0098] For any vertex, perform the following operation:

[0099] Using the hand template together with the point coordinates and point features of the multiple virtual point cloud points, correct the initial vertex features and initial vertex coordinates of the vertex to obtain the corrected vertex features and corrected vertex coordinates of the vertex.

[0100] In one alternative embodiment of this disclosure, the relative positional relationships of the vertices in the hand mesh are defined in the hand template.

[0101] In one optional embodiment of this disclosure, the hand template can correspond to the hand in the straightened state shown in Figure 5. The hand template can include the template coordinates of the W vertices in the hand mesh, so that the relative positional relationship of the vertices can be represented by the differences between their template coordinates.

[0102] Of course, the hand template can also correspond to other states besides the straight hand position, such as the hand being in a naturally extended position, the hand making an OK gesture, or the hand making a V-shape gesture.

[0103] In an optional embodiment of this disclosure, the initial vertex features of the vertex may include, in addition to the learnable parameters mentioned above, the hand template and the initial vertex coordinates. Thus, if the learnable parameters are denoted $q_w$, the hand template is denoted $v_{tmpl}$ (a constant), the initial vertex coordinates of the vertex are denoted $v_{init}$, and the current vertex feature of the vertex is denoted $q$, then:

[0104] $q = \mathrm{Concat}(q_w, v_{tmpl}, v_{init})$

[0105] Here, Concat denotes the feature concatenation operation. When $q_w$ takes its initial value, $q$ is the initial vertex feature of the vertex.
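The concatenation above can be illustrated with a minimal NumPy sketch. The feature width `feat_dim`, the vertex count, and the function name `build_vertex_features` are illustrative assumptions, not values from the disclosure; the Gaussian initialization follows the optional embodiment in paragraph [0091].

```python
import numpy as np

def build_vertex_features(v_tmpl, v_init, feat_dim=32, seed=0):
    """Build q = Concat(q_w, v_tmpl, v_init) for every vertex.

    v_tmpl, v_init: (W, 3) arrays of template and initial vertex coordinates.
    feat_dim is a hypothetical width for the learnable parameters q_w.
    """
    rng = np.random.default_rng(seed)
    # q_w: learnable parameters, here only initialized from a Gaussian.
    q_w = rng.normal(0.0, 1.0, size=(v_tmpl.shape[0], feat_dim))
    return np.concatenate([q_w, v_tmpl, v_init], axis=-1)

W = 4  # hypothetical number of mesh vertices
v_tmpl = np.zeros((W, 3))
v_init = np.ones((W, 3))
q = build_vertex_features(v_tmpl, v_init)  # shape (W, feat_dim + 6)
```

In a trained network, `q_w` would be updated by gradient descent, while `v_tmpl` stays constant.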

[0106] In one optional embodiment of this disclosure, the initial vertex features and initial vertex coordinates of the vertex are corrected using a hand template, the point coordinates and point features of multiple virtual point cloud points, to obtain the corrected vertex features and corrected vertex coordinates of the vertex, including:

[0107] The current vertex features and current vertex coordinates of the vertex are updated at least once, and each update includes steps 3401, 3403, 3405, and 3407 shown in Figure 6.

[0108] Step 3401: Based on the hand template and the current vertex features of the vertex, determine the first reference vertex features of the vertex.

[0109] In one alternative embodiment of this disclosure, the initial value of the current vertex feature is the initial vertex feature.

[0110] In an optional embodiment of this disclosure, the first reference vertex feature can be calculated using the following formula:

[0111] q' = selfAttn(q)

[0112] Where q' represents the first reference vertex feature, selfAttn represents the self-attention mechanism, which can utilize the relative positional relationships of each vertex defined by the hand template during operation, and q represents the current vertex feature.

[0113] Substituting the current vertex features of the vertex into the above formula for calculation, the hand template can play a certain role in relative position constraint through the operation of the self-attention mechanism, so as to form the first reference vertex features that conform to the relative position constraint.
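The disclosure does not specify the internal form of selfAttn. As one possible reading, the sketch below assumes standard single-head scaled dot-product self-attention over all vertices; the projection matrices `Wq`, `Wk`, `Wv` are hypothetical learnable weights, and the template enters only through the vertex features `q` (which contain $v_{tmpl}$ by concatenation).

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attn(q, Wq, Wk, Wv):
    """Single-head scaled dot-product self-attention over W vertices.

    q: (W, C) current vertex features; Wq, Wk, Wv: (C, C) assumed projections.
    Returns the (W, C) first reference vertex features q'.
    """
    Q, K, V = q @ Wq, q @ Wk, q @ Wv
    weights = softmax(Q @ K.T / np.sqrt(K.shape[-1]))  # (W, W), rows sum to 1
    return weights @ V

rng = np.random.default_rng(0)
q = rng.normal(size=(5, 8))
Wq, Wk, Wv = (rng.normal(size=(8, 8)) for _ in range(3))
q_ref = self_attn(q, Wq, Wk, Wv)
```

Because every vertex attends to every other vertex, each output feature mixes information from the whole mesh, which is how a relative position constraint could propagate between vertices.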

[0114] Of course, the methods for determining the features of the first reference vertex are not limited to this. For example, other mechanisms besides self-attention can also be used to determine the features of the first reference vertex. Furthermore, the formula used to obtain the features of the first reference vertex can be multiplied or divided by a preset coefficient on the right side of the equals sign.

[0115] Step 3403: Based on the current vertex coordinates of the vertex and the respective point coordinates of the multiple virtual point cloud points, determine a preset number of neighboring points among the multiple virtual point cloud points that are adjacent to the vertex.

[0116] In one alternative embodiment of this disclosure, the initial value of the current vertex coordinates is the initial vertex coordinates.

[0117] In an optional embodiment of this disclosure, the preset number can be represented as K. Based on the current vertex coordinates of the vertex and the corresponding point coordinates of each virtual point cloud point, multiple distances between the vertex and each virtual point cloud point can be calculated. These multiple distances are then sorted by size, and the K smallest distances are selected to obtain the K nearest neighbors. The K virtual point cloud points that correspond one-to-one with the selected K distances can be used as the preset number of neighboring points determined in step 3403. The K nearest neighbor algorithm is used to determine the preset number of neighboring points.
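The neighbor selection described in this paragraph can be sketched as follows, assuming Euclidean distance (the disclosure does not name the distance metric) and a hypothetical helper name `k_nearest_points`.

```python
import numpy as np

def k_nearest_points(vertex, points, k):
    """Return the indices of the k virtual point cloud points closest to a vertex.

    vertex: (3,) current vertex coordinates; points: (Z, 3) point coordinates.
    """
    dists = np.linalg.norm(points - vertex, axis=1)  # distance to every point
    return np.argsort(dists)[:k]                     # indices of the k smallest

points = np.array([[0.0, 0.0, 0.0],
                   [1.0, 0.0, 0.0],
                   [0.0, 2.0, 0.0],
                   [5.0, 5.0, 5.0]])
idx = k_nearest_points(np.array([0.1, 0.0, 0.0]), points, k=2)
```

A full sort is used here for clarity; for large point clouds a partial selection such as `np.argpartition` or a spatial index would avoid sorting all Z distances.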

[0118] Step 3405: Using the point features and point coordinates of each of the preset number of neighboring points, together with the current vertex coordinates and the first reference vertex feature of the vertex, obtain a second reference vertex feature, and update the current vertex feature of the vertex using the second reference vertex feature.

[0119] In an optional embodiment of this disclosure, the second reference vertex features can be calculated using the following formula:

[0120]

[0121]

[0122] Here, the quantity on the left-hand side of the formula denotes the second reference vertex feature; f denotes the set of K nearest-neighbor virtual point cloud points; sfm denotes the softmax (normalization) operation; α, β, γ, ψ, and θ denote learnable multilayer perceptron (MLP) networks; ⊙ denotes the Hadamard product of matrices (i.e., element-wise multiplication); $f_j$ denotes the point feature of the j-th of the K neighboring virtual point cloud points; $q'_i$ denotes the first reference vertex feature; $v_i$ denotes the current vertex coordinates of the vertex; and the remaining term denotes the average coordinates obtained by averaging the point coordinates of the K neighboring virtual point cloud points.

[0123] By substituting the point features, point coordinates, current vertex coordinates, and first reference vertex features of a preset number of neighboring points into the above formula for calculation, the second reference vertex features can be obtained efficiently and reliably. The current vertex features of the vertex can then be updated to the second reference vertex features.

[0124] Of course, the methods for determining the features of the second reference vertex are not limited to this. For example, other networks besides MLPs can also be used to determine the features of the second reference vertex. Furthermore, the formula for obtaining the features of the second reference vertex can be multiplied or divided by a preset coefficient on the right side of the equals sign.

[0125] Step 3407: Based on the current vertex coordinates and current vertex features of the vertex, obtain the reference vertex coordinates, and update the current vertex coordinates of the vertex using the reference vertex coordinates.

[0126] In an optional embodiment of this disclosure, the coordinates of the reference vertex can be calculated using the following formula:

[0127]

[0128] Here, the quantity on the left-hand side of the formula denotes the reference vertex coordinates, $v_i$ denotes the current vertex coordinates of the vertex, ffn denotes the feedforward network, and the remaining symbol denotes the current vertex feature of the vertex after it has been updated using the second reference vertex feature.

[0129] By substituting the current vertex coordinates and current vertex features into the above formula for calculation, the reference vertex coordinates can be obtained efficiently and reliably. The current vertex coordinates of the current vertex can then be updated to the reference vertex coordinates.
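Paragraph [0137] describes this step as summing a correction amount, computed by the feedforward network from the updated vertex features, with the current vertex position. A minimal sketch under that reading follows; the two-layer ReLU structure and the parameter names `W1`, `b1`, `W2`, `b2` are illustrative assumptions, not taken from the disclosure.

```python
import numpy as np

def ffn(q, W1, b1, W2, b2):
    """Two-layer feedforward network with ReLU mapping features to 3D offsets."""
    return np.maximum(q @ W1 + b1, 0.0) @ W2 + b2

def update_coords(v, q, params):
    """Reference vertex coordinates = current coordinates + ffn(updated features)."""
    return v + ffn(q, *params)

rng = np.random.default_rng(0)
C, hidden = 8, 16                      # hypothetical feature and hidden widths
params = (rng.normal(size=(C, hidden)), np.zeros(hidden),
          rng.normal(size=(hidden, 3)) * 0.01, np.zeros(3))
v = rng.normal(size=(5, 3))            # current vertex coordinates
q = rng.normal(size=(5, C))            # updated vertex features
v_new = update_coords(v, q, params)
```

Predicting an offset rather than absolute coordinates keeps each iteration a small refinement of the previous estimate.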

[0130] Of course, the methods for determining the reference vertex coordinates are not limited to this. For example, other networks besides feedforward networks can also be used to determine the reference vertex coordinates. Furthermore, the right side of the equals sign in the above formula for the reference vertex coordinates can be multiplied or divided by a preset coefficient.

[0131] Here, a cross-attention mechanism is introduced. Since the neighboring virtual point cloud points can represent the features of the current vertex well and are stable, the vertex features and vertex coordinates of the vertex are updated using the K-nearest neighbor virtual point cloud data of the vertex.

[0132] The embodiments of this disclosure employ a self-attention mechanism. Guided by a hand template, a first reference vertex feature conforming to relative position constraints can be efficiently and reliably obtained from the current vertex feature. A cross-attention mechanism, using the K-nearest neighbor algorithm, further refines the first reference vertex feature by incorporating the point features and coordinates of neighboring virtual point cloud points, resulting in a second reference vertex feature. This second reference vertex feature is then used to update the current vertex coordinates, thus ensuring the accuracy and reliability of the updated current vertex coordinates.

[0133] Of course, the way of using the hand template and the point coordinates and point features of the multiple virtual point cloud points to correct the initial vertex features and initial vertex coordinates of the vertex, so as to obtain the corrected vertex features and corrected vertex coordinates, is not limited to the embodiment illustrated in Figure 6. For example, after determining the first reference vertex feature of a vertex based on the hand template and the current vertex feature of that vertex, the current vertex feature of that vertex can be directly updated using the first reference vertex feature. Then, combined with the current vertex coordinates of that vertex, the reference vertex coordinates are obtained, and the current vertex coordinates of that vertex are updated using the reference vertex coordinates.

[0134] In the embodiments of this disclosure, by referring to the relative positional relationships defined by the hand template, and the point coordinates and point features of each of the multiple virtual point cloud points, the vertex data in the hand mesh can be effectively corrected through at least one iteration to ensure the correction effect.

[0135] As shown in Figure 7-1, the method in the embodiments of this disclosure may include the following stages.

[0136] In the first stage, three hand images can be captured by three cameras, and a first CNN (corresponding to the convolutional neural network (CNN) in Figure 7-1) can be used to extract features from the three hand images to obtain three image features. For each image feature, a second CNN can be used to calculate the coordinates of the corresponding 2D hand keypoints in each image; for example, the corresponding heatmap can be computed and the heatmap extremum method applied to it. Combining the parameters of the three cameras with these heatmaps, 3D hand keypoint data x can be obtained using methods such as DLT (Direct Linear Transform). Of course, the heatmap extremum method can be replaced by probabilistic graphical methods, direct regression methods, and the like, and DLT can be replaced by triangulation, depth models, and the like, to recover depth information from the 2D hand keypoints.
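As an illustration of the DLT step, the sketch below triangulates a single 3D keypoint from its 2D positions in several views. It assumes each camera is described by a 3×4 projection matrix P = K [R | t] and that exact pixel coordinates are already available; the camera values and the helper name `triangulate_dlt` are made up for the example.

```python
import numpy as np

def triangulate_dlt(proj_mats, pixels):
    """Triangulate one 3D point from N views with the Direct Linear Transform.

    proj_mats: list of (3, 4) projection matrices; pixels: list of (u, v).
    Each view contributes two linear equations in the homogeneous point X.
    """
    rows = []
    for P, (u, v) in zip(proj_mats, pixels):
        rows.append(u * P[2] - P[0])
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                 # (2N, 4)
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # right singular vector of smallest singular value
    return X[:3] / X[3]                # dehomogenize

K = np.array([[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]])
translations = [np.array([0.0, 0.0, 0.0]),
                np.array([-0.2, 0.0, 0.0]),
                np.array([0.0, -0.2, 0.0])]
proj_mats = [K @ np.hstack([np.eye(3), t[:, None]]) for t in translations]

X_true = np.array([0.1, -0.2, 2.0])    # synthetic keypoint in the reference frame
pixels = []
for P in proj_mats:
    x = P @ np.append(X_true, 1.0)
    pixels.append(x[:2] / x[2])

X_est = triangulate_dlt(proj_mats, pixels)
```

With noisy 2D keypoints the same least-squares solution still applies, and the residual gives a rough indication of how consistent the views are.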

[0137] In the second stage, based on the hand keypoint data, the position coordinates v and vertex features q of each vertex in the hand mesh can be determined. In addition, the point coordinates p and point features f of the multiple virtual point cloud points can be determined. Using multiple decoders (e.g., L decoders), the position coordinates v and vertex features q of each vertex can be iteratively optimized. Each decoder can run a self-attention mechanism, a cross-attention mechanism, and a feedforward network. Through the self-attention mechanism, the vertex features q can be decoded (corresponding to the formula for the first reference vertex feature in step 3401 above). Through the cross-attention mechanism, the vertex features q can be corrected and updated using the point coordinates p and point features f of the virtual point cloud points together with the position coordinates v of the vertices (corresponding to the formula for the second reference vertex feature in step 3405 above). Optionally, in the cross-attention mechanism, only the point coordinates p and point features f of the K nearest-neighbor virtual point cloud points are used for the vertex features q of each vertex, together with the average coordinates obtained by averaging the point coordinates of those K points. Through the feedforward network, a vertex coordinate correction amount is calculated from the corrected vertex features. The vertex coordinate correction amount is summed with the current vertex position to obtain the updated vertex coordinates (corresponding to the formula for the reference vertex coordinates in step 3407 above). After passing through one decoder, updated position coordinates $v_{\text{updated}}$ are output. The specific iterative optimization process is shown in Figure 7-2: one iteration of optimization (e.g., the $l$-th iteration) yields updated position coordinates $v_l$ and vertex features $q_l$ (corresponding to $v_l$ and $q_l$ in Figure 7-2), and the next iteration (e.g., the $(l+1)$-th iteration) can use the updated position coordinates $v_l$ and vertex features $q_l$ as input to the self-attention mechanism.

[0138] In an alternative embodiment of this disclosure, in step 330, steps 3301, 3303, 3305, 3307, 3309, and 3311 shown in Figure 8 can be performed for each hand image.

[0139] Step 3301: Divide the space in front of the camera that captures the hand image into a three-dimensional grid.

[0140] In one alternative embodiment of this disclosure, the three-dimensional grid includes multiple grid regions, each grid region having a corresponding position index.

[0141] In an optional embodiment of this disclosure, the cameras that capture the hand images can include camera 1 to camera N in Figure 9. For camera 1 among the N cameras, the space it faces when capturing a hand image is a frustum, whose top view can be seen as trapezoid 910 in Figure 9. The intersection of the spaces faced by the N cameras during shooting can be called the intersection space of the N camera frustums.

[0142] In an optional embodiment of this disclosure, for ease of calculation, the space faced by the camera capturing the hand image during shooting can be transformed into a regularly shaped space (hereinafter referred to as the target space) through shape adjustment processing (e.g., transformation, cropping). The size of the target space can be H4*W4*D, where H4 is the height of the target space, W4 is the width of the target space, and D can be a preset number of discrete depth values. H4 and W4 can be equal to H1 and W1 respectively, or they can be different from H1 and W1. By dividing the target space into grids, a three-dimensional grid comprising multiple grid regions can be formed. Each grid region can be a cube region with a length, width, and height of unit length. For each grid region, a position index can be determined based on its distribution position in the three-dimensional grid. For example, a target three-dimensional coordinate system can be constructed for the three-dimensional grid, and the 3D coordinates of the grid region in the target three-dimensional coordinate system can be determined, with the determined 3D coordinates used as the position index of the grid region. Of course, the position index can also be in the form of a sequence number, a vector, etc., which will not be listed here.

[0143] Step 3303: Based on the parameters of the camera that captured the hand image and the position index of each of the multiple grid regions in the three-dimensional grid, determine the position coordinates of the multiple grid regions in the reference coordinate system.

[0144] In one optional embodiment of this disclosure, the reference coordinate system may be the world coordinate system, or other coordinate systems rigidly connected to the world coordinate system, or a coordinate system rigidly connected to one of the multiple cameras.

[0145] In an optional embodiment of this disclosure, assuming the position index is in the form of 3D coordinates, in step 3303, for each of the multiple grid regions, the position coordinates of that grid region in the reference coordinate system can be calculated using the following formula:

[0146] A'=TK1A

[0147] Where A' represents the position coordinates of the grid region in the reference coordinate system, T represents the extrinsic parameter matrix of the camera that captured the hand image, K1 represents the intrinsic parameter matrix of the camera that captured the hand image, and A represents the position index of the grid region in the 3D grid.

[0148] It should be noted that if the position index is in the form of a sequence number, vector, etc., then in step 3303, for each grid region in multiple grid regions, the position index of the grid region can be converted into 3D coordinates first, and then the converted 3D coordinates can be substituted into the formula to calculate the position coordinates.
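The formula A' = T·K1·A can be applied to all grid regions at once. The sketch below takes the formula literally and makes several assumptions that the disclosure does not state: T and K1 are supplied as 4×4 homogeneous matrices, the position index of a grid region is ordered (width index, height index, depth index), and whichever forward or inverse camera convention is intended is already baked into the matrices passed in.

```python
import numpy as np

def grid_to_reference(T, K1, H4, W4, D):
    """Apply A' = T @ K1 @ A to every grid-region index A = (x, y, z, 1).

    T, K1: assumed (4, 4) homogeneous extrinsic and intrinsic matrices.
    Returns an (H4, W4, D, 3) array of position coordinates (the position encoding).
    """
    ys, xs, zs = np.meshgrid(np.arange(H4), np.arange(W4), np.arange(D),
                             indexing="ij")
    A = np.stack([xs, ys, zs, np.ones_like(xs)], axis=-1).astype(float)
    A_ref = A @ (T @ K1).T             # row-vector form of (T K1) A
    return A_ref[..., :3]

T = np.eye(4)
T[:3, 3] = [1.0, 2.0, 3.0]             # hypothetical extrinsics: pure translation
pos = grid_to_reference(T, np.eye(4), H4=2, W4=3, D=4)
```

Storing the result with the same H4×W4×D layout as the grid means the value at each position index is exactly the feature value assigned to that grid region in step 3305.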

[0149] Step 3305: Assign the position coordinates of each of the multiple grid regions as feature values to the corresponding grid regions to obtain the position encoding of the three-dimensional grid.

[0150] It should be noted that, through the feature value assignment operation, each of the multiple grid regions has not only 3D coordinates in the target three-dimensional coordinate system but also a feature value. The 3D coordinates can be regarded as three-dimensional information and the feature value as fourth-dimensional information; that is, each grid region carries four-dimensional information in total. The 3D coordinates and feature values of the multiple grid regions together form a four-dimensional position encoding, which is the position encoding of the three-dimensional grid.

[0151] Step 3307: Reshape the image features of the hand image to obtain reshaped image features (also called reconstructed image features below), so that the dimensionality of the reshaped image features is consistent with that of the position encoding.

[0152] In an optional embodiment of this disclosure, in step 3307, the image features of the hand image can be converted from three-dimensional to four-dimensional, and the conversion result can be used as the reshaped image features. Thus, if the size of the image features before reshaping is H2*W2*C, the size of the image features after reshaping can be expressed as H2*W2*(C / 3)*3.

[0153] Step 3309: Based on the location encoding and reconstructed image features, determine the point coordinates and point features of each point in the sub-virtual point cloud corresponding to the hand image.

[0154] In one optional embodiment of this disclosure, step 3309 includes:

[0155] For any location index: the location coordinates at that location index in the location encoding and the feature value at that location index in the reconstructed image features are fused to obtain the location embedding feature value at that location index. The location coordinates are the point coordinates of the virtual point cloud point corresponding to that location index, and the location embedding feature value is the point feature of the virtual point cloud point corresponding to that location index.

[0156] In one optional embodiment of this disclosure, assuming the position coordinates at a certain position index in the position encoding are (a, b, c), and the feature value at that position index in the reconstructed image features is (d, e, f), then the position embedding feature value at that position index can be (a+d, b+e, c+f). Therefore, (a, b, c) can be used as the point coordinates of the virtual point cloud point corresponding to that position index, and (a+d, b+e, c+f) can be used as the point feature of the virtual point cloud point corresponding to that position index.

[0157] The preceding paragraph describes fusing position coordinates and feature values by addition. In practice, the fusion can also be achieved through weighted fusion, mean fusion, or other operations.
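The reshaping of step 3307 and the additive fusion of step 3309 can be sketched together. The example assumes the position encoding and the reshaped image features already have the same size H2×W2×(C/3)×3 (i.e., any size alignment has been done); the function name `position_embedding` and the toy sizes are illustrative.

```python
import numpy as np

def position_embedding(F2d, M3d):
    """Reshape H×W×C image features to H×W×(C/3)×3 and add the position encoding.

    F2d: (H, W, C) image features; M3d: (H, W, C // 3, 3) position encoding.
    Returns (point_coords, point_feats), each of shape (H * W * C // 3, 3).
    """
    H, W, C = F2d.shape
    assert C % 3 == 0, "channel count must be divisible by 3"
    F2d_reshaped = F2d.reshape(H, W, C // 3, 3)
    assert F2d_reshaped.shape == M3d.shape, "sizes must be aligned first"
    F3d = M3d + F2d_reshaped           # additive fusion, e.g. (a+d, b+e, c+f)
    return M3d.reshape(-1, 3), F3d.reshape(-1, 3)

F2d = np.arange(2 * 2 * 6, dtype=float).reshape(2, 2, 6)
M3d = np.ones((2, 2, 2, 3))
coords, feats = position_embedding(F2d, M3d)
```

Swapping the `+` for a weighted sum or a mean gives the alternative fusion variants mentioned in the text.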

[0158] As shown in Figure 10, the image features before reshaping can be represented as $F_{2d}$, the reshaped image features can be represented as $F_{2d}$ in its reshaped form, and the position encoding can be represented as $M_{3d}$. By fusing the position encoding with the reshaped image features, the position embedding feature $F_{3d}$ can be obtained. $F_{3d}$ can include multiple position embedding feature values. Thus, by extracting the elements at corresponding positions from $M_{3d}$ and $F_{3d}$, the point coordinates and point features of the same virtual point cloud point can be obtained. Using this position embedding aggregation, the point coordinates and point features of the same virtual point cloud point can be obtained efficiently and reliably.

[0159] Step 3311: Merge the sub-virtual point clouds corresponding to each hand image to obtain virtual point cloud data.

[0160] In an optional embodiment of this disclosure, for each of the N hand images, the point coordinates and point features of several virtual point cloud points can be determined. These point coordinates and point features can form a sub-virtual point cloud corresponding to that hand image. The set of sub-virtual point clouds corresponding to the N hand images can be used as the virtual point cloud data. Thus, the sampling space of the virtual point cloud data can be regarded as the union of the spaces faced by the N cameras during shooting in Figure 9. If the virtual point cloud includes Z virtual point cloud points, then Z = N*H4*W4*D.

[0161] In the embodiments of this disclosure, by dividing the space faced by the camera during shooting, a three-dimensional grid comprising multiple grid regions can be obtained efficiently and reliably. By utilizing the camera parameters (which reflect camera-specific information) and the position indexes corresponding to the grid regions, the position coordinates corresponding to the grid regions can be determined efficiently and reliably. Combined with the feature value assignment and image feature reshaping operations, the point coordinates and point features of the virtual point cloud points can be determined efficiently and reliably from the position encoding obtained through feature value assignment and the reshaped image features. This allows the sub-virtual point clouds to be merged subsequently to obtain the virtual point cloud data. Thus, through position embedding aggregation, virtual point cloud data can be effectively acquired.

[0162] In an optional embodiment of this disclosure, in response to the size of the reshaped image features being inconsistent with the size of the position encoding, before step 3309, the method provided by the embodiments of this disclosure further includes: performing a size alignment operation on the position encoding of the three-dimensional grid and the reshaped image features.

[0163] In one optional embodiment of this disclosure, if the size of the position encoding of the three-dimensional grid is larger than the size of the reshaped image features, the two sizes can be made consistent by downsampling the position encoding of the three-dimensional grid. If the size of the position encoding of the three-dimensional grid is smaller than the size of the reshaped image features, the two sizes can be made consistent by upsampling the position encoding of the three-dimensional grid.

[0164] After size alignment, the size of the reshaped image features is consistent with the size of the position encoding. The reshaped image features and the position encoding can then be fused correctly, ensuring that the position embedding feature $F_{3d}$ mentioned above is obtained successfully.

[0165] In an optional embodiment of this disclosure, step 330 includes steps 3313, 3315, and 3317 shown in Figure 11.

[0166] Step 3313: Based on the hand key point data, determine the position of the preset hand part in the reference coordinate system.

[0167] In one optional embodiment of this disclosure, the reference coordinate system can be the world coordinate system, or other coordinate systems rigidly connected to the world coordinate system, or a coordinate system rigidly connected to one of the multiple cameras. Preset hand parts include, but are not limited to, the base of the middle finger, the palm, and the wrist.

[0168] In one optional embodiment of this disclosure, the hand key point data may include the 3D coordinates of the hand key point that represents the preset hand part, and the spatial position indicated by these 3D coordinates can be used as the position of the preset hand part in the reference coordinate system.

[0169] Step 3315: Using the position of the preset hand part as the center, sample within a preset range in the reference coordinate system to obtain the point coordinates of each point in the virtual point cloud.

[0170] In an optional embodiment of this disclosure, a sphere with a preset radius, centered at the position of the preset hand part, can be determined in the reference coordinate system. The range including the surface and interior of the sphere can be used as the preset range. Thus, the preset range can be the spherical region 1310 shown in Figure 12, which can be regarded as an enclosing sphere region of the target hand.

[0171] In another optional embodiment of this disclosure, a cuboid with a preset length, preset width, and preset height, centered on a preset hand part, can be determined in a reference coordinate system. The range including the surface and interior of the cuboid can be used as the preset range.

[0172] In both of the above optional implementations, random uniform sampling can be performed within a preset range, or sampling can be performed according to certain sampling rules, thereby obtaining S sampling points. Each sampling point can be used to determine a point in the virtual point cloud (i.e., a virtual point cloud point). The coordinates of each point in the virtual point cloud are the coordinates of the corresponding sampling point in the reference coordinate system. If the density of the virtual point cloud points is basically the same, since S is the number of virtual point cloud points within the sphere or cuboid area covering the hand, the number of points S is much smaller than Z. Therefore, when performing step 340, the computational load is smaller and the speed is faster.
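Random uniform sampling inside the spherical preset range can be sketched as follows. The radius, center, and sample count are hypothetical values; the cube-root scaling of the radius is a standard way to obtain uniform density over the volume and is not specified in the disclosure.

```python
import numpy as np

def sample_in_sphere(center, radius, S, seed=0):
    """Draw S points uniformly from the ball of given radius around `center`."""
    rng = np.random.default_rng(seed)
    direction = rng.normal(size=(S, 3))
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)  # unit vectors
    # The cube root makes the density uniform in volume rather than in radius.
    r = radius * rng.random(S) ** (1.0 / 3.0)
    return center + direction * r[:, None]

center = np.array([0.0, 0.1, 0.5])     # hypothetical position of the preset hand part
points = sample_in_sphere(center, radius=0.12, S=1024)
```

For the cuboid variant, `rng.uniform` over the three side lengths would give the corresponding uniform sample.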

[0173] Step 3317: Based on the parameters of each of the multiple cameras, the point coordinates of each point in the virtual point cloud, and the image features of each of the multiple hand images, determine the point features of each point in the virtual point cloud.

[0174] In one optional embodiment of this disclosure, step 3317 includes:

[0175] For any point in the virtual point cloud: For each hand image, using the parameters of the camera that captured the hand image and the point's coordinates, project the point onto the hand image to determine the projection point of the point on the image features of the hand image, thus obtaining the point's features in the hand image. Fuse the features of the point in multiple hand images to determine the point's feature.

[0176] In one optional embodiment of this disclosure, any point in the virtual point cloud can be considered a point in the reference coordinate system. Since the camera parameters can include intrinsic and extrinsic parameters, for each of the N hand images, the intrinsic and extrinsic parameters of the camera corresponding to that hand image can be used to project the point efficiently and reliably from the reference coordinate system to the pixel coordinate system corresponding to the camera (see the direction indicated by arrow 1320 in Figure 12), thereby determining the projection point. Given the projection point, the feature of the point in the hand image can be obtained by bilinear interpolation of the values in the neighborhood of the projection point. In this way, the features of the point in each of the N hand images can be obtained.
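The projection and bilinear interpolation described here can be sketched as below. The sketch assumes a pinhole model with intrinsics K and extrinsics (R, t) mapping the reference frame to the camera frame, and that the feature map has the same resolution as the pixel grid (a real feature map is often downsampled, which would require rescaling u and v); the helper names are illustrative.

```python
import numpy as np

def project(point, K, R, t):
    """Project a 3D point in the reference frame to pixel coordinates (u, v)."""
    x = K @ (R @ point + t)
    return x[:2] / x[2]

def bilinear_sample(feat, u, v):
    """Bilinearly interpolate an (H, W, C) feature map at continuous (u, v)."""
    H, W, _ = feat.shape
    u = float(np.clip(u, 0, W - 1))
    v = float(np.clip(v, 0, H - 1))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, W - 1), min(v0 + 1, H - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * feat[v0, u0] + du * feat[v0, u1]
    bottom = (1 - du) * feat[v1, u0] + du * feat[v1, u1]
    return (1 - dv) * top + dv * bottom

# Toy feature map whose two channels store the column and row index.
ys, xs = np.meshgrid(np.arange(4), np.arange(5), indexing="ij")
feat = np.stack([xs, ys], axis=-1).astype(float)      # (4, 5, 2)

K = np.array([[2.0, 0.0, 1.0], [0.0, 2.0, 1.0], [0.0, 0.0, 1.0]])
uv = project(np.array([1.0, 1.0, 2.0]), K, np.eye(3), np.zeros(3))
sampled = bilinear_sample(feat, 1.5, 2.25)
```

Repeating the two calls for each of the N cameras yields the N per-image features of one virtual point cloud point, which are then passed to the fusion step.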

[0177] In one optional embodiment of this disclosure, the point features are determined by fusing the features of the point in multiple hand images, including:

[0178] From multiple hand images, the pixel coordinate system of a reference hand image is selected as the baseline coordinate system. The reference hand image can be any one of the multiple hand images.

[0179] Using the parameters of the multiple cameras, the features of this point in the hand images other than the reference hand image are transformed to the baseline coordinate system, thus obtaining the transformed features of this point in the other hand images.

[0180] The features of this point in the reference hand image and the corresponding transformation features in other hand images are fused together to obtain the point features.

[0181] In one optional embodiment of this disclosure, one hand image can be randomly selected from the N hand images as the reference hand image, and the pixel coordinate system of the reference hand image can be used as the baseline coordinate system. Of course, the method for determining the reference hand image is not limited to this. For example, the quality of the N hand images can be evaluated, and the hand image with the best quality can be selected as the reference hand image.

[0182] In one optional embodiment of this disclosure, the parameters of any camera may include its intrinsic and extrinsic parameters. Based on the parameters of the N cameras, the transformation relationship between the pixel coordinate systems corresponding to the N cameras can be determined (which may be in the form of a transformation matrix). Based on the determined transformation relationship, the features of the point in the hand images other than the reference hand image can be uniformly transformed to the baseline coordinate system, thereby obtaining N-1 transformed features.

[0183] In an optional embodiment of this disclosure, the N-1 transformed features and the feature of the point in the reference hand image can be fused using a feature fusion algorithm. The specific process of the feature fusion algorithm is shown in Figure 13. In Figure 13, the S×256 feature at the bottom can represent the features of each sampling point (i.e., virtual point cloud point) in the reference hand image; the S×(N-1)×256 feature at the bottom can represent the transformed features of each sampling point in the N-1 hand images other than the reference hand image; and the S×256 feature at the top can represent the point features of all sampling points. As can be seen from Figure 13, the feature fusion algorithm can use parameter-shared MLPs (corresponding to the multilayer perceptrons in Figure 13) to process the sampled features, which are then fused by matrix multiplication. Finally, an MLP can be used for feature mapping to obtain the point features.

[0184] In this embodiment, for any point in the virtual point cloud, the point can be projected back to N image features based on camera parameters. Then, using algorithms such as bilinear interpolation, the individual features of that point in each of the N hand images are determined, allowing for subsequent feature fusion. During the fusion process, all features can be transformed to a reference coordinate system before fusion to ensure effective feature fusion. In this way, by using a projection aggregation method, suitable point features are determined for that point in the virtual point cloud, ensuring a good level of accuracy and reliability of the point features.
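As a rough illustration of the projection and interpolation step, the following sketch projects one virtual point into each view with a pinhole camera model and reads its feature by bilinear interpolation. The function names, the (K, R, t) camera representation, and the assumption that the intrinsics K are already scaled to the feature-map resolution are illustrative choices, not details given in this disclosure.

```python
import numpy as np

def project_point(point_3d, K, R, t):
    """Project a 3D point to pixel coordinates (u, v) with a pinhole model."""
    cam = R @ point_3d + t        # world frame -> camera frame
    uvw = K @ cam                 # camera frame -> homogeneous pixel coordinates
    return uvw[:2] / uvw[2]

def bilinear_sample(feature_map, uv):
    """Sample a (C, H, W) feature map at the sub-pixel location uv = (u, v)."""
    _, h, w = feature_map.shape
    u = float(np.clip(uv[0], 0.0, w - 1.0))   # clamp to the map borders
    v = float(np.clip(uv[1], 0.0, h - 1.0))
    u0, v0 = int(np.floor(u)), int(np.floor(v))
    u1, v1 = min(u0 + 1, w - 1), min(v0 + 1, h - 1)
    du, dv = u - u0, v - v0
    top = (1 - du) * feature_map[:, v0, u0] + du * feature_map[:, v0, u1]
    bottom = (1 - du) * feature_map[:, v1, u0] + du * feature_map[:, v1, u1]
    return (1 - dv) * top + dv * bottom

def per_view_features(point_3d, cameras, feature_maps):
    """Return an (N, C) array: the feature of one virtual point in each of N views."""
    return np.stack([
        bilinear_sample(fmap, project_point(point_3d, K, R, t))
        for (K, R, t), fmap in zip(cameras, feature_maps)
    ])
```

The N per-view features returned by `per_view_features` are what a subsequent fusion step would combine into a single point feature.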

[0185] In practice, during the fusion process, it is not necessary to transform all features to the reference coordinate system before fusion. Instead, the features can be fused directly to improve the speed of point feature determination.

[0186] In the embodiments of this disclosure, based on key hand point data, the position of a preset hand part in a reference coordinate system can be determined efficiently and reliably. By sampling within a certain range in the reference coordinate system centered on the preset hand part position, the corresponding points in the reference coordinate system can be directly used as virtual point cloud points, thereby efficiently and reliably acquiring the point coordinates of the virtual point cloud points. Furthermore, by utilizing camera parameters, acquired point coordinates, and image features, the point features of the virtual point cloud points can be acquired efficiently and reliably. Thus, through projection aggregation, virtual point cloud data can be effectively acquired.
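The sampling step can be sketched as follows. This disclosure only states that sampling is performed within a certain range centered on the preset hand part; the uniform random sampling inside an axis-aligned cube, the `half_extent` parameter, and the function name are assumptions made for illustration.

```python
import numpy as np

def sample_virtual_points(center, half_extent, num_points, seed=0):
    """Sample point coordinates uniformly in a cube centered on a hand-part position.

    center:      (3,) position of the preset hand part in the reference frame
    half_extent: half of the cube's side length, in the same units as center
    num_points:  number S of virtual point cloud points to draw
    """
    rng = np.random.default_rng(seed)
    offsets = rng.uniform(-half_extent, half_extent, size=(num_points, 3))
    return np.asarray(center, dtype=float) + offsets

# Example: 1024 virtual points within 0.1 units of an assumed wrist position.
wrist = np.array([0.05, -0.02, 0.60])
virtual_points = sample_virtual_points(wrist, half_extent=0.1, num_points=1024)
```

A regular lattice of samples would satisfy the same description; random sampling is used here only because it is short to write.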

[0187] In an optional embodiment of this disclosure, the steps of determining hand key point data (e.g., step 310), determining vertex data in the hand mesh (e.g., step 320), determining virtual point cloud data (e.g., step 330), and correcting vertex data in the hand mesh (e.g., step 340) are all implemented by a target neural network.

[0188] The method provided in the embodiments of this disclosure may further include: training an initial neural network using a training set to obtain a target neural network.

[0189] In one optional embodiment of this disclosure, the training set includes multiple training samples, each training sample including: multiple training hand images, the ground truth value of the three-dimensional position of each sample vertex in the sample hand mesh corresponding to the multiple training hand images, and the ground truth value of the two-dimensional position of each sample vertex projected in each training hand image.

[0190] In one optional embodiment of this disclosure, the training loss of the initial neural network is: a three-dimensional loss value and a two-dimensional loss value for each sample vertex. The three-dimensional loss value is the deviation between the estimated three-dimensional position of each sample vertex output by the initial neural network after inputting a training sample and the true three-dimensional position in that training sample. The two-dimensional loss value is the deviation between the estimated two-dimensional position of each sample vertex output by the initial neural network after inputting a training sample and the true two-dimensional position in that training sample.

[0191] By running the target neural network obtained through training, steps 310, 320, 330, and 340 can be executed.

[0192] It should be noted that the initial neural network can be an untrained neural network or a trained but non-converged neural network. The target neural network can be a trained and converged neural network. Similar to the target neural network, the initial neural network can also be used to perform steps 310, 320, 330, and 340 described above.

[0193] In one optional embodiment of this disclosure, the number of training samples in the training set can be 3,000, 5,000, 10,000, 100,000, or other numbers. The number of training hand images in each training sample can be the same as the number of cameras used when obtaining the training set, and can be 2, 3, 4, 5, or other numbers, which will not be listed here.

[0194] In one optional embodiment of this disclosure, the ground truth two-dimensional positions in each of the multiple training samples can be calculated based on the ground truth three-dimensional position of each sample vertex and the parameters of the camera used to capture the corresponding training hand image.
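A minimal sketch of this calculation, assuming a pinhole model in which each camera is described by intrinsics K and extrinsics (R, t) that map world coordinates to camera coordinates; the function name and data layout are hypothetical.

```python
import numpy as np

def ground_truth_2d(vertices_3d, cameras):
    """Compute 2D labels for every view from 3D ground truth.

    vertices_3d: (V, 3) ground truth vertex positions
    cameras:     list of N tuples (K, R, t)
    returns:     (N, V, 2) pixel coordinates of each vertex in each training image
    """
    labels = []
    for K, R, t in cameras:
        cam = vertices_3d @ R.T + t          # world frame -> camera frame
        uvw = cam @ K.T                      # apply intrinsics
        labels.append(uvw[:, :2] / uvw[:, 2:3])   # perspective division
    return np.stack(labels)
```

Lens distortion is ignored in this sketch; a calibrated system with noticeable distortion would need the corresponding distortion model applied after the perspective division.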

[0195] In an optional embodiment of this disclosure, for each of the multiple training samples, the mean absolute error (MAE) loss function, the mean square error (MSE) loss function, or another loss function can be used to calculate the deviation between the estimated 3D position of each sample vertex output by the initial neural network after inputting the training sample and the true 3D position in the training sample. The calculated deviation can be called the 3D mesh vertex loss and can be denoted Lv3D.

[0196] In an optional embodiment of this disclosure, for each of the multiple training samples, the MAE loss function, MSE loss function or other loss function can be used to calculate the deviation between the two-dimensional position estimate of each sample vertex output by the initial neural network after inputting the training sample and the two-dimensional position true value in the training sample. The calculated deviation can be called the 2D reprojection loss, and the calculated deviation can be expressed as Lv2D.

[0197] By applying the aforementioned loss functions, the training loss of the initial neural network can be obtained. The training loss can include the Lv3D and Lv2D corresponding to each sample vertex. For each sample vertex, a comprehensive loss value can be obtained by summing the Lv3D and Lv2D corresponding to that vertex, or by computing a weighted sum of the two. Based on the comprehensive loss values corresponding to the sample vertices, methods such as stochastic gradient descent and steepest gradient descent can be used to adjust the internal parameters of the initial neural network until convergence, thereby obtaining the target neural network.
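The per-vertex comprehensive loss can be sketched as below using the MAE variant. The weights `w3d` and `w2d`, the averaging of the 2D error over all N views, and the function name are assumptions; this disclosure only states that Lv3D and Lv2D are summed or weighted and summed.

```python
import numpy as np

def comprehensive_vertex_loss(pred_3d, gt_3d, pred_2d, gt_2d, w3d=1.0, w2d=1.0):
    """Weighted sum of Lv3D and Lv2D for every sample vertex (MAE form).

    pred_3d, gt_3d: (V, 3) estimated and ground truth 3D vertex positions
    pred_2d, gt_2d: (N, V, 2) estimated and ground truth 2D positions in N views
    returns:        (V,) comprehensive loss value per vertex
    """
    l_v3d = np.abs(pred_3d - gt_3d).mean(axis=-1)       # mean over x, y, z
    l_v2d = np.abs(pred_2d - gt_2d).mean(axis=(0, 2))   # mean over views and u, v
    return w3d * l_v3d + w2d * l_v2d
```

Setting `w3d = w2d = 1.0` gives the plain sum; a scalar training objective could then be taken as the mean of the returned vector before backpropagation.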

[0198] In the embodiments of this disclosure, a neural network can be trained end-to-end using a training set comprising multiple training samples. During training, the loss of 3D Mesh vertices and the 2D reprojection loss can be referenced, and through gradient backpropagation, the final trained target neural network can accurately determine the hand shape.

[0199] In one embodiment of this disclosure, in order to implement the method provided by the embodiments of this disclosure, an image feature extraction module (such as the image feature extraction network mentioned above), a 2D to 3D upscaling module, a multi-view feature aggregation module, and a cross-point set Transformer module may be set up.

[0200] The image feature extraction module can be used to extract features from multi-view images (including N hand images) to obtain the N image features mentioned above.

[0201] The image feature extraction module can perform calculations based on N image features to obtain M 2D key points corresponding to each of the N image features.

[0202] The 2D-to-3D upscaling module can roughly obtain the 3D coordinates of each of the M hand keypoints based on the M 2D keypoints corresponding to the N image features, using the DLT method or a deep learning-based method to form hand keypoint data. This hand keypoint data can then be used to determine the initial vertex coordinates and initial vertex features of each vertex of the hand mesh.
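For the DLT option mentioned above, a minimal sketch of linear triangulation of one key point from its N 2D observations is given here. The 3×4 projection matrices are assumed to be precomputed from each camera's intrinsic and extrinsic parameters as P = K [R | t]; the function name is hypothetical.

```python
import numpy as np

def triangulate_dlt(points_2d, projections):
    """Triangulate one 3D point from N views with the direct linear transform.

    points_2d:   (N, 2) pixel coordinates of the same key point in each view
    projections: (N, 3, 4) camera projection matrices
    returns:     (3,) estimated 3D position
    """
    rows = []
    for (u, v), P in zip(points_2d, projections):
        rows.append(u * P[2] - P[0])   # each view contributes two linear constraints
        rows.append(v * P[2] - P[1])
    A = np.stack(rows)                 # (2N, 4) homogeneous system A X = 0
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]                         # right singular vector of the smallest singular value
    return X[:3] / X[3]
```

Applying `triangulate_dlt` once per key point yields the M rough 3D key point coordinates; at least two views with distinct camera centers are needed for a well-posed solution.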

[0203] The multi-view feature aggregation module is used to aggregate N image features based on the intrinsic and extrinsic parameters of N cameras to obtain the coordinates and features of a virtual point cloud in three-dimensional space (corresponding to the virtual point cloud data mentioned above).

[0204] The Cross Point Set Transformer module can iteratively optimize the vertex coordinates of the hand mesh based on the coordinates and features of the virtual point cloud in 3D space, thereby accurately determining the hand shape of the target hand.

[0205] The various embodiments, implementation methods, and examples disclosed herein can be implemented individually or in any combination without conflict. The specific implementation can be set according to actual needs, and this disclosure does not limit them.

[0206] Any of the methods for determining hand shape provided in the embodiments of this disclosure can be executed by any suitable device with data processing capabilities, including but not limited to: terminal devices and servers. Alternatively, any of the methods for determining hand shape provided in the embodiments of this disclosure can be executed by a processor, such as by a processor executing any of the methods for determining hand shape mentioned in the embodiments of this disclosure by calling corresponding instructions stored in memory. Further details will not be elaborated below.

[0207] Exemplary device

[0208] An exemplary embodiment of this disclosure provides an apparatus for determining hand shape.

[0209] As shown in Figure 14, the apparatus provided in the embodiments of this disclosure may include:

[0210] The first determining module 1410 is used to determine key hand point data using the image features of multiple hand images captured at the same time. The multiple hand images are obtained by multiple cameras positioned at different angles capturing the target hand. The key hand points include at least one point representing the location of the fingertips, knuckles, wrist, and palm.

[0211] The second determining module 1420 is used to determine the vertex data of each vertex in the hand mesh based on the hand key point data. The hand mesh includes a mesh structure distributed on the surface of the hand, and the mesh structure includes multiple vertices.

[0212] The third determining module 1430 is used to determine virtual point cloud data based on the parameters of multiple cameras and the image features of multiple hand images.

[0213] The fourth determination module 1440 is used to correct the vertex data in the hand mesh using virtual point cloud data in order to determine the hand shape of the target hand.

[0214] In one optional embodiment of this disclosure, the second determining module 1420 includes a first determining submodule and a second determining submodule;

[0215] For each vertex in the hand mesh:

[0216] The first determination submodule is used to determine the initial anchor point coordinates corresponding to the vertex based on the hand keypoint data. The initial anchor point coordinates are the coordinates of the hand keypoints closest to the vertex.

[0217] The second determining submodule is used to determine the offset of the vertex relative to the initial anchor point position coordinates, and apply the offset to the initial anchor point position coordinates corresponding to the vertex to obtain the initial vertex coordinates. The initial vertex characteristics of the vertex are randomly initialized values.
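The anchor-and-offset initialization can be sketched as follows. This disclosure does not state how the offset is obtained; the sketch assumes a hand template that provides canonical vertex and key point positions, and takes the offset from that template. The template arrays, the 256-dimensional feature size, and the function name are assumptions.

```python
import numpy as np

def init_vertices(template_vertices, template_keypoints, estimated_keypoints,
                  feature_dim=256, seed=0):
    """Initialise vertex coordinates and features from estimated hand key points.

    template_vertices:   (V, 3) vertex positions in an assumed canonical template
    template_keypoints:  (M, 3) key point positions in the same template
    estimated_keypoints: (M, 3) key point positions estimated from the images
    """
    # Anchor of each vertex: index of the nearest key point (measured in the template).
    dists = np.linalg.norm(
        template_vertices[:, None, :] - template_keypoints[None, :, :], axis=-1)
    anchor_idx = dists.argmin(axis=1)                             # (V,)
    # Offset of each vertex relative to its anchor, taken from the template.
    offsets = template_vertices - template_keypoints[anchor_idx]
    # Apply the offset to the estimated anchor position.
    init_coords = estimated_keypoints[anchor_idx] + offsets
    # Initial vertex features are randomly initialised values.
    rng = np.random.default_rng(seed)
    init_feats = rng.normal(size=(len(template_vertices), feature_dim))
    return init_coords, init_feats, anchor_idx
```

In a trained network the offsets could equally be learnable parameters; the template-derived offsets here are only one way to make the example concrete.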

[0218] In one optional embodiment of this disclosure, each vertex data includes: the initial vertex coordinates and initial vertex features of the vertex, and the virtual point cloud data includes: the point coordinates and point features of each of the multiple virtual point cloud points.

[0219] The fourth determining module 1440 is used to perform the following operation for any vertex: using the hand template, the point coordinates and point features of each of the multiple virtual point cloud points, the initial vertex features and initial vertex coordinates of the vertex are corrected to obtain the corrected vertex features and corrected vertex coordinates of the vertex. The hand template defines the relative positional relationships of each vertex in the hand mesh.

[0220] In an optional embodiment of this disclosure, the fourth determining module 1440 is used to perform at least one iterative update on the current vertex features and current vertex coordinates of the vertex, and each iterative update is implemented by the following module:

[0221] The third determination submodule is used to determine the first reference vertex feature of the vertex based on the hand template and the current vertex feature of the vertex. The initial value of the current vertex feature is the initial vertex feature.

[0222] The fourth determination submodule is used to determine a preset number of neighboring points among the multiple virtual point cloud points, based on the current vertex coordinates and the respective point coordinates of the multiple virtual point cloud points. The initial value of the current vertex coordinates is the initial vertex coordinates.

[0223] The first update submodule is used to obtain the second reference vertex feature by using the point features and point coordinates of each of the preset number of neighboring points, the current vertex coordinates of the vertex, and the first reference vertex feature, and then to update the current vertex feature of the vertex using the second reference vertex feature.

[0224] The second update submodule is used to obtain reference vertex coordinates based on the current vertex coordinates and current vertex features of the vertex, and then use the reference vertex coordinates to update the current vertex coordinates of the vertex.
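One iterative update, as carried out by the four submodules above, can be sketched in simplified form. The learned layers are replaced by stand-ins that are assumptions of this sketch: a row-normalised template adjacency matrix `template_adj` for the first reference feature, distance-weighted averaging instead of attention for the second reference feature, and a single linear map `w_out` from features to a coordinate offset. Point features and vertex features are assumed to share the same dimension.

```python
import numpy as np

def nearest_points(vertex_coords, point_coords, k):
    """Indices and distances, both (V, k), of the k closest cloud points per vertex."""
    dists = np.linalg.norm(
        vertex_coords[:, None, :] - point_coords[None, :, :], axis=-1)   # (V, S)
    idx = np.argsort(dists, axis=1)[:, :k]
    return idx, np.take_along_axis(dists, idx, axis=1)

def refine_step(vertex_coords, vertex_feats, point_coords, point_feats,
                template_adj, w_out, k=2):
    """One update of the current vertex features and current vertex coordinates."""
    # First reference feature: mix each vertex with its neighbours in the template.
    first_ref = template_adj @ vertex_feats
    # Preset number of neighbouring virtual point cloud points for each vertex.
    idx, dists = nearest_points(vertex_coords, point_coords, k)
    weights = np.exp(-dists)
    weights = weights / weights.sum(axis=1, keepdims=True)
    # Second reference feature: add distance-weighted neighbour point features.
    second_ref = first_ref + np.einsum("vk,vkc->vc", weights, point_feats[idx])
    new_feats = second_ref
    # Reference coordinates: current coordinates plus an offset read from the features.
    new_coords = vertex_coords + new_feats @ w_out
    return new_coords, new_feats
```

Calling `refine_step` repeatedly, feeding each output back in as the current coordinates and features, corresponds to performing more than one iterative update.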

[0225] In one optional embodiment of this disclosure, the third determining module 1430 includes a dividing submodule, a fifth determining submodule, an assignment submodule, a reshaping submodule, a sixth determining submodule, and a merging submodule.

[0226] For each hand image:

[0227] The dividing submodule is used to divide the space faced by the camera capturing the hand image into a three-dimensional grid. This three-dimensional grid comprises multiple grid regions, each with a corresponding position index.

[0228] The fifth determination submodule is used to determine the position coordinates of multiple grid regions in the reference coordinate system based on the parameters of the camera that captured the hand image and the position index of each grid region in the three-dimensional grid.

[0229] The assignment submodule is used to assign the position coordinates of the multiple grid regions as feature values to the corresponding grid regions, thereby obtaining the position encoding of the three-dimensional grid.

[0230] The reshaping submodule is used to reshape the image features of the hand image to obtain reshaped image features so that the reshaped image features are consistent with the dimension of the position encoding.

[0231] The sixth determination submodule is used to determine the point coordinates and point features of each point in the sub-virtual point cloud corresponding to the hand image based on the position encoding and the reshaped image features.

[0232] The merging submodule is used to merge the sub-virtual point clouds corresponding to each hand image to obtain virtual point cloud data.

[0233] In an optional embodiment of this disclosure, the sixth determining submodule is configured to, for any position index, fuse the position coordinates at that position index in the position encoding with the feature value at that position index in the reshaped image features to obtain a position embedding feature value at that position index. The position coordinates are the point coordinates of the virtual point cloud point corresponding to that position index, and the position embedding feature value is the point feature of the virtual point cloud point corresponding to that position index.
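The grid-based construction of a sub-virtual point cloud for one hand image can be sketched as follows. Several details are assumptions of this sketch rather than statements of this disclosure: the grid is indexed by (depth bin, row, column), the position coordinates are obtained by back-projecting each pixel to each of D candidate depths, the image feature with C channels is reshaped to (C/D, D, H, W), and "fusion" is implemented as simple concatenation. K is assumed to be expressed at feature-map resolution, and (R, t) to map reference coordinates to camera coordinates.

```python
import numpy as np

def grid_position_encoding(K, R, t, depths, height, width):
    """Return (3, D, H, W): reference-frame coordinates of every grid region."""
    v, u = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    pixels = np.stack([u, v, np.ones_like(u)], axis=0).reshape(3, -1).astype(float)
    rays = np.linalg.inv(K) @ pixels                          # (3, H*W), unit depth
    depths = np.asarray(depths, dtype=float)
    cam = rays[:, None, :] * depths[None, :, None]            # (3, D, H*W)
    ref = R.T @ (cam.reshape(3, -1) - t[:, None])             # camera -> reference
    return ref.reshape(3, len(depths), height, width)

def sub_virtual_point_cloud(image_feat, K, R, t, depths):
    """image_feat: (C, H, W) with C divisible by D; returns coords (P, 3), feats (P, C/D + 3)."""
    c, h, w = image_feat.shape
    d = len(depths)
    reshaped = image_feat.reshape(c // d, d, h, w)            # match the grid's layout
    pos = grid_position_encoding(K, R, t, depths, h, w)       # position encoding
    coords = pos.reshape(3, -1).T                             # one point per grid region
    feats = np.concatenate([reshaped.reshape(c // d, -1).T, coords], axis=1)
    return coords, feats
```

The virtual point cloud data would then be obtained by concatenating the `coords` and `feats` arrays returned for each of the N hand images.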

[0234] In an optional embodiment of this disclosure, the apparatus in the embodiments of this disclosure further includes: a size alignment module, configured to, in response to the size of the reshaped image features being inconsistent with the size of the position encoding, perform a size alignment operation on the position encoding of the three-dimensional grid and the reshaped image features before the point coordinates and point features of each point in the sub-virtual point cloud are determined.

[0235] In one optional embodiment of this disclosure, the third determining module 1430 includes:

[0236] The seventh determination submodule is used to determine the position of a preset hand part in the reference coordinate system based on the key hand point data.

[0237] The sampling submodule is used to sample within a preset range in the reference coordinate system, centered on the position of the preset hand part, to obtain the point coordinates of each point in the virtual point cloud.

[0238] The eighth determination submodule is used to determine the point features of each point in the virtual point cloud based on the parameters of multiple cameras, the point coordinates of each point in the virtual point cloud, and the image features of multiple hand images.

[0239] In an optional embodiment of this disclosure, the eighth determining submodule includes:

[0240] The projection unit is used to project any point in the virtual point cloud: for each hand image, using the parameters of the camera that captured the hand image and the point coordinates, the point is projected onto the hand image to determine the projection point of the point on the image features of the hand image, thereby obtaining the feature of the point in the hand image.

[0241] The fusion unit is used to fuse the features of the point in multiple hand images to determine the point features.

[0242] In one optional embodiment of this disclosure, the fusion unit includes:

[0243] The selection sub-unit is used to select the pixel coordinate system of a reference hand image as the reference coordinate system from multiple hand images. The reference hand image can be any one of the multiple hand images.

[0244] The transformation subunit is used to transform the features of the point in other hand images besides the reference hand image to the reference coordinate system using parameters from multiple cameras, so as to obtain the transformed features of the point in the other hand images respectively.

[0245] The fusion subunit is used to fuse the features of the point in the reference hand image with the corresponding transformation features in other hand images to obtain the point features.

[0246] In one optional embodiment of this disclosure, the first determining module, the second determining module, the third determining module, and the fourth determining module can all be implemented by a target neural network.

[0247] The apparatus provided in the embodiments of this disclosure further includes:

[0248] The training module is used to train the initial neural network using the training set to obtain the target neural network. The training set includes multiple training samples. Each training sample includes: multiple training hand images, the ground truth 3D position of each sample vertex in the sample hand mesh corresponding to the multiple training hand images, and the ground truth 2D position of each sample vertex projected into each training hand image.

[0249] The training loss of the initial neural network consists of two losses: a 3D loss value and a 2D loss value for each sample vertex. The 3D loss value is the deviation between the estimated 3D position of each sample vertex output by the initial neural network after inputting a training sample and the true 3D position in that training sample. The 2D loss value is the deviation between the estimated 2D position of each sample vertex output by the initial neural network after inputting a training sample and the true 2D position in that training sample.

[0250] In the apparatus disclosed herein, the various optional embodiments, optional implementation methods and optional examples disclosed above can be flexibly selected and combined as needed to achieve the corresponding functions and effects, and this disclosure does not list them all.

[0251] Exemplary electronic devices

[0252] An electronic device according to embodiments of the present disclosure is described below with reference to Figure 15. The electronic device may be either or both of a first device and a second device, or a standalone device independent of them, which may communicate with the first device and the second device to receive acquired input signals from them.

[0253] Figure 15 shows a block diagram of an electronic device 1500 according to an embodiment of the present disclosure.

[0254] As shown in Figure 15, the electronic device 1500 includes one or more processors 1510 and memory 1520.

[0255] Memory 1520 is used to store computer programs.

[0256] Processor 1510 is configured to execute a computer program stored in memory 1520, and when the computer program is executed, to implement the steps in the methods for determining hand shape according to various embodiments of the present disclosure described in the "Exemplary Methods" section above.

[0257] The processor 1510 may be a central processing unit (CPU) or other form of processing unit with data processing capabilities and / or instruction execution capabilities, and may control other components in the electronic device 1500 to perform desired functions.

[0258] The memory 1520 may include one or more computer programs and may include various forms of computer-readable storage media, such as volatile memory and / or non-volatile memory. The volatile memory may include, for example, random access memory (RAM) and / or cache memory. The non-volatile memory may include, for example, read-only memory (ROM), hard disk, flash memory, etc. One or more computer program instructions may be stored on the computer-readable storage medium, and the processor 1510 may execute the program instructions to implement the methods for determining hand shapes and / or other desired functions described in the various embodiments of this disclosure above.

[0259] In one example, the electronic device 1500 may also include an input device 1530 and an output device 1540, which are interconnected via a bus system and / or other forms of connection mechanism (not shown).

[0260] For example, when electronic device 1500 is a first device or a second device, the input device 1530 may be a microphone or a microphone array. When electronic device 1500 is a standalone device, the input device 1530 may be a communication network connector for receiving acquired input signals from the first device and the second device.

[0261] In addition, the input device 1530 may also include, for example, a keyboard, a mouse, etc.

[0262] The output device 1540 can output various information to the outside. The output device 1540 may include, for example, a display, a speaker, a printer, and a communication network and its connected remote output devices, etc.

[0263] Of course, for the sake of simplicity, Figure 15 shows only some of the components of the electronic device 1500 that are relevant to this disclosure, omitting components such as buses and input/output interfaces. In addition, the electronic device 1500 may include any other suitable components depending on the specific application.

[0264] Exemplary computer program products and computer-readable storage media

[0265] In addition to the methods and apparatus described above, embodiments of this disclosure may also be computer program products comprising computer program instructions that, when executed by a processor, implement the steps in the methods for determining hand shapes according to various embodiments of this disclosure as described in the "Exemplary Methods" section above.

[0266] The computer program product can be written in any combination of one or more programming languages to perform the operations of the embodiments of this disclosure. The programming languages include object-oriented programming languages such as Java and C++, as well as conventional procedural programming languages such as C or similar languages. The program code can be executed entirely on a user's computing device, partially on a user's computing device, as a standalone software package, partially on a user's computing device and partially on a remote computing device, or entirely on a remote computing device or server.

[0267] Furthermore, embodiments of this disclosure may also be computer-readable storage media storing computer program instructions that, when executed by a processor, implement the steps of the methods for determining hand shapes according to various embodiments of this disclosure as described in the "Exemplary Methods" section above.

[0268] The computer-readable storage medium may be any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may, for example, include, but is not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fibers, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.

[0269] The basic principles of this disclosure have been described above with reference to specific embodiments. However, it should be noted that the advantages, benefits, and effects mentioned in this disclosure are merely examples and not limitations, and should not be considered essential to every embodiment of this disclosure. Furthermore, the specific details disclosed above are provided only for illustration and ease of understanding, not as limitations; they do not mean that this disclosure must be implemented using those specific details.

[0270] The various embodiments in this specification are described in a progressive manner, with each embodiment focusing on its differences from other embodiments. For parts that are the same or similar between embodiments, the embodiments may be referred to one another. Because the electronic device embodiments largely correspond to the method embodiments, they are described relatively briefly; for the relevant parts, refer to the descriptions of the method embodiments.

[0271] The block diagrams of components, apparatuses, devices, and systems involved in this disclosure are merely illustrative examples and are not intended to require or imply that they must be connected, arranged, or configured in the manner shown in the block diagrams. As those skilled in the art will recognize, these components, apparatuses, devices, and systems can be connected, arranged, and configured in any manner. Words such as “comprising,” “including,” and “having” are open-ended terms meaning “including but not limited to” and can be used interchangeably with that phrase.

[0272] The methods and apparatus of this disclosure may be implemented in many ways. For example, they may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order of steps for the methods is for illustrative purposes only, and the steps of the methods of this disclosure are not limited to the order specifically described above, unless otherwise specifically stated.

[0273] It should also be noted that in the apparatus, devices, and methods of this disclosure, the components or steps can be disassembled and / or recombined. These disassemblies and / or recombinations should be considered as equivalent solutions to this disclosure.

[0274] The above description of the disclosed aspects is provided to enable any person skilled in the art to make or use this disclosure. Therefore, this disclosure is not intended to be limited to the aspects shown herein, but rather to be carried out within the widest scope consistent with the principles and novel features disclosed herein.

[0275] The above description has been given for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of this disclosure to the forms disclosed herein. Although numerous exemplary aspects and embodiments have been discussed above, those skilled in the art will recognize certain variations, modifications, alterations, additions, and sub-combinations therein.

Claims

1. A method for determining hand shape, comprising: determining hand key point data by using image features of multiple hand images captured at the same time, wherein the multiple hand images are obtained by multiple cameras positioned at different angles capturing a target hand, and the hand key points include at least one point used to characterize the location of a fingertip, knuckle, wrist, or palm; determining, based on the hand key point data, the vertex data of each vertex in a hand mesh, wherein the hand mesh includes a mesh structure distributed on the surface of the hand, and the mesh structure includes multiple vertices; determining virtual point cloud data based on the parameters of each of the multiple cameras and the image features of each of the multiple hand images; and correcting the vertex data in the hand mesh using the virtual point cloud data to determine the hand shape of the target hand; wherein determining the virtual point cloud data based on the parameters of each of the multiple cameras and the image features of each of the multiple hand images includes: performing the following operations for each hand image: dividing the space faced by the camera capturing the hand image into a three-dimensional grid, wherein the three-dimensional grid includes multiple grid regions, and each grid region has a corresponding position index; determining, based on the parameters of the camera that captured the hand image and the position index of each of the multiple grid regions in the three-dimensional grid, the position coordinates of the multiple grid regions in a reference coordinate system; assigning the position coordinates of each of the multiple grid regions as feature values to the corresponding grid regions to obtain the position encoding of the three-dimensional grid; reshaping the image features of the hand image to obtain reshaped image features, so that the reshaped image features are consistent with the dimension of the position encoding; and determining, based on the position encoding and the reshaped image features, the point coordinates and point features of each point in the sub-virtual point cloud corresponding to the hand image; and merging the sub-virtual point clouds corresponding to the respective hand images to obtain the virtual point cloud data.

2. The method according to claim 1, wherein the step of determining the vertex data in the hand mesh based on the hand key point data includes: for each vertex in the hand mesh: determining, based on the hand key point data, the initial anchor point position coordinates corresponding to the vertex, wherein the initial anchor point position coordinates are the position coordinates of the hand key point nearest to the vertex; and determining the offset of the vertex relative to the initial anchor point position coordinates, and applying the offset to the initial anchor point position coordinates corresponding to the vertex to obtain the initial vertex coordinates, wherein the initial vertex features of the vertex are randomly initialized values.

3. The method according to claim 1, wherein the vertex data of each vertex includes the initial vertex coordinates and initial vertex features of the vertex; the virtual point cloud data includes the point coordinates and point features of each of multiple virtual point cloud points; and correcting the vertex data in the hand mesh by using the virtual point cloud data includes, for any vertex: correcting the initial vertex features and initial vertex coordinates of the vertex by using a hand template and the point coordinates and point features of each of the multiple virtual point cloud points, to obtain the corrected vertex features and corrected vertex coordinates of the vertex, wherein the hand template defines the relative positional relationship of the vertices in the hand mesh.

4. The method according to claim 3, wherein correcting the initial vertex features and initial vertex coordinates of the vertex by using the hand template and the point coordinates and point features of each of the multiple virtual point cloud points, to obtain the corrected vertex features and corrected vertex coordinates, includes: iteratively updating the current vertex features and current vertex coordinates of the vertex at least once, wherein each iterative update includes: determining a first reference vertex feature of the vertex based on the hand template and the current vertex features of the vertex, wherein the initial value of the current vertex features is the initial vertex features; determining, among the multiple virtual point cloud points, a preset number of neighboring points adjacent to the vertex based on the current vertex coordinates of the vertex and the respective point coordinates of the multiple virtual point cloud points, wherein the initial value of the current vertex coordinates is the initial vertex coordinates; obtaining a second reference vertex feature by using the point features and point coordinates of each of the preset number of neighboring points, the current vertex coordinates of the vertex, and the first reference vertex feature, and updating the current vertex features of the vertex with the second reference vertex feature; and obtaining reference vertex coordinates based on the current vertex coordinates and current vertex features of the vertex, and updating the current vertex coordinates of the vertex with the reference vertex coordinates.
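
One iterative update of claim 4 might look like the sketch below. The claims do not say how the reference features are computed, so every operator here is an assumption chosen for illustration: the first reference feature is taken as an average over template neighbors (via a row-normalized adjacency matrix `adj_norm`), the second as a softmax-weighted sum over the k nearest virtual points, and the coordinate update as a linear map `w_offset` of the updated feature. All names are hypothetical.

```python
import numpy as np

def knn_indices(vertex_xyz, cloud_xyz, k):
    """Indices (V, k) of the k virtual points nearest to each vertex."""
    d2 = ((vertex_xyz[:, None, :] - cloud_xyz[None, :, :]) ** 2).sum(axis=-1)
    return np.argsort(d2, axis=1)[:, :k]

def refine_step(v_xyz, v_feat, adj_norm, cloud_xyz, cloud_feat, w_offset, k=4):
    """One update of vertex features (V, C) and coordinates (V, 3)."""
    # First reference feature: mix features along the template topology (assumed operator).
    ref1 = adj_norm @ v_feat
    # Preset number of neighboring virtual points for each vertex.
    idx = knn_indices(v_xyz, cloud_xyz, k)
    nb_xyz, nb_feat = cloud_xyz[idx], cloud_feat[idx]
    # Second reference feature: attention-like weights from feature similarity
    # and distance to the vertex (assumed operator).
    dist = np.linalg.norm(nb_xyz - v_xyz[:, None, :], axis=-1)
    score = (nb_feat * ref1[:, None, :]).sum(axis=-1) / np.sqrt(v_feat.shape[1]) - dist
    score -= score.max(axis=1, keepdims=True)
    weight = np.exp(score)
    weight /= weight.sum(axis=1, keepdims=True)
    ref2 = ref1 + (weight[..., None] * nb_feat).sum(axis=1)
    # Reference coordinates: current coordinates plus an offset decoded from the feature.
    new_xyz = v_xyz + ref2 @ w_offset
    return new_xyz, ref2
```

Calling `refine_step` repeatedly, feeding each output back in as the current coordinates and features, corresponds to updating "at least once" as recited in the claim.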

5. The method according to claim 1, wherein determining the point coordinates and point features of each point in the sub-virtual point cloud corresponding to the hand image based on the position encoding and the reshaped image features includes, for any position index: fusing the position coordinates at the position index in the position encoding with the feature value at the position index in the reshaped image features to obtain a position embedding feature value at the position index; wherein the position coordinates are the point coordinates of the virtual point cloud point corresponding to the position index, and the position embedding feature value is the point feature of the virtual point cloud point corresponding to the position index.
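
A minimal sketch of the per-index fusion in claim 5, together with the merging of sub-virtual point clouds from claim 1, is given below. The claims leave the fusion operator open; concatenating the coordinates onto the image feature is an assumption made here for simplicity, and the function names and array layouts are hypothetical.

```python
import numpy as np

def build_sub_point_cloud(pos_enc, img_feat):
    """pos_enc: (D, H, W, 3) position encoding; img_feat: (D, H, W, C) image
    features already brought to the same grid layout.

    Returns point coordinates (N, 3) and point features (N, C + 3), N = D*H*W.
    """
    if pos_enc.shape[:3] != img_feat.shape[:3]:
        raise ValueError("position encoding and image features must share a grid")
    xyz = pos_enc.reshape(-1, 3)                       # one point per position index
    feat = img_feat.reshape(-1, img_feat.shape[-1])
    # Fusion by concatenation (assumed operator): position embedding feature value.
    return xyz, np.concatenate([feat, xyz], axis=1)

def merge_sub_point_clouds(sub_clouds):
    """Stack the (xyz, feat) pairs obtained from the individual hand images."""
    xyz = np.concatenate([c[0] for c in sub_clouds], axis=0)
    feat = np.concatenate([c[1] for c in sub_clouds], axis=0)
    return xyz, feat
```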

6. The method according to claim 1, wherein, in response to the size of the reshaped image features being inconsistent with the size of the position encoding, the method further includes, before determining the point coordinates and point features of each point in the sub-virtual point cloud: performing a size alignment operation on the position encoding of the three-dimensional grid and the reshaped image features.

7. The method according to claim 1, wherein determining the virtual point cloud data based on the parameters of each of the multiple cameras and the image features of each of the multiple hand images includes: determining the position of a preset hand part in the reference coordinate system based on the hand key point data; sampling within a preset range in the reference coordinate system, centered on the position of the preset hand part, to obtain the point coordinates of each point in the virtual point cloud; and determining the point features of each point in the virtual point cloud based on the parameters of each of the multiple cameras, the point coordinates of each point in the virtual point cloud, and the image features of each of the multiple hand images.
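
The sampling step of claim 7 could be realized, for example, as a regular grid inside a cube around the preset hand part. The sketch below assumes exactly that; the cube shape, the half extent, and the number of samples per axis are hypothetical choices, and the claim would equally cover other sampling patterns.

```python
import numpy as np

def sample_virtual_points(center, half_extent, n_per_axis):
    """Sample n_per_axis**3 points on a regular grid inside a cube of side
    2 * half_extent centered on `center` (reference coordinate system)."""
    axis = np.linspace(-half_extent, half_extent, n_per_axis)
    gx, gy, gz = np.meshgrid(axis, axis, axis, indexing="ij")
    offsets = np.stack([gx, gy, gz], axis=-1).reshape(-1, 3)
    return np.asarray(center, dtype=float) + offsets
```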

8. The method according to claim 7, wherein determining the point features of each point in the virtual point cloud based on the parameters of each of the multiple cameras, the point coordinates of each point in the virtual point cloud, and the image features of each of the multiple hand images includes, for any point in the virtual point cloud: for each hand image, projecting the point onto the hand image by using the parameters of the camera that captured the hand image and the point coordinates, to determine the projection point of the point on the image features of the hand image and obtain the feature of the point in the hand image; and fusing the features of the point in the respective hand images to determine the point feature.
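
The projection and feature lookup in claim 8 can be sketched as follows. The sketch assumes a pinhole camera with x_cam = R @ x_ref + t, feature maps at image resolution (so the intrinsics apply to them directly), bilinear interpolation at the projection point, and fusion by averaging across views; none of these choices is stated in the claims, and points are assumed to lie in front of every camera.

```python
import numpy as np

def project(points, K, R, t):
    """Project (N, 3) reference-frame points to (N, 2) pixel coordinates."""
    cam = points @ R.T + t
    uvw = cam @ K.T
    return uvw[:, :2] / uvw[:, 2:3]

def bilinear_sample(feat, uv):
    """Sample a (H, W, C) feature map at (N, 2) positions (u = column, v = row)."""
    h, w, _ = feat.shape
    u = np.clip(uv[:, 0], 0, w - 1)
    v = np.clip(uv[:, 1], 0, h - 1)
    u0, v0 = np.floor(u).astype(int), np.floor(v).astype(int)
    u1, v1 = np.minimum(u0 + 1, w - 1), np.minimum(v0 + 1, h - 1)
    a, b = (u - u0)[:, None], (v - v0)[:, None]
    return ((1 - a) * (1 - b) * feat[v0, u0] + a * (1 - b) * feat[v0, u1]
            + (1 - a) * b * feat[v1, u0] + a * b * feat[v1, u1])

def point_features(points, cameras, feature_maps):
    """cameras: list of (K, R, t); feature_maps: list of (H, W, C) arrays.
    Fuses the per-view features by averaging (assumed fusion operator)."""
    per_view = [bilinear_sample(f, project(points, K, R, t))
                for (K, R, t), f in zip(cameras, feature_maps)]
    return np.mean(per_view, axis=0)
```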

9. The method according to claim 8, wherein fusing the features of the point in the respective hand images to determine the point feature includes: selecting, from the multiple hand images, the pixel coordinate system of a reference hand image as the reference coordinate system, wherein the reference hand image is any one of the multiple hand images; transforming, by using the parameters of the multiple cameras, the features of the point in the hand images other than the reference hand image into the reference coordinate system, to obtain the transformed features of the point in each of the other hand images; and fusing the feature of the point in the reference hand image with the corresponding transformed features in the other hand images to obtain the point feature.

10. The method according to any one of claims 1-9, wherein the steps of determining the hand key point data, determining the vertex data of each vertex in the hand mesh, determining the virtual point cloud data, and correcting the vertex data of each vertex in the hand mesh are all implemented by a target neural network, and the method further includes: training an initial neural network with a training set to obtain the target neural network; wherein the training set includes multiple training samples, each training sample including multiple training hand images, the ground-truth three-dimensional position of each sample vertex in the sample hand mesh corresponding to the multiple training hand images, and the ground-truth two-dimensional position of the projection of each sample vertex in each training hand image; and the training loss of the initial neural network comprises a three-dimensional loss value and a two-dimensional loss value of each sample vertex, wherein the three-dimensional loss value is the deviation between the three-dimensional position estimate of each sample vertex output by the initial neural network for an input training sample and the ground-truth three-dimensional position in that training sample, and the two-dimensional loss value is the deviation between the two-dimensional position estimate of each sample vertex output by the initial neural network for an input training sample and the ground-truth two-dimensional position in that training sample.
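
The training loss of claim 10 combines a three-dimensional and a two-dimensional deviation per sample vertex. The sketch below assumes a mean absolute (L1) deviation and a weighted sum of the two terms; the claim specifies neither the norm nor any weighting, so `w3d`, `w2d`, and the function name are hypothetical.

```python
import numpy as np

def training_loss(pred_xyz, gt_xyz, pred_uv, gt_uv, w3d=1.0, w2d=1.0):
    """pred_xyz, gt_xyz: (V, 3) estimated and ground-truth vertex positions.
    pred_uv, gt_uv: (num_views, V, 2) estimated and ground-truth projections.
    Returns the weighted sum of the mean L1 deviations (assumed loss form)."""
    loss_3d = np.abs(pred_xyz - gt_xyz).mean()
    loss_2d = np.abs(pred_uv - gt_uv).mean()
    return w3d * loss_3d + w2d * loss_2d
```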

11. An apparatus for determining hand shape, comprising: a first determining module configured to determine hand key point data by using image features of multiple hand images captured at the same time, wherein the multiple hand images are obtained by multiple cameras positioned at different angles capturing a target hand, and the hand key points include at least one point characterizing the position of the fingertips, knuckles, wrist, and palm; a second determining module configured to determine vertex data of each vertex in a hand mesh based on the hand key point data, wherein the hand mesh includes a mesh structure distributed over the surface of the hand, and the mesh structure includes multiple vertices; a third determining module configured to determine virtual point cloud data based on the parameters of each of the multiple cameras and the image features of each of the multiple hand images; and a fourth determining module configured to correct the vertex data in the hand mesh by using the virtual point cloud data, to determine the hand shape of the target hand; wherein the third determining module includes a division submodule, a fifth determining submodule, an assignment submodule, a reshaping submodule, a sixth determining submodule, and a merging submodule, and, for each hand image: the division submodule is configured to divide the space faced by the camera that captured the hand image into a three-dimensional grid, wherein the three-dimensional grid includes multiple grid regions and each grid region has a corresponding position index; the fifth determining submodule is configured to determine the position coordinates of the multiple grid regions in a reference coordinate system based on the parameters of the camera that captured the hand image and the position index of each of the multiple grid regions in the three-dimensional grid; the assignment submodule is configured to assign the position coordinates of each of the multiple grid regions, as feature values, to the corresponding grid region to obtain a position encoding of the three-dimensional grid; the reshaping submodule is configured to reshape the image features of the hand image to obtain reshaped image features whose dimensions are consistent with those of the position encoding; and the sixth determining submodule is configured to determine, based on the position encoding and the reshaped image features, the point coordinates and point features of each point in a sub-virtual point cloud corresponding to the hand image; and the merging submodule is configured to merge the sub-virtual point clouds corresponding to the respective hand images to obtain the virtual point cloud data.

12. An electronic device, comprising: a memory configured to store a computer program; and a processor configured to execute the computer program stored in the memory, wherein the computer program, when executed, implements the method for determining hand shape according to any one of claims 1 to 10.

13. A computer-readable storage medium having stored thereon computer program instructions, which, when executed by a processor, implement the method for determining hand shape according to any one of claims 1 to 10.

14. A computer program product comprising computer program instructions, which, when executed by a processor, implement the method for determining hand shape according to any one of claims 1 to 10.