A pose estimation method, related device and storage medium

By combining a Transformer encoder, graph convolution modules, and a regression module, and by using global features, local static features, and local dynamic features, the method addresses the low accuracy of estimating 3D human pose from 2D images and achieves higher pose estimation accuracy and better generalization.

CN115273228B · Active · Publication Date: 2026-06-30 · PENG CHENG LAB

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
PENG CHENG LAB
Filing Date
2022-07-15
Publication Date
2026-06-30


Abstract

This invention discloses a pose estimation method, related apparatus, and storage medium. The method includes: a Transformer encoder learning global features between keypoints of a target object; a first graph convolution module determining local static features of the target object based on a first adjacency matrix and the global features; the first adjacency matrix being determined based on the physical connections between keypoints of the target object; a second graph convolution module determining local dynamic features based on a second adjacency matrix and the local static features; the second adjacency matrix being determined based on a nearest neighbor algorithm and the sparse dynamic connections between keypoints of the target object; and a regression module determining the three-dimensional coordinates of the keypoints of the target object based on the local dynamic features. The technical solution provided by this invention improves the accuracy of pose estimation of a target object based on a two-dimensional image and has good generalization ability.

Description

Technical Field

[0001] This invention relates to the field of computer vision technology, and in particular to a pose estimation method, related apparatus, and storage medium.

Background Technology

[0002] Pose estimation is a branch of computer vision that aims to estimate the coordinates of key points of a target object in three-dimensional space from a two-dimensional image or video. Common pose estimation methods include human pose estimation, half-body pose estimation, hand pose estimation, and animal pose estimation. Human pose estimation has broad application prospects in fields such as surveillance, motion analysis, film and animation modeling, virtual reality, and medicine, and its accuracy has a significant impact on the effectiveness of downstream tasks.

[0003] In different motion scenarios, the human body can present various poses in three-dimensional space. Due to factors such as depth ambiguity and occlusion, estimating three-dimensional human pose from two-dimensional images suffers from low accuracy. Depth ambiguity means that multiple different three-dimensional human poses may project onto the same two-dimensional human pose, because it is difficult to infer the spatial location of key points far from the body's center of gravity from a single-view, monocular two-dimensional image. Occlusion arises because the human body is composed of articulated joints, which inevitably leads to self-occlusion of some joints. Furthermore, depending on the environment, surrounding objects may also obscure some joint points.
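The depth ambiguity described above can be illustrated with a small sketch. It assumes a simple pinhole camera with a hypothetical focal length of 1; the function name `project` and the sample coordinates are illustrative and are not taken from the patent.

```python
def project(point, focal=1.0):
    """Pinhole projection of a 3D point (x, y, z) onto the image plane."""
    x, y, z = point
    return (focal * x / z, focal * y / z)

# Two candidate 3D positions for the same joint, on the same viewing ray.
near_joint = (0.5, 1.0, 2.0)
far_joint = (1.0, 2.0, 4.0)   # twice as far from the camera

# Both positions land on the same 2D image point, so the 2D observation
# alone cannot tell them apart.
same_projection = project(near_joint) == project(far_joint)
```

Any point along the same viewing ray gives the same 2D key point, which is why additional structural cues between key points are needed to recover depth.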

[0004] Therefore, how to improve the accuracy of pose estimation of target objects based on two-dimensional images is a problem that needs to be solved by existing technologies.

Summary of the Invention

[0005] This invention discloses a pose estimation method, related apparatus, and storage medium, which can improve the accuracy of pose estimation of target objects based on two-dimensional images.

[0006] In a first aspect, embodiments of the present invention provide a pose estimation method in which a pose estimation model performs three-dimensional pose estimation on a target object in a target image. The target image is a two-dimensional image of the target object. The pose estimation model includes a Transformer encoder, a first graph convolution module, a second graph convolution module, and a regression module. The method includes: the Transformer encoder learning global features between key points of the target object; the first graph convolution module determining local static features of the target object based on a first adjacency matrix and the global features, the first adjacency matrix being determined based on the physical connection relationships between key points of the target object; the second graph convolution module determining local dynamic features based on a second adjacency matrix and the local static features, the second adjacency matrix being determined based on a nearest neighbor algorithm and the sparse dynamic connection relationships between all key points of the target object; and the regression module determining the three-dimensional coordinates of the key points of the target object based on the local dynamic features.

[0007] When the pose estimation method provided in this embodiment of the invention is used, the three-dimensional coordinates of the key points of the target object are determined based on the global features of the key points, the local static features corresponding to the physical connections, and the local dynamic features corresponding to the sparse dynamic relationships. Compared with the prior art, the accuracy of pose estimation using the technical solution provided in this embodiment of the invention is significantly improved, and the solution also has good generalization ability.

[0008] In conjunction with the first aspect, in some possible implementations, the Transformer encoder learns global features between key points of the target object, including: the Transformer encoder maps the coordinates of the key points of the target object to the latent space through a linear transformation, while maintaining the spatial information of the key points using learnable spatial location encoding, and then integrates the information of all the key points through a multi-head self-attention layer (MSA) and a feedforward network (FFN) to obtain the global features of all key points of the target object, and the calculation formula includes:

[0009] $X'^{(l)} = X^{(l-1)} + \mathrm{MSA}(\mathrm{LN}(X^{(l-1)}))$,

[0010] $X^{(l)} = X'^{(l)} + \mathrm{FFN}(\mathrm{LN}(X'^{(l)}))$,

[0011] where LN(·) denotes layer normalization, $l \in [1, \ldots, L]$ denotes the layer index, $\mathrm{LN}(X^{(l-1)})$ and $\mathrm{LN}(X'^{(l)})$ are the feature vectors after layer normalization, $X'^{(l)}$ is the latent feature vector of layer l, and $X^{(l)}$ is the feature vector output by layer l, representing the global features of layer l.

[0012] In conjunction with the first aspect, in some possible implementations, the first adjacency matrix is A1, $A1 \in \mathbb{R}^{J \times J}$, where J is the total number of key points included in the target image.

[0013]

[0014] In conjunction with the first aspect, in some possible implementations, the first graph convolution module determines the local static features of the target object based on the first adjacency matrix and the global features, including: the first graph convolution module performs graph convolution operations using Chebyshev polynomials as the convolution kernel based on the first adjacency matrix and the global features.

[0015] The Chebyshev polynomials include:

[0016]

[0017]

[0018]

[0019] Using the Chebyshev polynomial as the convolution kernel in the graph convolution operation, local static features are obtained through the graph convolution operation.

[0020]

[0021]

[0022] where $T_m(\tilde{L})$ denotes the Chebyshev polynomial of degree m, $\tilde{L}$ is the normalized Laplacian matrix, $D$ is the degree matrix, $\lambda_{max}$ is the largest eigenvalue of the Laplacian matrix $L$, $I$ is the identity matrix, and $\Theta_m$ denotes the learnable parameters.

[0023] In conjunction with the first aspect, in some possible implementations, the second adjacency matrix is A2, $A2 \in \mathbb{R}^{J \times J}$,

[0024]

[0025] where $\Omega_i$ is the set of the K key points closest to key point $x_i$ in the feature space, $\Omega_i = \mathrm{KNN}(x_i, x_j, k)$, $j \in [1, \ldots, J]$; KNN is the K-nearest-neighbor algorithm; the distance between key point $x_j$ and key point $x_i$ in the feature space is $R(x_i, x_j)$, with $R(x_i, x_j) = \mathrm{Dist}(x_i, x_j)$, where $\mathrm{Dist}(x_i, x_j)$ is the Euclidean distance between key points $x_i$ and $x_j$.

[0026] In conjunction with the first aspect, in some possible implementations, the second graph convolution module determines local dynamic features based on the second adjacency matrix and the local static features, including: the second graph convolution module performs graph convolution operations using the Chebyshev polynomial as the convolution kernel to obtain the local dynamic features based on the second adjacency matrix and the local static features.

[0027]

[0028]

[0029] where $\tilde{L}$ is calculated from A2, and $D$ is the degree matrix of A2.

[0030] In conjunction with the first aspect, in some possible implementations, the method may further include: constraining the pose estimation model using a loss function, Loss.

[0031]

[0032] where $Y_{i,j}$ and $\hat{Y}_{i,j}$ denote the true and estimated values, respectively, of the three-dimensional coordinates of the j-th key point of sample i, N is the number of samples, and J is the total number of key points included in the target object.

[0033] In a second aspect, embodiments of the present invention provide a pose estimation device for performing three-dimensional pose estimation on a target object in a target image, wherein the target image is a two-dimensional image obtained by capturing the target object. The device includes: a Transformer encoder module for learning global features between key points of the target object; a first graph convolution module for determining local static features of the target object based on a first adjacency matrix and the global features, the first adjacency matrix being determined based on the physical connection relationships between key points of the target object; a second graph convolution module for determining local dynamic features based on a second adjacency matrix and the local static features, the second adjacency matrix being determined based on the sparse dynamic connection relationships between key points that are determined by the nearest neighbor algorithm and the action of the target object; and a regression module for determining the three-dimensional coordinates of the key points of the target object based on the local dynamic features.

[0034] When the pose estimation device provided in the embodiments of the present invention is used, the three-dimensional coordinates of the key points of the target object are determined based on the global features of the key points, the local static features corresponding to the physical connections, and the local dynamic features corresponding to the sparse dynamic relationships. Compared with the prior art, the accuracy of pose estimation is significantly improved when using the technical solution provided in the embodiments of the present invention, and the solution also has good generalization ability.

[0035] In conjunction with the second aspect, in some possible implementations, the Transformer encoder module is specifically used to: map the coordinates of all key points of the target object to the latent space through a linear transformation, while maintaining the spatial information of the key points using learnable spatial location encoding, and then integrate the information of all the key points through a multi-head self-attention layer (MSA) and a feedforward network (FFN) to obtain the global features of the key points of the target object. The calculation formula includes:

[0036] $X'^{(l)} = X^{(l-1)} + \mathrm{MSA}(\mathrm{LN}(X^{(l-1)}))$,

[0037] $X^{(l)} = X'^{(l)} + \mathrm{FFN}(\mathrm{LN}(X'^{(l)}))$,

[0038] where LN(·) denotes layer normalization, $l \in [1, \ldots, L]$ denotes the layer index, $\mathrm{LN}(X^{(l-1)})$ and $\mathrm{LN}(X'^{(l)})$ are the feature vectors after layer normalization, $X'^{(l)}$ is the latent feature vector of layer l, and $X^{(l)}$ is the feature vector output by layer l, representing the global features of layer l.

[0039] In conjunction with the second aspect, in some possible implementations, the first adjacency matrix is A1, $A1 \in \mathbb{R}^{J \times J}$, where J is the total number of key points included in the target image.

[0040]

[0041] In conjunction with the second aspect, in some possible implementations, the first graph convolution module is specifically used to perform graph convolution operations using a Chebyshev polynomial as the convolution kernel based on the first adjacency matrix and the global features, wherein the Chebyshev polynomial includes:

[0042]

[0043]

[0044]

[0045] Using the Chebyshev polynomial as the convolution kernel in the graph convolution operation, local static features are obtained through the graph convolution operation.

[0046]

[0047]

[0048] where $T_m(\tilde{L})$ denotes the Chebyshev polynomial of degree m, $\tilde{L}$ is the normalized Laplacian matrix, $D$ is the degree matrix, $\lambda_{max}$ is the largest eigenvalue of the Laplacian matrix $L$, $I$ is the identity matrix, and $\Theta_m$ denotes the learnable parameters.

[0049] In conjunction with the second aspect, in some possible implementations, the second adjacency matrix is A2, $A2 \in \mathbb{R}^{J \times J}$,

[0050]

[0051] where $\Omega_i$ is the set of the K key points closest to key point $x_i$ in the feature space, $\Omega_i = \mathrm{KNN}(x_i, x_j, k)$, $j \in [1, \ldots, J]$; KNN is the K-nearest-neighbor algorithm; the distance between key point $x_j$ and key point $x_i$ in the feature space is $R(x_i, x_j)$, with $R(x_i, x_j) = \mathrm{Dist}(x_i, x_j)$, where $\mathrm{Dist}(x_i, x_j)$ is the Euclidean distance between key points $x_i$ and $x_j$.

[0052] In conjunction with the second aspect, in some possible implementations, the second graph convolution module is specifically used to perform graph convolution operations using the Chebyshev polynomial as the kernel to obtain local dynamic features based on the second adjacency matrix and the local static features.

[0053]

[0054]

[0055] where $\tilde{L}$ is calculated from A2, and $D$ is the degree matrix of A2.

[0056] In conjunction with the second aspect, some possible implementations further include: a verification module for constraining the pose estimation performance using a loss function, Loss.

[0057]

[0058] where $Y_{i,j}$ and $\hat{Y}_{i,j}$ denote the true and estimated values, respectively, of the three-dimensional coordinates of the j-th key point of sample i, N is the number of samples, and J is the total number of key points included in the target object.

[0059] In a third aspect, embodiments of the present invention also disclose a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the steps in the pose estimation method described in the first aspect or any possible implementation of the first aspect.

[0060] In a fourth aspect, embodiments of the present invention also disclose a terminal device, comprising: a processor, a memory, and a communication bus; the memory stores a computer-readable program executable by the processor; the communication bus enables communication between the processor and the memory; when the processor executes the computer-readable program, it implements the steps in the pose estimation method described in the first aspect or any possible implementation of the first aspect.

Brief Description of the Drawings

[0061] To more clearly illustrate the technical solutions in the embodiments of the present invention, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of the present invention, and those skilled in the art can obtain other drawings based on these drawings.

[0062] Figure 1 is a flowchart of a pose estimation method provided in an embodiment of the present invention.

[0063] Figure 2 is a schematic diagram of a human skeleton model used in an embodiment of the present invention.

[0064] Figure 3 is a target image used for pose estimation in an embodiment of the present invention.

[0065] Figure 4A is a schematic diagram of the physical connection between the key points corresponding to the left wrist and the left elbow.

[0066] Figure 4B is a schematic diagram of the sparse dynamic connection between the key points of the left and right wrists.

[0067] Figure 5 is a schematic diagram of the pose estimation process for the target image shown in Figure 3 in an embodiment of the present invention.

[0068] Figure 6 is a schematic diagram of the structure of the Transformer encoder.

[0069] Figure 7A is a schematic diagram of the structure of the first graph convolution module.

[0070] Figure 7B is a schematic diagram of the structure of the second graph convolution module.

[0071] Figure 8A is a target image used for pose estimation in an embodiment of the present invention.

[0072] Figure 8B is a schematic diagram of the pose estimated using one existing technology.

[0073] Figure 8C is a schematic diagram of the pose estimated using another existing technology.

[0074] Figure 8D is a schematic diagram of the pose estimated using the technical solution provided by the present invention.

[0075] Figure 8E is a schematic diagram of the ground-truth pose.

[0076] Figure 9A is a target image used for pose estimation in an embodiment of the present invention.

[0077] Figure 9B is a schematic diagram of the pose estimated using one existing technology.

[0078] Figure 9C is a schematic diagram of the pose estimated using another existing technology.

[0079] Figure 9D is a schematic diagram of the pose estimated using the technical solution provided by the present invention.

[0080] Figure 9E is a schematic diagram of the ground-truth pose.

[0081] Figure 10 is a schematic diagram of the pose estimation device provided in an embodiment of the present invention.

[0082] Figure 11 is a schematic diagram of the structure of a terminal device provided in an embodiment of the present invention.

Detailed Implementation

[0083] This invention provides a pose estimation method, related apparatus, and storage medium. To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention will be further described in detail below with reference to the accompanying drawings. It should be understood that the specific embodiments described herein are only for explaining the invention and are not intended to limit the invention.

[0084] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. It should be understood that when an element is said to be “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and / or” as used herein includes all or any units and all combinations of one or more associated listed items.

[0085] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.

[0086] The pose estimation method disclosed in this invention is applicable to scenarios such as full-body human pose estimation, half-body human pose estimation, hand pose estimation, and animal pose estimation. Correspondingly, the target object for each scenario can be a person in a two-dimensional image, a half-body structure of a person, a person's hand, or an animal, etc. This invention uses full-body human pose estimation (hereinafter referred to as human pose estimation) as an example for description. The corresponding target object is a person in the image, which can be one person or multiple people. For simplicity, in the following embodiments, the image includes only one person, and this person in the image is the target object.

[0087] The human skeleton model is a tree-structured graph with kinematic characteristics. Its simple structure allows for an intuitive description of human pose. Figure 2 is a schematic diagram of the human skeleton model used in human pose estimation according to an embodiment of the present invention. Each node in the diagram represents an important key point in the human pose. Key points usually correspond to joints or key parts of the human body with a certain degree of freedom.

[0088] As shown in Figure 2, the model uses key points to abstract key parts or joints such as the head, spine, knees, and wrists. Line segments represent bone segments, indicating physical connections between related key points. Each key point in the human skeleton model can be represented by coordinates. It should be noted that other styles of human skeleton models can also be used. For example, in some application scenarios, nodes identifying parts such as the nose and knuckles can be added as key points to the human skeleton model; these are all feasible. This invention does not limit the human skeleton model.

[0089] As shown in Figure 2, the human skeleton includes seventeen nodes, each corresponding to an important key point in the human pose. These key points are labeled from 0 to 16, representing: hip, right hip, right knee, right foot, left hip, left knee, left foot, spine, chest, neck, head, left shoulder, left elbow, left wrist, right shoulder, right elbow, and right wrist.

[0090] The invention will be further explained below with reference to the accompanying drawings and the description of the embodiments.

[0091] Figure 1 is a flowchart of a pose estimation method according to an embodiment of the present invention. The pose estimation method shown in Figure 1 includes the following steps.

[0092] 101. The Transformer encoder learns the global features between the key points of the target object.

[0093] Specifically, the Transformer encoder maps the coordinates of all key points of the target object to the latent space through linear transformation, while maintaining the spatial information of the key points with learnable spatial location encoding. Then, the information of all key points is integrated through a multi-head self-attention layer (MSA) and feed-forward networks (FFN) to obtain the global features of all key points of the target object.

[0094] Take the pose estimation of the person in the image shown in Figure 3 as an example. The target object is the person in Figure 3, who is raising both hands in a greeting gesture.

[0095] It should be noted that, as shown in Figure 5, before the Transformer encoder performs its processing, the 2D pose of the target object is determined from the image shown in Figure 3 to obtain the 2D coordinates of each key point. This processing before the Transformer encoder can employ common methods in existing technologies, such as using a high-resolution network (High-Resolution Net, HRNet) to estimate the two-dimensional human pose in Figure 3. For details of the implementation process, please refer to the relevant descriptions in existing technologies, which will not be repeated here. The 2D coordinates are then input into the key point embedding module, and the embedded key points are added to the output of the position encoder to maintain their spatial position information. The result is input to the Transformer encoder.
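The embedding step described above can be sketched as follows. This is a minimal illustration, assuming the key point embedding module is a single linear map and the position encoder output is a table with one learnable vector per key point; the function name `embed_keypoints` and all parameter names are hypothetical and are not specified by the patent.

```python
def embed_keypoints(coords_2d, weight, bias, pos_embedding):
    """Map J 2D key points to J latent vectors and add a positional term.

    coords_2d:     J pairs (u, v) of 2D key point coordinates
    weight:        2 x C matrix of the (assumed) linear embedding
    bias:          length-C bias vector
    pos_embedding: J x C table, one learnable row per key point index
    """
    channels = len(bias)
    tokens = []
    for (u, v), pos in zip(coords_2d, pos_embedding):
        # Linear transformation of the 2D coordinates into the latent space.
        projected = [u * weight[0][c] + v * weight[1][c] + bias[c]
                     for c in range(channels)]
        # Adding the positional row keeps track of which key point this is.
        tokens.append([p + e for p, e in zip(projected, pos)])
    return tokens
```

The returned list of J token vectors is what a Transformer encoder layer would consume; in training, `weight`, `bias`, and `pos_embedding` would all be learned.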

[0096] As shown in Figure 6, the Transformer encoder 600 includes: a first layer normalization module, a second layer normalization module, a multi-head self-attention module (MSA), and a feedforward network (FFN). The MSA is used to model the relationships between multiple key points, and the FFN transforms the information. This embodiment uses the same activation function and structure as the classic Transformer, employing a residual structure and layer normalization (LN) operations. Its calculation process includes:

[0097] $X'^{(l)} = X^{(l-1)} + \mathrm{MSA}(\mathrm{LN}(X^{(l-1)}))$

[0098] $X^{(l)} = X'^{(l)} + \mathrm{FFN}(\mathrm{LN}(X'^{(l)}))$

[0099] where LN(·) denotes layer normalization, $l \in [1, \ldots, L]$ denotes the layer index, and $X^{(l)}$ is the output of layer l, where L is an integer that can be determined empirically.
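The two update equations can be sketched in plain Python as one encoder layer. This is a minimal illustration under stated assumptions: layer normalization has no learnable scale or shift, linear maps have no bias, the FFN uses ReLU, and the parameter dictionary keys (`wq`, `wk`, `wv`, `wo`, `w1`, `w2`) are hypothetical names, not taken from the patent.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    """Element-wise sum of two equally shaped matrices (the residual path)."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def layer_norm(x, eps=1e-5):
    """LN(.) applied to each key point token separately."""
    out = []
    for row in x:
        mean = sum(row) / len(row)
        var = sum((v - mean) ** 2 for v in row) / len(row)
        out.append([(v - mean) / math.sqrt(var + eps) for v in row])
    return out

def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def msa(x, wq, wk, wv, wo, heads):
    """Multi-head self-attention over the J key point tokens."""
    q, k, v = matmul(x, wq), matmul(x, wk), matmul(x, wv)
    d = len(q[0]) // heads            # channels per head
    merged = [[] for _ in x]
    for h in range(heads):
        lo, hi = h * d, (h + 1) * d
        for i, q_row in enumerate(q):
            scores = [sum(a * b for a, b in zip(q_row[lo:hi], k_row[lo:hi])) / math.sqrt(d)
                      for k_row in k]
            weights = softmax(scores)  # attention of token i over all tokens
            merged[i] += [sum(w * v_row[c] for w, v_row in zip(weights, v))
                          for c in range(lo, hi)]
    return matmul(merged, wo)

def ffn(x, w1, w2):
    """Two-layer feedforward network with ReLU."""
    hidden = [[max(0.0, h) for h in row] for row in matmul(x, w1)]
    return matmul(hidden, w2)

def encoder_layer(x, p, heads=2):
    # X'(l) = X(l-1) + MSA(LN(X(l-1)))
    x_mid = add(x, msa(layer_norm(x), p["wq"], p["wk"], p["wv"], p["wo"], heads))
    # X(l) = X'(l) + FFN(LN(X'(l)))
    return add(x_mid, ffn(layer_norm(x_mid), p["w1"], p["w2"]))
```

Stacking L such layers, each with its own parameters, gives the encoder output X^(L). Because both sub-blocks sit on residual paths, a layer with all-zero weights returns its input unchanged.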

[0100] 102. The first graph convolution module determines the local static features of the target object based on the first adjacency matrix and global features.

[0101] Let the first adjacency matrix be A1, where $A1 \in \mathbb{R}^{J \times J}$ and J is the total number of key points included in the target image. For the skeletal structure shown in Figure 3, J = 17.

[0102]

[0103] Taking Figure 4A as an example, there is a skeletal connection between node 4-1 and node 4-2, so the two nodes are key points with a physical connection. Taking Figure 2 as an example, key points 12 and 13 are connected by a bone, so their corresponding value in A1 is A1(12, 13) = 1. Conversely, key points 13 and 16 are not directly connected by a bone, so their corresponding value in A1 is A1(13, 16) = 0.
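A minimal sketch of how such a skeleton adjacency matrix could be built is shown below. The key point names follow the 0 to 16 labeling given earlier in the description. The `BONES` list is an assumption: it is the bone set commonly used for 17-joint skeletons and is not read from Figure 2. The matrix is also assumed to be symmetric with a zero diagonal.

```python
KEYPOINTS = [
    "hip", "right hip", "right knee", "right foot", "left hip", "left knee",
    "left foot", "spine", "chest", "neck", "head", "left shoulder",
    "left elbow", "left wrist", "right shoulder", "right elbow", "right wrist",
]

# Assumed bone list (pairs of key point indices joined by a bone).
BONES = [
    (0, 1), (1, 2), (2, 3),        # right leg
    (0, 4), (4, 5), (5, 6),        # left leg
    (0, 7), (7, 8), (8, 9), (9, 10),   # spine, chest, neck, head
    (8, 11), (11, 12), (12, 13),   # left arm
    (8, 14), (14, 15), (15, 16),   # right arm
]

def skeleton_adjacency(num_keypoints, bones):
    """Return A1 with A1[i][j] = 1 when key points i and j share a bone."""
    adj = [[0] * num_keypoints for _ in range(num_keypoints)]
    for i, j in bones:
        adj[i][j] = 1
        adj[j][i] = 1   # physical connections are undirected
    return adj

A1 = skeleton_adjacency(len(KEYPOINTS), BONES)
```

With this bone list, `A1[12][13]` is 1 and `A1[13][16]` is 0, matching the two values quoted in the text. Because the skeleton is fixed, A1 is the same for every pose, which is why the features derived from it are called static.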

[0104] The first graph convolution module performs graph convolution operations using Chebyshev polynomials as kernels based on the first adjacency matrix and global features. The Chebyshev polynomials include:

[0105]

[0106]

[0107]

[0108] Using the Chebyshev polynomials described above as the convolution kernel in the graph convolution operation, local static features are obtained through the graph convolution operation.

[0109]

[0110]

[0111] where $T_m(\tilde{L})$ denotes the Chebyshev polynomial of degree m, $\tilde{L}$ is the normalized Laplacian matrix, $D$ is the degree matrix, $\lambda_{max}$ is the largest eigenvalue of the Laplacian matrix $L$, $I$ is the identity matrix, and $\Theta_m$ denotes the learnable parameters.
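The formulas for this step are not reproduced in the text, so the following is only a sketch of a standard Chebyshev graph convolution that is consistent with the symbols defined above. It assumes $L = I - D^{-1/2} A D^{-1/2}$, $\tilde{L} = 2L/\lambda_{max} - I$, the recurrence $T_0 = I$, $T_1 = \tilde{L}$, $T_m = 2\tilde{L}T_{m-1} - T_{m-2}$, and an output of $\sum_m T_m(\tilde{L}) X \Theta_m$. The default `lam_max=2.0` is the upper bound for a normalized Laplacian and is an assumption, as are all function names.

```python
import math

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def normalized_laplacian(adj):
    """L = I - D^{-1/2} A D^{-1/2}, with D the diagonal degree matrix."""
    n = len(adj)
    inv_sqrt = [1.0 / math.sqrt(d) if d > 0 else 0.0
                for d in (sum(row) for row in adj)]
    return [[(1.0 if i == j else 0.0) - inv_sqrt[i] * adj[i][j] * inv_sqrt[j]
             for j in range(n)] for i in range(n)]

def rescale(lap, lam_max=2.0):
    """L~ = 2 L / lam_max - I, which maps the spectrum into [-1, 1]."""
    n = len(lap)
    return [[2.0 * lap[i][j] / lam_max - (1.0 if i == j else 0.0)
             for j in range(n)] for i in range(n)]

def cheb_graph_conv(x, adj, thetas, lam_max=2.0):
    """Compute sum_m T_m(L~) X Theta_m for m = 0 .. len(thetas) - 1.

    x:      J x C_in feature matrix (one row per key point)
    adj:    J x J adjacency matrix (A1 or A2)
    thetas: list of C_in x C_out parameter matrices, one per degree m
    """
    l_tilde = rescale(normalized_laplacian(adj), lam_max)
    t_prev = x                         # T_0(L~) X = X
    out = matmul(t_prev, thetas[0])
    if len(thetas) == 1:
        return out
    t_curr = matmul(l_tilde, x)        # T_1(L~) X = L~ X
    out = add(out, matmul(t_curr, thetas[1]))
    for theta in thetas[2:]:
        # T_m X = 2 L~ (T_{m-1} X) - T_{m-2} X
        t_next = [[2.0 * a - b for a, b in zip(ra, rb)]
                  for ra, rb in zip(matmul(l_tilde, t_curr), t_prev)]
        out = add(out, matmul(t_next, theta))
        t_prev, t_curr = t_curr, t_next
    return out
```

A degree-m term mixes information from key points up to m hops away in the graph, so the number of entries in `thetas` controls how far along the skeleton each key point can see.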

[0112] 103. The second graph convolution module determines the local dynamic features based on the second adjacency matrix and local static features.

[0113] The second adjacency matrix is A2, $A2 \in \mathbb{R}^{J \times J}$,

[0114]

[0115] where $\Omega_i$ is the set of the K key points closest to key point $x_i$ in the feature space, $\Omega_i = \mathrm{KNN}(x_i, x_j, k)$, $j \in [1, \ldots, J]$; KNN is the K-nearest-neighbor algorithm; the distance between key point $x_j$ and key point $x_i$ in the feature space is $R(x_i, x_j) = \mathrm{Dist}(x_i, x_j)$, where $\mathrm{Dist}(x_i, x_j)$ is the Euclidean distance between key points $x_i$ and $x_j$. For example, if $x_i$ is $(x_i, y_i, z_i)$ and $x_j$ is $(x_j, y_j, z_j)$, then the Euclidean distance between $x_i$ and $x_j$ is $\sqrt{(x_i - x_j)^2 + (y_i - y_j)^2 + (z_i - z_j)^2}$.
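A minimal sketch of a K-nearest-neighbor adjacency matrix built from key point features follows. It assumes that A2(i, j) is 1 when key point j belongs to the neighbor set of key point i and 0 otherwise, that a key point is not counted as its own neighbor, and that ties are broken by the lower index. None of these details is stated in the text, and the function name `knn_adjacency` is hypothetical.

```python
import math

def knn_adjacency(features, k):
    """Return A2 with A2[i][j] = 1 when j is one of the k nearest key points of i.

    features: one feature vector per key point (any fixed dimension)
    k:        number of neighbors kept per key point
    """
    n = len(features)
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        # Euclidean distance from key point i to every other key point.
        dists = [(math.dist(features[i], features[j]), j)
                 for j in range(n) if j != i]
        for _, j in sorted(dists)[:k]:
            adj[i][j] = 1
    return adj

# Toy example: four key points on a line, forming two tight pairs.
toy_features = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (5.0, 0.0, 0.0), (6.0, 0.0, 0.0)]
A2 = knn_adjacency(toy_features, k=1)
```

Because the neighbor sets are recomputed from the current features, A2 changes with the pose, unlike the fixed skeleton matrix A1. Keeping only k neighbors per key point is what makes the connections sparse.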

[0116] The second graph convolution module performs graph convolution operations using Chebyshev polynomials as kernels based on the second adjacency matrix and local static features. The Chebyshev polynomials include:

[0117]

[0118]

[0119]

[0120] Using Chebyshev polynomials as convolution kernels to perform graph convolution operations yields local dynamic features.

[0121]

[0122]

[0123] where $\tilde{L}$ is calculated from A2, and $D$ is the degree matrix of A2.

[0124] The second graph convolution module can learn sparse, dynamic K-nearest-neighbor relationships between key points for different poses. For example, as shown in Figure 4B, although there is no direct physical connection between node 4-1 and node 4-3, the two exhibit a sparse dynamic connection relationship in the pose presented in Figure 4B.

[0125] To improve accuracy, in some possible implementations, the output of the second graph convolution module can be input back to the Transformer encoder. After undergoing the same processing as described above by the Transformer encoder, the first graph convolution module, and the second graph convolution module, the output is then input to the regression module. The specific number of repetitions can be determined empirically and is not limited here.

[0126] 104. The regression module determines the three-dimensional coordinates of all key points of the target object based on local dynamic features.
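Steps 101 to 104, including the optional repetition of the encoder and graph convolution stages, can be sketched as a simple composition. The stage callables are placeholders supplied by the caller, and the linear regression head is an assumption, since the text does not specify the internal structure of the regression module.

```python
def linear_head(features, weight, bias):
    """Assumed regression module: map each J x C feature row to (x, y, z).

    weight: C x 3 matrix, bias: length-3 vector
    """
    return [[sum(f * w for f, w in zip(row, col)) + b
             for col, b in zip(zip(*weight), bias)]
            for row in features]

def estimate_pose(tokens, encoder, static_gcn, dynamic_gcn, regress, repeats=1):
    """Run the pipeline on embedded key point tokens.

    encoder, static_gcn, dynamic_gcn, regress: callables taking and returning
    a J x C feature matrix (regress returns J x 3 coordinates).
    repeats: how many times the three feature stages are applied in sequence.
    """
    features = tokens
    for _ in range(repeats):
        features = encoder(features)       # step 101: global features
        features = static_gcn(features)    # step 102: local static features (A1)
        features = dynamic_gcn(features)   # step 103: local dynamic features (A2)
    return regress(features)               # step 104: 3D coordinates
```

In a trained model each repetition would typically have its own parameters; passing the same callables for every repetition, as here, is only a simplification for the sketch.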

[0127] In some possible implementations, the pose estimation method may further include: constraining the pose estimation model using a loss function, Loss.

[0128]

[0129] where $Y_{i,j}$ and $\hat{Y}_{i,j}$ denote the true and estimated values, respectively, of the three-dimensional coordinates of the j-th key point of sample i, N is the number of samples, and J is the total number of key points in the target object. For example, if $Y_{i,j}$ is (x1, y1, z1) and $\hat{Y}_{i,j}$ is (x1′, y1′, z1′), the corresponding loss term measures the discrepancy between these two points. It should be noted that the smaller the Loss value, the more accurate the 3D human pose estimation.
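The loss formula itself is not reproduced in the text. As a sketch only, the code below assumes the loss is the Euclidean distance between true and estimated 3D coordinates, averaged over all N samples and J key points; a squared-error variant would fit the same symbol definitions equally well. The function name `mean_joint_error` is hypothetical.

```python
import math

def mean_joint_error(y_true, y_pred):
    """Average Euclidean distance between true and estimated key points.

    y_true, y_pred: N samples, each a list of J (x, y, z) tuples
    """
    n = len(y_true)
    j = len(y_true[0])
    total = sum(
        math.dist(true_pt, pred_pt)
        for sample_true, sample_pred in zip(y_true, y_pred)
        for true_pt, pred_pt in zip(sample_true, sample_pred)
    )
    return total / (n * j)

# One sample with two key points; only the first one is off, by a 3-4-5 triangle.
truth = [[(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]]
guess = [[(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)]]
loss = mean_joint_error(truth, guess)
```

Under this assumption the value is expressed in the same units as the coordinates, so it can be read directly as an average per-key-point position error.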

[0130] Using different technical solutions Figure 8A The pose estimation of the person in the middle is performed, and the estimation results are as follows: Figure 8B , Figure 8C and Figure 8D As shown, where Figure 8B The result is the pose estimation using the existing GraFormer method. Figure 8C It uses the estimation results obtained by employing only the Transformer encoder in existing technologies for pose estimation. Figure 8D The estimation result is obtained by using the technical solution provided in the embodiments of the present invention for attitude estimation. Figure 8E yes Figure 8A The image shown corresponds to the actual pose result. Observation reveals that, in the context of... Figure 8A When performing attitude estimation, there is a self-occlusion problem. By comparing the results of attitude estimation using different methods, especially based on the differences between the angles and relative positions of the line segments in the figure and the true values, it can be seen that the attitude estimation results using the technical solution provided in this embodiment of the invention are closer to the true values.

[0131] Similarly, different technical solutions are used for Figure 9A The pose estimation of the person in the middle is performed, and the estimation results are as follows: Figure 9B , Figure 9C and Figure 9D As shown, where Figure 9B The result is the pose estimation using the existing GraFormer method. Figure 9C It uses the estimation results obtained by employing only the Transformer encoder in existing technologies for pose estimation. Figure 9D The estimation result is obtained by using the technical solution provided in the embodiments of the present invention for attitude estimation. Figure 9E yes Figure 9A The image shown corresponds to the actual pose result. Observation reveals that, in the context of... Figure 9A Self-occlusion and depth blurring exist during attitude estimation. By comparing the results of attitude estimation using different methods, especially based on the differences between the angles and relative positions of the line segments in the figure and the true values, it can be seen that the attitude estimation results using the technical solution provided in this embodiment are closer to the true values.

[0132] When the pose estimation method provided in this embodiment of the invention is used, the three-dimensional coordinates of the key points of the target object are determined based on the global features of the key points, the local static features corresponding to the physical connections, and the local dynamic features corresponding to the sparse dynamic relationships. Compared with the prior art, the technical solution provided in this embodiment of the invention significantly improves the accuracy of pose estimation and also has good generalization ability.

[0133] Referring to Figure 10, which is a schematic diagram of a pose estimation device 1000 provided in an embodiment of the present invention, the pose estimation device 1000 is used to perform three-dimensional pose estimation on a target object in a target image. The target image is a two-dimensional image obtained by capturing the target object. The pose estimation device 1000 includes: a Transformer encoder module 1001, used to learn global features between the key points of the target object; a first graph convolution module 1002, used to determine local static features of the target object based on a first adjacency matrix and the global features, the first adjacency matrix being determined based on the physical connection relationship between the key points of the target object; a second graph convolution module 1003, used to determine local dynamic features based on a second adjacency matrix and the local static features, the second adjacency matrix being determined based on the sparse dynamic connection relationship between key points determined by the nearest neighbor algorithm and the action of the target object; and a regression module 1004, used to determine the three-dimensional coordinates of each key point of the target object based on the local dynamic features.

[0134] When the pose estimation device provided in this embodiment of the invention is used, the three-dimensional coordinates of the key points of the target object are determined based on the global features of the key points, the local static features corresponding to the physical connections, and the local dynamic features corresponding to the sparse dynamic relationships. Compared with the prior art, the technical solution provided in this embodiment of the invention significantly improves the accuracy of pose estimation and also has good generalization ability.

[0135] Referring to Figure 6, in some possible implementations, the Transformer encoder module 600 includes: a first layer normalization module, a second layer normalization module, a multi-head attention module, and a feedforward network module. Specifically, the Transformer encoder module 600 is used to: map the coordinates of all key points of the target object to the latent space through a linear transformation, while preserving the spatial information of the key points with a learnable spatial position encoding; and then integrate the information of all key points through a multi-head self-attention layer (MSA) and a feedforward network (FFN) to obtain the global features of all key points of the target object. The calculation formulas include:

[0136] $$X'^{(l)} = X^{(l-1)} + \mathrm{MSA}\big(\mathrm{LN}(X^{(l-1)})\big),$$

[0137] $$X^{(l)} = X'^{(l)} + \mathrm{FFN}\big(\mathrm{LN}(X'^{(l)})\big),$$

[0138] where $\mathrm{LN}(\cdot)$ denotes layer normalization, $l \in [1, \dots, L]$ denotes the layer index, $\mathrm{LN}(X^{(l-1)})$ and $\mathrm{LN}(X'^{(l)})$ are the feature vectors after layer normalization, $X'^{(l)}$ is the latent feature vector of layer $l$, and $X^{(l)}$ is the feature vector output by layer $l$, which represents the global features of layer $l$.
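The two residual updates above can be illustrated with a minimal NumPy sketch of one pre-normalization encoder layer. It is an illustration under stated assumptions, not the implementation of the embodiment: the sizes (17 key points, 32 channels, 4 heads), the ReLU activation in the feedforward network, the random weights, the absence of learnable scale and shift in the layer normalization, and all function and parameter names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # LN(.): normalize each key point's feature vector (no learnable scale/shift here).
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def msa(x, w_q, w_k, w_v, w_o, num_heads):
    # Multi-head self-attention over J key points; x has shape (J, C).
    J, C = x.shape
    d = C // num_heads
    q = (x @ w_q).reshape(J, num_heads, d).transpose(1, 0, 2)  # (H, J, d)
    k = (x @ w_k).reshape(J, num_heads, d).transpose(1, 0, 2)
    v = (x @ w_v).reshape(J, num_heads, d).transpose(1, 0, 2)
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d))      # (H, J, J)
    out = (attn @ v).transpose(1, 0, 2).reshape(J, C)          # (J, C)
    return out @ w_o

def ffn(x, w1, w2):
    # Two-layer feedforward network; the ReLU activation is an assumption.
    return np.maximum(x @ w1, 0.0) @ w2

def encoder_layer(x, p, num_heads):
    # X'(l) = X(l-1) + MSA(LN(X(l-1)))
    x_prime = x + msa(layer_norm(x), p["w_q"], p["w_k"], p["w_v"], p["w_o"], num_heads)
    # X(l) = X'(l) + FFN(LN(X'(l)))
    return x_prime + ffn(layer_norm(x_prime), p["w1"], p["w2"])

def init_params(channels, hidden, rng):
    shapes = {
        "w_q": (channels, channels), "w_k": (channels, channels),
        "w_v": (channels, channels), "w_o": (channels, channels),
        "w1": (channels, hidden), "w2": (hidden, channels),
    }
    return {name: rng.normal(scale=0.1, size=shape) for name, shape in shapes.items()}

rng = np.random.default_rng(0)
J, C = 17, 32                                    # hypothetical sizes
coords_2d = rng.normal(size=(J, 2))              # stand-in for 2D key point coordinates
w_embed = rng.normal(scale=0.1, size=(2, C))     # linear map to the latent space
pos_enc = rng.normal(scale=0.1, size=(J, C))     # learnable position encoding in practice
x0 = coords_2d @ w_embed + pos_enc
params = init_params(C, 64, rng)
global_features = encoder_layer(x0, params, num_heads=4)
print(global_features.shape)  # (17, 32)
```

Because both sub-blocks are residual, setting every weight matrix to zero makes the layer return its input unchanged, which is a convenient sanity check.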

[0139] In some possible implementations, the first adjacency matrix is $A_1$, $A_1 \in \mathbb{R}^{J \times J}$, where $J$ is the total number of key points included in the target image.

[0140]
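The defining formula for the first adjacency matrix is not reproduced in this text. As a minimal sketch, assuming the common convention that an entry is 1 when two key points are physically connected and 0 otherwise, the matrix could be built as follows. The five-joint skeleton, the function name `physical_adjacency`, and the optional self-loops are hypothetical.

```python
import numpy as np

def physical_adjacency(num_joints, bones, self_loops=False):
    # Assumed convention: a1[i, j] = 1 if key points i and j share a bone, else 0.
    a1 = np.zeros((num_joints, num_joints))
    for i, j in bones:
        a1[i, j] = 1.0
        a1[j, i] = 1.0          # physical connections are undirected
    if self_loops:
        np.fill_diagonal(a1, 1.0)  # whether the source includes self-loops is not stated
    return a1

# Hypothetical toy skeleton: 0 hip, 1 spine, 2 neck, 3 head, 4 shoulder.
bones = [(0, 1), (1, 2), (2, 3), (2, 4)]
a1 = physical_adjacency(5, bones)
print(a1)
```

The resulting matrix is symmetric and fixed for a given skeleton, which is why the features derived from it are described as static.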

[0141] Referring to Figure 7A, in some possible implementations, the first graph convolution module 700 may include multiple graph convolution layers that perform convolution with Chebyshev polynomials of order 0, 1, and 2, respectively. Specifically, the first graph convolution module 700 is used to perform graph convolution operations based on the first adjacency matrix and the global features, using Chebyshev polynomials as convolution kernels. The Chebyshev polynomials include:

[0142]

[0143]

[0144]

[0145] With the Chebyshev polynomials as the convolution kernel, the graph convolution operation yields the local static features:

[0146]

[0147]

[0148] where $T_m(\cdot)$ denotes the Chebyshev polynomial of degree $m$, applied to the normalized Laplacian matrix; $D$ is the degree matrix; $\lambda_{max}$ is the largest eigenvalue of the Laplacian matrix $L$; $I$ is the identity matrix; and $\Theta_m$ denotes the learnable parameters.
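The formulas in the preceding paragraphs are not reproduced in this text. The sketch below therefore assumes the standard Chebyshev graph convolution, which matches the symbols named above: $L = I - D^{-1/2} A D^{-1/2}$, the rescaled Laplacian $2L/\lambda_{max} - I$, the recurrence $T_0 = I$, $T_1(x) = x$, $T_m(x) = 2x\,T_{m-1}(x) - T_{m-2}(x)$, and an output equal to the sum over $m$ of $T_m$ applied to the rescaled Laplacian, times the input features, times $\Theta_m$. The function names and the three-node example graph are hypothetical.

```python
import numpy as np

def scaled_laplacian(adj):
    # L = I - D^{-1/2} A D^{-1/2}, then rescaled to 2L / lambda_max - I.
    adj = np.asarray(adj, dtype=float)
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    nonzero = deg > 0
    d_inv_sqrt[nonzero] = deg[nonzero] ** -0.5
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lam_max = np.linalg.eigvalsh(lap).max()   # eigvalsh assumes a symmetric matrix
    return 2.0 * lap / lam_max - np.eye(n)

def cheb_conv(x, adj, thetas):
    # x: (J, C_in) features; thetas: list of (C_in, C_out) matrices, one per order m.
    l_tilde = scaled_laplacian(adj)
    t_prev, t_curr = np.eye(l_tilde.shape[0]), l_tilde
    out = t_prev @ x @ thetas[0]                 # order 0: T_0 = I
    if len(thetas) > 1:
        out = out + t_curr @ x @ thetas[1]       # order 1: T_1 = L~
    for theta in thetas[2:]:
        # T_m = 2 L~ T_{m-1} - T_{m-2}
        t_prev, t_curr = t_curr, 2.0 * l_tilde @ t_curr - t_prev
        out = out + t_curr @ x @ theta
    return out

# Hypothetical 3-node path graph with 2-channel features and orders 0..2.
adj = np.array([[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=float)
x = np.arange(6, dtype=float).reshape(3, 2)
rng = np.random.default_rng(0)
thetas = [rng.normal(size=(2, 4)) for _ in range(3)]
local_static = cheb_conv(x, adj, thetas)
print(local_static.shape)  # (3, 4)
```

Rescaling by the largest eigenvalue keeps the spectrum of the Laplacian inside [-1, 1], the interval on which Chebyshev polynomials are well behaved.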

[0149] In some possible implementations, the second adjacency matrix is $A_2$, $A_2 \in \mathbb{R}^{J \times J}$,

[0150]

[0151] $\Omega_i$ is the set of the K key points nearest to key point $x_i$ in the feature space, $\Omega_i = \mathrm{KNN}(x_i, x_j, k)$, $j \in [1, \dots, J]$; KNN is the K-nearest neighbor algorithm; the distance between key point $x_j$ and key point $x_i$ in the feature space is $R(x_i, x_j)$, with $R(x_i, x_j) = \mathrm{Dist}(x_i, x_j)$, where $\mathrm{Dist}(x_i, x_j)$ is the Euclidean distance between key points $x_i$ and $x_j$.
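The neighbor sets described above can be sketched in NumPy as follows. The entry-wise formula for the second adjacency matrix is not reproduced in this text, so the sketch assumes that an entry in row i, column j is 1 when key point j belongs to the neighbor set of key point i and 0 otherwise, and that a key point is not counted as its own neighbor. The function name `knn_adjacency` and the toy features are hypothetical.

```python
import numpy as np

def knn_adjacency(features, k):
    # features: (J, C) array, one feature vector per key point.
    diff = features[:, None, :] - features[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))        # pairwise Euclidean distances, (J, J)
    np.fill_diagonal(dist, np.inf)                  # assumption: exclude the point itself
    neighbors = np.argsort(dist, axis=1)[:, :k]     # indices of the k nearest key points
    a2 = np.zeros(dist.shape)
    rows = np.arange(features.shape[0])[:, None]
    a2[rows, neighbors] = 1.0                       # assumed: a2[i, j] = 1 if j is in Omega_i
    return a2

# Hypothetical 1-D features for four key points.
features = np.array([[0.0], [1.0], [3.0], [7.0]])
a2 = knn_adjacency(features, k=1)
print(a2)
```

Because the distances are computed in feature space, the neighbor sets change with the input pose, which is what makes these connections dynamic. Note that a matrix built this way is generally not symmetric (here key point 1 is the nearest neighbor of key point 2, but not the reverse); whether the embodiment symmetrizes it is not stated in the source.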

[0152] Referring to Figure 7B, in some possible implementations, the second graph convolution module 701 may include a K-nearest neighbor algorithm module and a graph convolution layer module. Specifically, the second graph convolution module 701 is used to perform graph convolution operations based on the second adjacency matrix and the local static features, using Chebyshev polynomials as convolution kernels, to obtain the local dynamic features.

[0153]

[0154]

[0155] where the Laplacian matrix used in the convolution is calculated from $A_2$, and the degree matrix is that of $A_2$.

[0156] In some possible implementations, the pose estimation device 1000 may further include a verification module for constraining the pose estimation performance using a loss function Loss.

[0157]

[0158] where $Y_{i,j}$ and $\hat{Y}_{i,j}$ denote the true value and the estimated value, respectively, of the three-dimensional coordinates of the j-th key point of sample i; $N$ is the number of samples; and $J$ is the total number of key points included in the target object.
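The loss formula itself is not reproduced in this text. As a hedged illustration only, the sketch below assumes a mean squared error over all N samples and J key points, a common choice for 3D pose regression; the embodiment may use a different norm or weighting. The function name `pose_mse_loss` and the example shapes are hypothetical.

```python
import numpy as np

def pose_mse_loss(y_true, y_pred):
    # y_true, y_pred: (N, J, 3) arrays of true and estimated 3D key point coordinates.
    # Assumed form: average over samples and key points of the squared Euclidean error.
    sq_err = ((y_true - y_pred) ** 2).sum(axis=-1)   # (N, J)
    return sq_err.mean()

rng = np.random.default_rng(0)
y_true = rng.normal(size=(4, 17, 3))                 # hypothetical N=4, J=17
y_pred = y_true + np.array([1.0, 0.0, 0.0])          # every key point off by 1 along x
print(pose_mse_loss(y_true, y_pred))
```

With every estimate displaced by exactly one unit along a single axis, the loss evaluates to 1, and it is 0 when the estimates match the true coordinates.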

[0159] This invention also provides a computer-readable storage medium storing one or more programs that can be executed by one or more processors to implement some or all of the steps of any pose estimation method described in the above method embodiments. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, ROM, RAM, portable hard drives, magnetic disks, or optical disks.

[0160] Referring to Figure 11, an embodiment of the present invention provides a terminal device 1100. The terminal device 1100 includes: a processor 1101, a memory 1102, and a communication bus 1103. The memory 1102 stores a computer-readable program that can be executed by the processor 1101. The communication bus 1103 enables communication between the processor 1101 and the memory 1102. When the processor 1101 executes the computer-readable program, it implements some or all of the steps of any pose estimation method described in the above method embodiments.

[0161] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the present invention is not limited to the described order of actions, as some steps can be performed in other orders or simultaneously according to the present invention. In addition, those skilled in the art should also understand that the embodiments described in the specification are preferred embodiments, and the actions and modules involved are not necessarily essential to the present invention. In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0162] In the several embodiments provided by this invention, it should be understood that the disclosed apparatus can be implemented in other ways. The apparatus embodiments described above are merely illustrative. For instance, the division of units is only a logical functional division, and other divisions are possible in actual implementation: multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection between devices or units through some interfaces, and may be electrical or take other forms.

[0163] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs.

[0164] The above embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A pose estimation method, characterized in that a pose estimation model is used to estimate the three-dimensional pose of a target object in a target image, wherein the target image is a two-dimensional image of the target object; the pose estimation model includes a Transformer encoder, a first graph convolution module, a second graph convolution module, and a regression module; and the method includes: the Transformer encoder learns the global features of the key points of the target object; the first graph convolution module determines the local static features of the target object based on the first adjacency matrix and the global features, the first adjacency matrix being determined based on the physical connection relationships between the key points of the target object; the second graph convolution module determines local dynamic features based on the second adjacency matrix and the local static features, the second adjacency matrix being determined based on the nearest neighbor algorithm and the sparse dynamic connection relationship between the key points of the target object; and the regression module determines the three-dimensional coordinates of the key points of the target object based on the local dynamic features; wherein the second adjacency matrix is $A_2$, $A_2 \in \mathbb{R}^{J \times J}$; $\Omega_i$ is the set of the K key points nearest to key point $x_i$ in the feature space; $J$ is the total number of key points in the target image; KNN is the K-nearest neighbor algorithm; the distance between key point $x_j$ and key point $x_i$ in the feature space is $R(x_i, x_j)$, where $\mathrm{Dist}(x_i, x_j)$ is the Euclidean distance between key points $x_i$ and $x_j$; and $i$ and $j$ are the key points in physical space corresponding to $x_i$ and $x_j$, respectively.

2. The method according to claim 1, characterized in that the Transformer encoder learning the global features of the key points of the target object includes: the Transformer encoder maps the coordinates of the key points of the target object to the latent space through a linear transformation, while preserving the spatial information of the key points with a learnable spatial position encoding; and then integrates the information of the key points through a multi-head self-attention layer (MSA) and a feedforward network (FFN) to obtain the global features of the key points of the target object.

3. The method according to claim 2, characterized in that the first adjacency matrix is $A_1$, $A_1 \in \mathbb{R}^{J \times J}$; wherein, .

4. The method according to claim 3, characterized in that the first graph convolution module determining the local static features of the target object based on the first adjacency matrix and the global features includes: the first graph convolution module performs graph convolution operations based on the first adjacency matrix and the global features, using Chebyshev polynomials as convolution kernels, to obtain the local static features.

5. The method according to claim 4, characterized in that the second graph convolution module determining local dynamic features based on the second adjacency matrix and the local static features includes: the second graph convolution module performs graph convolution operations based on the second adjacency matrix and the local static features, using Chebyshev polynomials as convolution kernels, to obtain the local dynamic features.

6. A pose estimation device, characterized in that the device is used to perform three-dimensional pose estimation on a target object in a target image, wherein the target image is a two-dimensional image of the target object, and the pose estimation device includes: a Transformer encoder module, used to learn the global features between the key points of the target object; a first graph convolution module, used to determine the local static features of the target object based on the first adjacency matrix and the global features, the first adjacency matrix being determined based on the physical connection relationship between the key points of the target object; a second graph convolution module, used to determine local dynamic features based on the second adjacency matrix and the local static features, the second adjacency matrix being determined based on the sparse dynamic connection relationship between the key points determined by the nearest neighbor algorithm and the action of the target object; and a regression module, used to determine the three-dimensional coordinates of the key points of the target object based on the local dynamic features; wherein the second adjacency matrix is $A_2$, $A_2 \in \mathbb{R}^{J \times J}$; $\Omega_i$ is the set of the K key points nearest to key point $x_i$ in the feature space; $J$ is the total number of key points in the target image; KNN is the K-nearest neighbor algorithm; the distance between key point $x_j$ and key point $x_i$ in the feature space is $R(x_i, x_j)$, where $\mathrm{Dist}(x_i, x_j)$ is the Euclidean distance between key points $x_i$ and $x_j$; and $i$ and $j$ are the key points in physical space corresponding to $x_i$ and $x_j$, respectively.

7. A computer-readable storage medium, characterized in that the computer-readable storage medium stores one or more programs, which can be executed by one or more processors to implement the steps in the pose estimation method as described in any one of claims 1-5.

8. A terminal device, characterized in that it includes: a processor, a memory, and a communication bus; the memory stores a computer-readable program that can be executed by the processor; the communication bus enables communication between the processor and the memory; and when the processor executes the computer-readable program, it implements the steps in the pose estimation method as described in any one of claims 1-5.