A 3D human pose estimation method, device and storage medium

By employing cross-view feature adaptive enhancement and spatiotemporal co-modeling, the invention addresses inconsistent motion across human body parts and noise interference from auxiliary views, improving the accuracy and robustness of 3D human pose estimation.

CN122244960A · Pending · Publication Date: 2026-06-19 · NANCHANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: NANCHANG UNIV
Filing Date: 2026-05-25
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing 3D human pose estimation methods fail to fully consider the inconsistencies in the movement of different parts of the human body, are easily affected by noise in auxiliary views, and decouple spatiotemporal information, thus destroying the spatiotemporal entanglement structure of human movements, resulting in low estimation accuracy.

Method used

A cross-view feature adaptive enhancement and spatiotemporal co-modeling approach is adopted. The hierarchical knowledge extraction module jointly models the overall joint and part features of the human body, the feature adaptive enhancement module calculates spatial correlation, and the spatiotemporal co-encoder learns spatiotemporal entanglement features to output 3D human pose estimation results.

Benefits of technology

It effectively adapts to movement patterns with inconsistent frequencies and amplitudes in different parts of the body, filters out noise pollution, preserves the spatiotemporal entanglement structure of the movement, and improves the accuracy and robustness of 3D pose estimation.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention discloses a 3D human pose estimation method, device, and storage medium, belonging to the field of computer vision technology. To address the problems of existing technologies ignoring inconsistencies in human body part motion, susceptibility to noise interference from auxiliary views, and disruption of spatiotemporal entanglement structures, this invention proposes a cross-view feature adaptive enhancement and spatiotemporal collaborative modeling network. First, multi-view 2D keypoints are acquired; then, enhanced pose features are generated by jointly modeling the overall human body and part-specific features through a hierarchical knowledge extraction module; auxiliary view features are extracted and filtered using the spatial correlation of the current view as weights, achieving cross-view adaptive enhancement and fusion; subsequently, dual normalization along the channel and time dimensions is performed, and global spatiotemporal correlation is calculated by combining spatiotemporal identifier encoding to extract spatiotemporal collaborative features that retain the original motion entanglement structure; finally, the 3D coordinates are regressed and output. This invention effectively improves the estimation accuracy and robustness of 3D human pose.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and artificial intelligence, and in particular to a 3D human pose estimation method, device and storage medium based on cross-view feature adaptive enhancement and spatiotemporal co-modeling.

Background Technology

[0002] 3D human pose estimation aims to locate the coordinates of human joints in 3D space from videos and images. The acquired coordinate information has wide applications in human-computer interaction, behavior recognition, and virtual and augmented reality. Compared to single-view methods, which are limited by depth blur and occlusion problems, multi-view systems effectively alleviate these limitations by integrating complementary visual data from multiple perspectives.

[0003] In existing multi-view 3D human pose estimation research, some works utilize epipolar geometry and triangulation to determine 3D joint positions. However, these methods rely heavily on accurate camera parameters, which hinders their practical deployment in dynamically changing real-world environments. To address this, cross-view models that do not require camera parameters have been proposed. These models are typically based on graph convolutional networks and Transformer architectures, and they improve estimation accuracy by modeling spatial and temporal dependencies between joints, consecutive frames, and multiple views.

[0004] However, the above methods still have significant drawbacks in practical applications. First, existing methods fail to fully consider the inconsistencies in the movement patterns of different parts of the human body. For example, high-frequency movements of the arm and low-frequency movements of the spine are processed uniformly, so that subtle, instantaneous movements are smoothed out and lost. Second, the quality of the two-dimensional joints detected from different views varies greatly due to environmental influences. Existing cross-view fusion methods let the views interact indiscriminately, so the features of the current view are contaminated by low-quality, noisy information from the auxiliary view. Finally, existing methods usually decouple spatial and temporal features, for example by extracting spatial features first and then temporal features, or by processing them in two parallel branches. This approach severs the natural interweaving between an entity's posture deformation and its temporal evolution, destroying the original spatiotemporal entanglement structure of human movements and distorting the captured features.

[0005] Therefore, existing technologies have problems such as failing to fully consider the inconsistency of movement of different parts of the human body, being susceptible to noise interference from auxiliary views, and decoupling spatiotemporal information, which destroys the spatiotemporal entanglement structure of human body movements, resulting in low accuracy of 3D human pose estimation.

Summary of the Invention

[0006] To address the technical problems of existing technologies, such as failing to fully consider the inconsistencies in the movement of different parts of the human body, being susceptible to noise interference from auxiliary views, and decoupling spatiotemporal information and destroying the spatiotemporal entanglement structure of human movements, resulting in low accuracy of 3D human pose estimation, this invention provides a 3D human pose estimation method, device, and storage medium based on cross-view feature enhancement and spatiotemporal collaboration.

[0007] In a first aspect, the present invention provides a 3D human pose estimation method, comprising the following steps:
Acquire a continuous image sequence and use a 2D pose estimator to extract the 2D keypoint sequence of each joint of the human body under multiple views;
The two-dimensional keypoint sequence is input into the cross-view feature adaptive enhancement network. The hierarchical knowledge extraction module jointly models the overall joint features of the human body and the features of each part of the human body to generate enhanced posture features. The feature adaptive enhancement module calculates the spatial correlation within the current view as adaptive weights. The adaptive weights are used to weight and extract the auxiliary view features, and spatially enhanced and fused multi-view features are output;
The spatially enhanced and fused multi-view features are input into the spatiotemporal collaborative modeling network, and the spatiotemporal entanglement collaborative features are learned through the spatiotemporal co-encoder, including: performing dual normalization along the channel dimension and the time dimension to obtain spatiotemporally normalized entangled features; generating spatiotemporal identifier codes with the spatiotemporal identifier encoding module; fusing the spatiotemporal identifier codes with the spatiotemporally normalized entangled features to generate a query matrix and a key matrix for calculating correlation; using the spatiotemporally normalized entangled features without identifier encoding fusion as the value matrix; and then calculating the global spatiotemporal correlation based on the query matrix, key matrix and value matrix, and outputting the spatiotemporal collaborative features that retain the spatiotemporal entanglement structure;
The spatiotemporal collaborative features are input into the regression head for training, the three-dimensional coordinates of each joint in three-dimensional space are calculated, and the 3D human posture estimation results are output.

[0008] As an optional implementation of the first aspect of this application, the process of generating enhanced posture features by jointly modeling the overall joint features of the human body and the features of each human body part through a hierarchical knowledge extraction module includes: extracting posture-level spatial location information based on the two-dimensional keypoint sequence; dividing the human body into five human body part regions: left arm, right arm, spine, left leg, and right leg; constructing a part-level adjacency matrix based on the connection structure between joints and the joint point location information within each part; aggregating the position features through the part-level adjacency matrix to generate human body part graph features; constructing a posture-level adjacency matrix based on the spatial structure of the overall joints of the human body; aggregating the overall position features through the posture-level adjacency matrix to generate posture-level graph features; performing channel fusion of the human body part graph features and the posture-level graph features; concatenating the fused graph features with the posture-level spatial location information by feature dimension to obtain concatenated posture information; and performing feature mapping on the concatenated posture information through a linear transformation function to output the enhanced posture features.

[0009] As an optional implementation of the first aspect of this application, the process of using the feature adaptive enhancement module to calculate the spatial correlation within the current view as an adaptive weight includes: extracting the enhanced pose features corresponding to each view for the current view and the auxiliary view respectively; inputting the enhanced pose features corresponding to each view into a multi-head self-attention layer respectively; calculating the dot product attention between the joint features within the view in the multi-head self-attention layer, outputting the spatial features of the current view and the cross-joint spatial correlation within the current view, and simultaneously outputting the spatial features of the auxiliary view and the cross-joint spatial correlation within the auxiliary view; and using the cross-joint spatial correlation within the current view as the adaptive weight.

[0010] As an optional implementation of the first aspect of this application, the process of weighting and extracting auxiliary view features with the adaptive weights to output spatially enhanced and fused multi-view features includes: using the cross-joint spatial correlation within the current view as the query matrix and key matrix of cross-attention; using the spatial features of the auxiliary view as the value matrix of cross-attention; extracting auxiliary cross-view related features aligned with the current view from the auxiliary view through the cross-attention, and fusing the auxiliary cross-view related features with the spatial features of the current view to obtain spatially enhanced features of the current view; using the cross-joint spatial correlation within the auxiliary view as the query matrix and key matrix of cross-attention; using the spatial features of the current view as the value matrix of cross-attention; extracting current cross-view related features aligned with the auxiliary view from the current view through the cross-attention, and fusing the current cross-view related features with the spatial features of the auxiliary view to obtain spatially enhanced features of the auxiliary view; and outputting the spatially enhanced and fused multi-view features by introducing positional encoding and using convolutional layers to perform cross-channel adaptive fusion of the spatially enhanced features of the current view and the auxiliary view.

[0011] As an optional implementation of the first aspect of this application, the process of obtaining spatiotemporally normalized entangled features by performing dual normalization along the channel dimension and the time dimension includes: establishing a residual connection between the spatially enhanced cross-view fusion features and the spatial features of the current view and the auxiliary view to obtain a first initial enhanced view feature and a second initial enhanced view feature; performing instance normalization along the channel dimension of the first initial enhanced view feature and the second initial enhanced view feature to obtain spatially normalized features, and performing layer normalization along the time dimension of the features to obtain time-normalized features; and using learnable weight parameters to perform weighted merging of the spatially normalized features and the time-normalized features to output spatiotemporally normalized entangled features.
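The dual normalization described in this paragraph can be sketched as follows. This is a minimal, standard-library-only illustration, not the patented implementation: it assumes a single feature grid of T time steps by C channels, interprets "along the channel dimension" as statistics taken over channels for each time step and "along the time dimension" as statistics taken over time steps for each channel, and uses a fixed scalar `alpha` in place of the learnable weight parameters. The function names are hypothetical.

```python
import math

def normalize(values, eps=1e-5):
    """Zero-mean, unit-variance normalization of a 1-D list."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]

def dual_normalize(feat, alpha=0.5):
    """feat: T x C nested list. Returns alpha * channel-wise + (1 - alpha) * time-wise."""
    # Normalize across channels, separately for every time step (row).
    chan_norm = [normalize(row) for row in feat]
    # Normalize across time, separately for every channel (column).
    cols = [normalize(list(col)) for col in zip(*feat)]
    time_norm = [list(row) for row in zip(*cols)]
    # Weighted merge; alpha stands in for the learnable weight parameter.
    return [
        [alpha * c + (1.0 - alpha) * t for c, t in zip(c_row, t_row)]
        for c_row, t_row in zip(chan_norm, time_norm)
    ]

# Toy input: 4 time steps, 3 channels.
feat = [[1.0, 2.0, 3.0], [2.0, 4.0, 9.0], [0.0, 1.0, 5.0], [3.0, 3.0, 4.0]]
fused = dual_normalize(feat, alpha=0.5)
```

Setting `alpha` to 1.0 or 0.0 recovers either normalization on its own, which is a convenient way to check each branch separately.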

[0012] As an optional implementation of the first aspect of this application, the process of generating a spatiotemporal identifier code in conjunction with a spatiotemporal identifier encoding module, and fusing the spatiotemporal identifier code with the spatiotemporally normalized entangled features to generate a query matrix and a key matrix for calculating relevance includes: initializing a learnable channel embedding matrix as a spatial identifier, and initializing a learnable temporal embedding matrix as a temporal identifier; performing matrix multiplication on the channel embedding matrix and the temporal embedding matrix to generate the spatiotemporal identifier code; concatenating the spatiotemporal identifier code with the spatiotemporally normalized entangled features along the feature dimension; performing nonlinear mapping processing on the concatenated features using a multilayer perceptron to output spatiotemporal fusion features; and projecting the spatiotemporal fusion features into the query matrix and the key matrix using a linear layer.
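As a rough sketch of the identifier generation and concatenation steps in this paragraph, the example below multiplies a temporal embedding matrix by a channel embedding matrix and appends the result to the features. The shapes (T x d and C x d embeddings producing a T x C identifier), the random initialization, and all names are assumptions made for illustration; the subsequent multilayer perceptron and linear projections are omitted.

```python
import random

def matmul(a, b):
    """Multiply an (n x k) nested list by a (k x m) nested list."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Swap rows and columns of a nested list."""
    return [list(col) for col in zip(*m)]

def random_matrix(rows, cols, rng):
    """Small random values standing in for learnable parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

rng = random.Random(0)
T, C, d = 4, 3, 2  # frames, channels, embedding width (all hypothetical)

channel_embed = random_matrix(C, d, rng)   # spatial identifier, learnable in practice
temporal_embed = random_matrix(T, d, rng)  # temporal identifier, learnable in practice

# Identifier code: one scalar per (time step, channel) pair, shape T x C.
st_identifier = matmul(temporal_embed, transpose(channel_embed))

# Stand-in for the spatiotemporally normalized entangled features, shape T x C.
features = random_matrix(T, C, rng)

# Concatenate along the feature dimension, giving shape T x 2C.
concatenated = [f_row + id_row for f_row, id_row in zip(features, st_identifier)]
```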

[0013] As an optional implementation of the first aspect of this application, the process of using the spatiotemporally normalized entangled features without identifier encoding fusion as the value matrix, and then calculating the global spatiotemporal correlation based on the query matrix, key matrix, and value matrix to output spatiotemporal collaborative features that retain the spatiotemporal entanglement structure, includes: projecting the spatiotemporally normalized entangled features without identifier encoding fusion into the value matrix using a linear layer; calculating the dot product attention weight of the query matrix and the key matrix, and applying the dot product attention weight to the value matrix to capture the correlation between channel features and spatial context at a specific time step and the correlation between channel features and temporal context at a specific spatial location; and performing cross-view and cross-channel fusion of the attention output results through a feedforward neural network to output the spatiotemporal collaborative features.
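The asymmetry in this paragraph, where the query and key come from identifier-fused features while the value comes from features without identifier fusion, can be illustrated with a small scaled dot-product attention sketch. The token count, feature widths, projection matrices, and the function name `st_attention` are hypothetical, and the feedforward fusion step is left out.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) nested list by a (k x m) nested list."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def softmax(row):
    """Numerically stable softmax over a 1-D list."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def st_attention(fused, plain, w_q, w_k, w_v):
    """Q and K come from identifier-fused features; V comes from the plain ones."""
    q = matmul(fused, w_q)
    k = matmul(fused, w_k)
    v = matmul(plain, w_v)
    scale = math.sqrt(len(q[0]))
    k_t = [list(col) for col in zip(*k)]
    weights = [softmax([s / scale for s in row]) for row in matmul(q, k_t)]
    return matmul(weights, v), weights

# Three spatiotemporal tokens; widths are hypothetical.
plain = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]                  # normalized features
fused = [[1.0, 0.0, 0.2], [0.0, 1.0, -0.1], [1.0, 1.0, 0.3]]  # after identifier fusion
w_q = [[0.5, 0.1], [0.2, 0.4], [0.3, -0.2]]
w_k = [[0.4, -0.1], [0.1, 0.6], [-0.3, 0.2]]
w_v = [[1.0, 0.0], [0.0, 1.0]]  # identity, so V equals the plain features

output, weights = st_attention(fused, plain, w_q, w_k, w_v)
```

Because each output row is a convex combination of value rows, the identifier code influences only where attention is placed, not the content that is aggregated.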

[0014] As an optional implementation of the first aspect of this application, in the step of inputting the spatiotemporal collaborative features into the regression head for training, the overall regression loss function for training is computed between the predicted 3D coordinates of the j-th joint in the i-th frame, regressed using the regression head, and the corresponding actual 3D coordinates, where T represents the number of video frames and J represents the number of joints.
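The loss formula itself does not survive in this text, so the following is only one plausible reading: a mean per-joint Euclidean distance averaged over T frames and J joints, which is a common regression loss for 3D pose estimation. The choice of the Euclidean norm and the function name `mean_joint_error` are assumptions, not statements about the patented formula.

```python
import math

def mean_joint_error(pred, target):
    """pred, target: T x J x 3 nested lists of 3D joint coordinates.

    Returns the Euclidean distance between predicted and actual joints,
    averaged over all T frames and J joints (assumed form of the loss).
    """
    total = 0.0
    count = 0
    for frame_pred, frame_true in zip(pred, target):
        for joint_pred, joint_true in zip(frame_pred, frame_true):
            total += math.dist(joint_pred, joint_true)
            count += 1
    return total / count

# Toy example with T = 2 frames and J = 2 joints.
pred = [[[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]],
        [[0.0, 3.0, 4.0], [1.0, 1.0, 1.0]]]
target = [[[0.0, 0.0, 0.0], [1.0, 0.0, 2.0]],
          [[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]]]
loss = mean_joint_error(pred, target)  # distances 0, 2, 5, 0 -> mean 1.75
```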

[0015] In a second aspect, embodiments of this application provide an electronic device, which includes a processor, a memory, and a program or instructions stored in the memory and executable on the processor. When the program or instructions are executed by the processor, they implement the steps of the method described in the first aspect.

[0016] In a third aspect, embodiments of this application provide a readable storage medium on which a program or instructions are stored, which, when executed by a processor, implement the steps of the method described in the first aspect.

[0017] Compared with the prior art, the beneficial effects of the present invention are as follows: (1) This invention proposes a hierarchical knowledge extraction module. By jointly modeling the features of five human body parts and the overall posture level features, it not only utilizes the prior knowledge of the global human body structure to avoid predictions that violate anatomy, but also more effectively adapts to the movement patterns of different parts of the body with inconsistent frequencies and amplitudes, and preserves subtle and instantaneous local action information.

[0018] (2) The present invention designs a feature adaptive enhancement module, which uses the spatial feature correlation of the current view itself as a priori query to dynamically guide the extraction of complementary features from the auxiliary view. This on-demand fusion mechanism can actively filter out noise pollution caused by viewpoint occlusion or depth blur in the auxiliary view, ensuring the efficiency and reliability of cross-view information fusion.

[0019] (3) This invention proposes a spatiotemporal collaborative modeling network, breaking the limitations of spatiotemporal dimension separation and decoupling in traditional algorithms. Through the design of the dual normalization operation and spatiotemporal identifier encoding, the multi-hop dependency of time and space is explicitly modeled in a unified feature space. Furthermore, generating the value matrix from the original features, without identifier encoding, preserves the integrity of the spatiotemporal entanglement structure of actions and improves the accuracy of 3D pose estimation in complex dynamic scenes.

Description of the Drawings

[0020] Figure 1 is a schematic diagram of the cross-view feature adaptive enhancement and spatiotemporal collaborative modeling network architecture (CFAESCM-Net) provided by an embodiment of the present invention;
Figure 2 is a flowchart of a 3D human pose estimation method based on the cross-view feature adaptive enhancement and spatiotemporal collaborative modeling network architecture (CFAESCM-Net) according to an embodiment of the present invention;
Figure 3 is a schematic diagram of the spatial structure of the human body parts and the overall posture connection structure provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of the internal structure of the feature adaptive enhancement module provided in an embodiment of the present invention;
Figure 5 is a visualization of the spatiotemporal entanglement structure characteristics of action sequence samples provided in the embodiments of the present invention.

Detailed Implementation

[0021] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0022] The terms "first," "second," etc., used in the specification and claims of this application are used to distinguish similar objects and not to describe a specific order or sequence. It should be understood that such data can be interchanged where appropriate so that embodiments of this application can be implemented in orders other than those illustrated or described herein. Furthermore, in the specification and claims, "and / or" indicates at least one of the connected objects, and the character " / " generally indicates that the preceding and following objects are in an "or" relationship. In the description of this invention, "a plurality of" means two or more, unless otherwise explicitly specified.

[0023] Embodiment: As shown in Figure 1, this invention proposes a cross-view feature adaptive enhancement and spatiotemporal collaborative modeling network (CFAESCM-Net).

[0024] On one hand, this invention employs a Cross-View Feature Adaptive Enhancement Network (CFAE-Net) to address the inconsistencies in motion across different parts of the human body, while simultaneously achieving effective enhancement and interaction of cross-view features. Specifically, the Hierarchical Knowledge Extraction (HKE) module integrates a global awareness module to capture the connections between joints in the overall pose, and combines this with a part awareness module to extract part-level information from five human body regions: left arm, right arm, spine, left leg, and right leg. Subsequently, the information from the global level down to the part level is injected into the original 2D pose to generate richer semantic information. Furthermore, the Feature Adaptive Enhancement (FAE) module utilizes a self-attention mechanism to extract spatial features within each view, and then uses a cross-attention mechanism to generate a weighting scheme for auxiliary view features based on the spatial relevance of the current view, thereby extracting effective target information from the auxiliary views. This target information is then injected into the spatial features of the current view to obtain spatially enhanced view features. Finally, the cross-view fusion module fuses the spatially enhanced view features.

[0025] On the other hand, this invention proposes a Spatiotemporal Co-modeling Network (SCM-Net), which employs a spatiotemporal co-encoder to enhance the spatiotemporal entanglement structure of actions, thereby capturing strong spatiotemporal dependencies. First, double normalization (DN) is performed along both the time and channel dimensions to stabilize and standardize the input data distribution. Then, adaptive (learnable) weights are used to fuse the results of the two normalizations to obtain spatiotemporal entangled features. A Spatiotemporal Identifier Encoding Module (STIE) is designed to generate spatiotemporal information descriptors, which, when concatenated with the spatiotemporal entangled features, enhance the model's ability to distinguish between temporal and spatial locations. The concatenated features are then processed through a linear layer to generate spatiotemporal entanglement key and query matrices to compute spatiotemporal correlation (attention). Notably, the value matrix required for the attention mechanism is obtained by linearly projecting the original features (without STIE processing), ensuring that the final output features more purely preserve the spatiotemporal structural semantic knowledge of the actions.

[0026] In this embodiment, considering that increasing the number of cameras would lead to increased computational demands and reduced flexibility in field applications, the present invention uses only two cameras during the training and inference phases.

[0027] As shown in Figure 2, based on the aforementioned cross-view feature adaptive enhancement and spatiotemporal collaborative modeling network (CFAESCM-Net), this invention proposes a 3D human pose estimation method, which specifically includes the following implementation steps:
S1: Obtain a continuous image sequence and use a 2D pose estimator to extract the 2D keypoint sequence of each joint of the human body under multiple views.

[0028] For a continuous image sequence captured from multiple views over T frames, an offline 2D pose estimator is first used to obtain the 2D keypoint sequence for each view. This invention proposes a 2D-to-3D lifting network that regresses the 3D human pose from these 2D keypoint sequences, where J denotes the number of joints per frame. Figure 1 presents the overall architecture of the proposed Cross-View Feature Adaptive Enhancement and Spatiotemporal Co-modeling Network (CFAESCM-Net). The network comprises two sub-networks: the Cross-View Feature Adaptive Enhancement Network (CFAE-Net) and the Spatiotemporal Co-modeling Network (SCM-Net). CFAE-Net employs a hierarchical knowledge extraction module to capture the overall connection structure of joints and the joint distribution of the individual body parts. It then utilizes a feature adaptive enhancement module, constrained by the spatial relationships of features in the current view, to extract and fuse features from other views so as to enhance the features of the current view. Finally, a cross-view fusion module fuses the features of all enhanced views. SCM-Net employs a spatiotemporal co-encoder to establish spatiotemporal connections of joints in a unified feature space, thereby enhancing the spatiotemporal entanglement structure of human motion to capture stronger spatiotemporal dependencies.

[0029] S2: Input the two-dimensional key point sequence into the cross-view feature adaptive enhancement network, and jointly model the overall joint features of the human body and the features of each part of the human body through the hierarchical knowledge extraction module to generate enhanced posture features. Then, use the feature adaptive enhancement module to calculate the spatial correlation within the current view as adaptive weights, and use the adaptive weights to weight and extract the auxiliary view features to output spatially enhanced and fused multi-view features.

[0030] 1. Hierarchical knowledge extraction

We observed that different body parts follow their own inherent movement patterns; for example, the right arm typically exhibits a higher frequency and amplitude of movement than the spine. We term this phenomenon part-level motion inconsistency. Perceiving the human body only as a whole may lead to the loss or inaccurate estimation of instantaneous rapid movements. Furthermore, from 2D keypoint sequences, we can only obtain the spatial positions of joints on the image plane, but cannot explicitly capture the hierarchical connection structure between joints in the pose. To address these limitations, this invention proposes a hierarchical knowledge extraction module to jointly model the graph features of the joints as a whole (the entire pose) and the graph features of each body part. The specific construction of this information is described below.

[0031] Pose-level spatial position: Spatial position information records the specific position of each joint in the two-dimensional coordinate system of each view. It can be obtained from the two-dimensional keypoint sequence, and its definition is shown in formula (1).

[0032] Here, X represents the 2D keypoint sequence, and each 2D keypoint is described by its abscissa and ordinate in 2D space.

Human body part features: The inconsistency in the movement of different parts of the human body guides us to divide the human body into five parts (see Figure 3): left arm (green), right arm (blue), spine (yellow), left leg (purple), and right leg (red). Based on the connectivity structure between joints within each part and the known joint location information, we construct spatial graph features for each part to describe the spatial characteristics of the joints within that part. To highlight the connectivity structure between joints within a part, we define a part-level adjacency matrix for each part. If two joints belong to the same part, the corresponding entry of the part-level adjacency matrix is 1; otherwise it is 0. The right side of Figure 3 shows the joint connection structures of the five parts. The human body part graph features are given by formula (2).

[0033] In formula (2), the part-level adjacency matrix is a symmetric matrix, it is accompanied by its degree matrix and a learnable weight matrix, and the part index ranges over the set of human body parts.
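The part-level adjacency rule described above can be sketched as follows. The 17-joint indexing and the assignment of joints to the five parts are hypothetical (loosely following a common 17-joint skeleton layout), and the entry values of 1 and 0 are the conventional adjacency encoding, assumed here because the original symbols are not legible.

```python
# Hypothetical joint indices for a 17-joint skeleton, grouped into the five parts.
PARTS = {
    "spine": [0, 7, 8, 9, 10],
    "right_leg": [1, 2, 3],
    "left_leg": [4, 5, 6],
    "left_arm": [11, 12, 13],
    "right_arm": [14, 15, 16],
}
NUM_JOINTS = 17

def part_adjacency(joints, num_joints):
    """Return a num_joints x num_joints matrix with 1 where both joints lie in the part."""
    members = set(joints)
    return [
        [1 if i in members and j in members else 0 for j in range(num_joints)]
        for i in range(num_joints)
    ]

# One adjacency matrix per body part.
part_adj = {name: part_adjacency(joints, NUM_JOINTS) for name, joints in PARTS.items()}
```

Each matrix is symmetric by construction, which matches the symmetric-matrix property noted for formula (2).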

[0034] Pose-level graph features: 2D keypoint sequences describe discrete joint position information but do not include the spatial structure of the overall pose. As an important prior on human structure, this spatial structure is significant in two respects: (1) it strengthens the spatial structure constraints of the human body, avoiding prediction results that violate human anatomy; (2) the spatial structure graph implicitly encodes the kinematic chain rules, which can effectively enhance feature transfer between joints and alleviate occlusion problems. To this end, we propose to construct pose-level graph features to describe the spatial structure of human joints. The pose-level adjacency matrix is illustrated in the left half of Figure 3. If two joints are connected, the corresponding entry of the matrix is 1; otherwise it is 0. The pose-level graph features are defined by formula (3).

[0035] In formula (3), the pose-level adjacency matrix is a symmetric matrix, and the weight matrix is learnable.
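Formulas (2) and (3) are not legible in this text, but the surviving terms (a symmetric adjacency matrix, a degree matrix, and a learnable weight matrix) suggest a graph-convolution style aggregation. The sketch below assumes the widely used symmetric normalization with self-loops; that choice, the three-joint toy skeleton, and the function names are illustrative assumptions rather than the patented definition.

```python
import math

def matmul(a, b):
    """Multiply an (n x k) nested list by a (k x m) nested list."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalized_adjacency(adj):
    """Add self-loops, then scale entry (i, j) by 1 / sqrt(deg_i * deg_j)."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def graph_aggregate(adj, x, w):
    """Aggregate joint features over the graph, then apply the weight matrix."""
    return matmul(matmul(normalized_adjacency(adj), x), w)

# Toy skeleton: a chain of three joints, 0 - 1 - 2 (no self-loops in the input).
adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
x = [[0.0, 0.0], [1.0, 0.0], [2.0, 1.0]]  # 2D positions used as joint features
w = [[1.0, 0.0], [0.0, 1.0]]              # identity stands in for learned weights

out = graph_aggregate(adj, x, w)
```

The same routine could be applied with a part-level adjacency matrix to obtain part graph features and with the pose-level adjacency matrix to obtain pose-level graph features.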

[0036] To fully utilize the human body part graph features and the pose-level graph features, we first perform a weighted fusion of the two, and then concatenate the result with the pose-level spatial position information to enrich the original two-dimensional pose, yielding the concatenated pose information. Subsequently, this concatenated pose information is passed through a linear transformation function to obtain the enhanced pose features. This representation serves as the input for the subsequent spatiotemporal feature extraction module. The specific operation is shown in formula (4).

[0037] In formula (4), the view index ranges over the set of views, and the features are joined by a concatenation function.

[0038] 2. Feature adaptive enhancement

In multi-view camera systems, significant inconsistencies often exist between different viewpoints due to camera distance and the working environment, making it difficult to guarantee high-quality information from all views. Therefore, directly fusing information from different viewpoints may impair feature quality and expressive power. To address this issue, this invention proposes a feature adaptive enhancement module. This module dynamically adjusts the auxiliary view features extracted from other views based on the spatial correlation within the current view, thereby mitigating interference caused by excessive viewpoint differences and effectively enhancing the features of the current view. Finally, by fusing the spatially enhanced features of all views, the complementary information in the multi-view system is used efficiently.

[0039] The structure of the feature adaptive enhancement module is shown in Figure 4. This module first applies a multi-head self-attention mechanism to the enhanced pose features corresponding to each view. Given the current view and an auxiliary view, dot product attention is calculated between the joint features within each view in the multi-head self-attention layer to extract the spatial features of that view, with a given feature dimension, and to output the cross-joint spatial correlation of each layer. The specific definition is shown in formula (5).

[0040] In formula (5), the layer index refers to a layer of the spatial Transformer, and the formula involves a linear transformation, a multi-head self-attention mechanism, and the query matrix, key matrix, and value matrix. After the spatial Transformer, the feature dimensions are transformed to output the spatial features of the current view and the spatial features of the auxiliary view.

[0041] Subsequently, a cross-attention mechanism is used to extract useful features from the auxiliary view. The cross-view Transformer has the same number of layers as the spatial Transformer. To enhance the features of the current view, the spatial correlation within the current view is used as an adaptive weight, ensuring that the features extracted from the auxiliary view effectively enhance the features of the current view.

[0042] Specifically, the cross-joint spatial correlation obtained from the corresponding layer of the spatial Transformer is linearly transformed into the query (Q) and the key (K), and scaled dot-product attention is used to compute the dependencies between the features of the current view. These dependencies serve as attention weights and are multiplied with the value matrix (V) extracted from the auxiliary view, so that information is adaptively extracted from the auxiliary view as needed, as defined in formula (6).

[0043] In particular, in the first layer of the cross-attention mechanism, the input is given by a linear projection. The final output of the cross-view Transformer is the auxiliary cross-view related features. To enhance the spatial features of the current view, the information extracted from the auxiliary view is fused with the spatial features of the current view, as defined in formula (7).
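The cross-attention of formula (6) and the fusion of formula (7) can be sketched for a single layer as follows. The query and key come from the current view's cross-joint correlation and the value comes from the auxiliary view, as the text describes. The additive fusion, the function name `cross_view_enhance`, the shape of the correlation input, and all sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_view_enhance(corr_cur, feat_cur, feat_aux, Wq, Wk, Wv):
    """corr_cur: (J, Dc) current-view correlation; feat_*: (J, D) spatial features."""
    Q = corr_cur @ Wq  # query from the current view's spatial correlation
    K = corr_cur @ Wk  # key from the same correlation
    V = feat_aux @ Wv  # value from the auxiliary view
    weights = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))  # (J, J) adaptive weights
    extracted = weights @ V                            # information pulled from the auxiliary view
    # ASSUMPTION: fusion is a simple residual addition.
    return feat_cur + extracted

rng = np.random.default_rng(0)
J, Dc, d, D = 17, 17, 8, 16  # illustrative sizes only (hypothetical)
corr_cur = rng.normal(size=(J, Dc))
feat_cur = rng.normal(size=(J, D))
feat_aux = rng.normal(size=(J, D))
Wq, Wk = rng.normal(size=(Dc, d)), rng.normal(size=(Dc, d))
Wv = rng.normal(size=(D, D))
enhanced_cur = cross_view_enhance(corr_cur, feat_cur, feat_aux, Wq, Wk, Wv)
print(enhanced_cur.shape)
```

Swapping the roles of the two views in the same function gives the complementary direction, in which the current view enhances the auxiliary view.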

[0044] The above process uses information from the auxiliary view to enhance the features of the current view. Complementarily, to use information from the current view to enhance the features of the auxiliary view, the required information is extracted from the current view using the method of formula (6) and then, as in formula (7), fused with the spatial features of the auxiliary view to obtain the spatial enhancement features of the auxiliary view.

Subsequently, the enhanced features of the two views are fused by a cross-view fusion module to further exploit their shared and complementary information. Specifically, positional encoding is first added to the features of both views, and a convolutional network then performs adaptive fusion across channels, yielding the spatially enhanced and fused multi-view features. This process is described by formula (9).
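One way to picture the cross-view fusion step is the sketch below: positional encoding is added to both views, the views are stacked as channels, a 1x1 convolution mixes them, and batch normalization is applied to the intermediate result. The 1x1 kernel, the absence of an activation, the training-mode batch statistics, the function name `cross_view_fusion`, and the sizes are all assumptions, because the original formula is not available in the text.

```python
import numpy as np

def cross_view_fusion(feat_a, feat_b, pos, w, bias, eps=1e-5):
    """feat_a, feat_b: (B, J, D) spatially enhanced features of two views."""
    # Add the same positional encoding to both views, then stack views as channels.
    stacked = np.stack([feat_a + pos, feat_b + pos], axis=1)  # (B, 2, J, D)
    # A 1x1 convolution over the channel axis is a learned mix at every position.
    inter = np.einsum("oc,bcjd->bojd", w, stacked) + bias[None, :, None, None]
    # Batch normalization using batch statistics, without affine parameters.
    mean = inter.mean(axis=(0, 2, 3), keepdims=True)
    var = inter.var(axis=(0, 2, 3), keepdims=True)
    return ((inter - mean) / np.sqrt(var + eps))[:, 0]        # (B, J, D)

rng = np.random.default_rng(0)
B, J, D = 8, 17, 16  # illustrative sizes only (hypothetical)
feat_a = rng.normal(size=(B, J, D))
feat_b = rng.normal(size=(B, J, D))
pos = 0.1 * rng.normal(size=(J, D))  # stands in for a learnable positional encoding
w = np.array([[0.6, 0.4]])           # (out_channels=1, in_channels=2)
bias = np.zeros(1)
fused = cross_view_fusion(feat_a, feat_b, pos, w, bias)
print(fused.shape)
```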

[0045] In the cross-view fusion formula, one symbol denotes the intermediate features and another denotes a batch normalization layer.

S3: Input the spatially enhanced and fused multi-view features into the spatiotemporal collaborative modeling network. The spatiotemporal co-encoder performs double normalization along the channel dimension and the time dimension and, together with the spatiotemporal identifier encoding module, generates the spatiotemporal identifier encoding. Based on this encoding, the global spatiotemporal correlation across time steps and across channels is calculated, and spatiotemporal collaborative features that preserve the spatiotemporal entanglement structure are output.

[0046] 1. Spatiotemporal Entanglement Structure

The first two rows of Figure 5 show motion sequences sampled from the Human3.6M dataset. The lower-left plot shows how the X and Y coordinates of joints #15 and #16 change over time in motion sequence (a). The lower-right plot shows how the X and Y coordinates of joints #1 and #2 change over time in motion sequence (b). The vertical dashed lines in the plots mark the distance between the X and Y coordinates of the two joints. Analysis of this figure reveals the following:

[0047] First, as shown in Figure 5 (a) and (b), the motion trajectories of joints #16 and #2 exhibit high pose similarity in frames 4 to 10, demonstrating significant temporal correlation and reflecting the temporal dependency pattern in the action sequence.

[0048] Second, Figure 5 (c) and (d) plot the trajectories of the X and Y coordinates of joints #15 and #1 (the first-order connected joints of joints #16 and #2, respectively) over time. When joint #16 moves to a specific position, joint #15 inevitably undergoes a coordinated displacement. For example, when joint #16 crosses the midline of the body (the spine), the X coordinate of joint #15 shifts to the left and decreases in value; at the same time, the Y-axis distance between joints #1 and #2 gradually shortens during the movement, further confirming the kinematic coupling between local joints.

[0049] Furthermore, analysis based on kinematic constraints shows that if the position of the first-order connected joint #15 is fixed, joint #16 can only move on a sphere centered at #15 with the bone length as its radius; if its second-order connected joint #14 is also constrained, the limits on the joint rotation angle further reduce the reachable area of joint #16 to a specific local range on that sphere.

[0050] These findings confirm that the instantaneous position of any joint is simultaneously influenced by its temporal context (historical and future frames) and spatial context (geometric and kinematic constraints of the connection structure). Furthermore, due to the transmissibility of information along the spatiotemporal channel, joints within the same body part can influence non-corresponding joints across frames, and this influence dynamically strengthens or weakens with changes in the nth-order connection relationship. Motion is essentially a unified process of the dynamic evolution of the same entity in the temporal dimension and the structural deformation in the spatial dimension. Therefore, the aforementioned multiple spatiotemporal influences coexist and intertwine synchronously. We define this phenomenon, where spatiotemporal factors are inseparable and jointly determine the representation of motion, as a spatiotemporal entanglement structure.

[0051] 2. Spatiotemporal Collaborative Modeling Network

Modeling the spatiotemporal entanglement structure with separate spatial modules followed sequentially by temporal modules disrupts the inherent entanglement, so the learned features are distorted or only approximate the original data structure. This invention therefore proposes a spatiotemporal collaborative modeling network, shown in the lower half of Figure 1, which establishes stronger spatiotemporal dependencies by enhancing spatiotemporal entanglement.

[0052] In the cross-view feature adaptive enhancement network, the initial fusion of cross-view features has already been completed in the spatial dimension. To further integrate cross-view information along the temporal dimension, residual connections are applied between the spatially enhanced cross-view fusion features and the spatial features of the current view and of the auxiliary view, improving the richness and completeness of each view's features.

[0053] This yields the first initial enhanced view features for the current view and the second initial enhanced view features for the auxiliary view. Building on this foundation, the network proceeds with the extraction of spatiotemporal entanglement features within each view and the fusion of features between views.

[0054] The spatiotemporal collaborative modeling network employs a spatiotemporal collaborative encoder, aiming to extract strongly coupled spatiotemporal collaborative features.

[0055] The encoder first performs double normalization (DN) along both the channel and time dimensions to stabilize the input data distribution and enhance its regularity. It then uses a set of learnable weight parameters to adaptively fuse the normalization results from the two directions, constructing enhanced features with a tightly coupled spatiotemporal entanglement structure.

[0056] The entire process is mathematically described by formula (9).

[0057] In formula (9), instance normalization is applied along the channel dimension and layer normalization is applied along the time dimension. Merging the two normalized features yields the spatiotemporally normalized entangled features.
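The double normalization and weighted merge can be sketched as follows. The text does not state exactly which statistics each normalization uses, so this sketch assumes one branch normalizes across channels within each frame and the other normalizes across frames within each channel. The function name `double_norm`, the scalar weights `w_s` and `w_t`, and the sizes are hypothetical.

```python
import numpy as np

def double_norm(X, w_s, w_t, eps=1e-5):
    """X: (T, C) features of one view, T frames by C channels."""
    # Spatial branch. ASSUMPTION: statistics are taken over channels within each frame.
    x_s = (X - X.mean(axis=1, keepdims=True)) / np.sqrt(X.var(axis=1, keepdims=True) + eps)
    # Temporal branch. ASSUMPTION: statistics are taken over frames within each channel.
    x_t = (X - X.mean(axis=0, keepdims=True)) / np.sqrt(X.var(axis=0, keepdims=True) + eps)
    # Learnable weights merge the two normalized views of the same features.
    return w_s * x_s + w_t * x_t, x_s, x_t

rng = np.random.default_rng(0)
T, C = 27, 34  # illustrative sizes only (hypothetical)
X = rng.normal(loc=2.0, scale=3.0, size=(T, C))
F, x_s, x_t = double_norm(X, w_s=0.5, w_t=0.5)
print(F.shape)
```

Because both branches are computed from the same input rather than one after the other, neither the spatial nor the temporal statistics are discarded before the merge.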

[0058] Second, to make full use of spatiotemporal correlations, the encoder must clearly distinguish the spatial and temporal attributes of features. To this end, a Spatiotemporal Identifier Encoding (STIE) module is designed. This module introduces learnable spatiotemporal descriptors that enable the model to distinguish the positions of frames (temporal) and channels (spatial). The encoder thereby transforms the capture of inter-joint correlations and inter-frame relationships into the modeling of synchronous interactions across frames and channels. Finally, each feature value is updated under the regulation of the global spatiotemporal context, which enhances feature discriminative power and model generalization. The STIE module includes a channel embedding matrix and a temporal embedding matrix, which represent the spatial and temporal identifiers, respectively. To align with the dimensions of the spatiotemporally normalized entangled features, the two matrices are fused using formula (10) to obtain the spatiotemporal identifier encoding.

[0059] In formula (10), the channel embedding matrix and the temporal embedding matrix are trainable parameters.
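Claim 6 states that the identifier encoding is produced by matrix multiplication between the channel embedding matrix and the temporal embedding matrix. A minimal sketch of that product is given below; the embedding width `d`, the orientation of the product, the function name `st_identifier`, and the sizes are assumptions.

```python
import numpy as np

def st_identifier(E_t, E_c):
    """E_t: (T, d) temporal embeddings; E_c: (C, d) channel embeddings."""
    # Each entry S[t, c] is the inner product of frame t's and channel c's embeddings,
    # so every (frame, channel) position receives its own identifier value.
    return E_t @ E_c.T  # (T, C)

rng = np.random.default_rng(0)
T, C, d = 27, 34, 8  # illustrative sizes only (hypothetical)
E_t = rng.normal(size=(T, d))  # would be trainable in a real model
E_c = rng.normal(size=(C, d))  # would be trainable in a real model
S = st_identifier(E_t, E_c)
print(S.shape)
```

The result has one value per frame and channel, which is what allows it to be aligned with a (T, C) feature map.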

[0060] Subsequently, the expanded spatiotemporal identifier encoding is concatenated with the spatiotemporally normalized entangled features F and processed by a simple multilayer perceptron to achieve deep fusion of the spatiotemporal features and the identifier information, thereby obtaining spatiotemporal fusion features with context awareness and task adaptability. Based on these fusion features, a spatiotemporally entangled query matrix (Q) and key matrix (K) are generated. The dot product between the query and the key captures: (1) the correlation between a specific channel feature and the spatial context at a specific time step; and (2) the correlation between a specific channel feature and the temporal context at a specific spatial location. The value matrix (V) is still extracted from the original features F. This design ensures that the output of the self-attention mechanism is jointly constrained by the global spatiotemporal context while preserving the spatiotemporal semantic structure of the original features more faithfully. This process generates spatiotemporally entangled features for each view. The specific definition of the spatiotemporal entanglement encoder is given in formula (11).

[0061] Cross-view and cross-channel fusion is then achieved through a multilayer perceptron, generating the spatiotemporal collaborative features.
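The encoder described for formula (11) can be sketched as below. The query and key are derived from features fused with the identifier encoding, while the value is taken from the original features F. The sketch treats every (frame, channel) feature value as one token so that attention spans all time steps and channels at once; that tokenization, the two-layer MLP, the scalar value projection `w_v`, the function name `st_encoder`, and the sizes are assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def st_encoder(F, S, W1, b1, W2, b2, Wq, Wk, w_v):
    """F: (T, C) normalized entangled features; S: (T, C) identifier encoding."""
    T, C = F.shape
    # Pair each feature value with its identifier: one token per (frame, channel).
    Z = np.stack([F, S], axis=-1).reshape(T * C, 2)
    H = np.maximum(Z @ W1 + b1, 0.0) @ W2 + b2       # simple MLP -> fusion features
    Q, K = H @ Wq, H @ Wk                            # query and key carry the identifiers
    V = F.reshape(T * C, 1) * w_v                    # value comes from the original F only
    attn = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))   # (T*C, T*C) global correlation
    return (attn @ V).reshape(T, C), attn

rng = np.random.default_rng(0)
T, C, hidden, d = 6, 5, 16, 8  # illustrative sizes only (hypothetical)
F = rng.normal(size=(T, C))
S = rng.normal(size=(T, C))
W1, b1 = 0.5 * rng.normal(size=(2, hidden)), np.zeros(hidden)
W2, b2 = 0.5 * rng.normal(size=(hidden, d)), np.zeros(d)
Wq, Wk = 0.5 * rng.normal(size=(d, d)), 0.5 * rng.normal(size=(d, d))
out, attn = st_encoder(F, S, W1, b1, W2, b2, Wq, Wk, w_v=1.0)
print(out.shape, attn.shape)
```

With `w_v=1.0`, every output value is a convex combination of the original feature values, which illustrates how the identifiers steer where information flows without altering what is being mixed.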

[0062] S4: Input the spatiotemporal collaborative features into the regression head for training, calculate the three-dimensional coordinates of each joint in three-dimensional space, and output the 3D human posture estimation results.

[0063] In this embodiment of the invention, only the basic Mean Squared Error (MSE) is used for model training. This loss function aims to minimize the L2 distance error between the predicted human joint and the corresponding real joint, as expressed in formula (12).
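A minimal sketch of such a loss is shown below, averaging over all frames and joints. Whether the per-joint distance is squared is not recoverable from the text, so the squared form used here is an assumption, as are the function name `pose_mse_loss` and the sizes.

```python
import numpy as np

def pose_mse_loss(pred, gt):
    """pred, gt: (T, J, 3) predicted and ground-truth 3D joint coordinates."""
    # Squared L2 distance per joint, averaged over T frames and J joints.
    # ASSUMPTION: squared distance; the original formula is not reproduced in the text.
    return np.mean(np.sum((pred - gt) ** 2, axis=-1))

rng = np.random.default_rng(0)
T, J = 27, 17  # illustrative sizes only (hypothetical)
gt = rng.normal(size=(T, J, 3))
pred = gt + 1.0  # every coordinate is off by exactly one unit
loss = pose_mse_loss(pred, gt)
print(loss)
```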

[0064] In formula (12), one symbol denotes the predicted 3D coordinates of the j-th joint in the i-th frame, as regressed by the regression head, and the other denotes the corresponding ground-truth 3D coordinates.

Optionally, embodiments of this application also provide an electronic device, including a processor, a memory, and a program or instructions stored in the memory and executable on the processor. When the program or instructions are executed by the processor, they implement the various processes of the above-described 3D human pose estimation method embodiment and achieve the same technical effect. To avoid repetition, they will not be described again here.

[0065] This application also provides a readable storage medium storing a program or instructions. When the program or instructions are executed by a processor, they implement the various processes of the above-described embodiment of a 3D human pose estimation method and achieve the same technical effect. To avoid repetition, they will not be described again here.

[0066] The processor is the processor in the electronic device described in the above embodiments. The readable storage medium includes computer-readable storage media, such as computer read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.

[0067] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, it should be noted that the scope of the methods and apparatuses in the embodiments of this application is not limited to performing functions in the order shown or discussed, but may also include performing functions substantially simultaneously or in the reverse order, depending on the functions involved. For example, the described methods may be performed in a different order than described, and various steps may be added, omitted, or combined. Additionally, features described with reference to certain examples may be combined in other examples.

[0068] Through the above description of the embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by means of software plus necessary general-purpose hardware platforms. Of course, they can also be implemented by hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product is stored in a storage medium (such as ROM / RAM, magnetic disk, optical disk) and includes several instructions to cause a terminal (which may be a mobile phone, computer, server, air conditioner, or network device, etc.) to execute the methods described in the various embodiments of this application.

[0069] The embodiments of this application have been described above with reference to the accompanying drawings. However, this application is not limited to the specific embodiments described above. The specific embodiments described above are merely illustrative and not restrictive. Those skilled in the art can make many other forms under the guidance of this application without departing from the spirit and scope of the claims, and all of these forms are within the protection scope of this application.

Claims

1. A 3D human pose estimation method, characterized in that it includes the following steps:
acquiring a continuous image sequence and using a 2D pose estimator to extract the two-dimensional keypoint sequence of each joint of the human body under multiple views;
inputting the two-dimensional keypoint sequence into the cross-view feature adaptive enhancement network, wherein the hierarchical knowledge extraction module jointly models the overall joint features of the human body and the features of each part of the human body to generate enhanced pose features, and the feature adaptive enhancement module calculates the spatial correlation within the current view as adaptive weights, uses the adaptive weights to weight and extract the auxiliary view features, and outputs spatially enhanced and fused multi-view features;
inputting the spatially enhanced and fused multi-view features into the spatiotemporal collaborative modeling network and learning spatiotemporal entanglement collaborative features through the spatiotemporal co-encoder, including: performing double normalization along the channel dimension and the time dimension to obtain spatiotemporally normalized entangled features; generating a spatiotemporal identifier encoding with the spatiotemporal identifier encoding module; fusing the spatiotemporal identifier encoding with the spatiotemporally normalized entangled features to generate a query matrix and a key matrix for calculating correlation, while using the spatiotemporally normalized entangled features without identifier encoding fusion as the value matrix; and calculating the global spatiotemporal correlation based on the query matrix, key matrix, and value matrix to output spatiotemporal collaborative features that retain the spatiotemporal entanglement structure;
inputting the spatiotemporal collaborative features into the regression head for training, calculating the three-dimensional coordinates of each joint in three-dimensional space, and outputting the 3D human pose estimation results.

2. The method according to claim 1, characterized in that the process of generating enhanced pose features by jointly modeling the overall joint features of the human body and the features of individual body parts through the hierarchical knowledge extraction module includes:
extracting pose-level spatial location information based on the two-dimensional keypoint sequence;
dividing the human body into five parts (left arm, right arm, spine, left leg, and right leg), constructing a part-level adjacency matrix based on the connection structure between joints and the position information of the joints in each part, and aggregating the position features through the part-level adjacency matrix to generate body-part graph features;
constructing a pose-level adjacency matrix based on the spatial structure of the human body's joints, and aggregating the overall position features through the pose-level adjacency matrix to generate pose-level graph features;
fusing the body-part graph features and the pose-level graph features using channel fusion;
concatenating the fused graph features with the pose-level spatial location information along the feature dimension to obtain the concatenated pose information;
applying a linear transformation function to the concatenated pose information to output the enhanced pose features.

3. The method according to claim 2, characterized in that the process of using the feature adaptive enhancement module to calculate the spatial correlation within the current view as adaptive weights includes:
extracting, for the current view and the auxiliary view, the enhanced pose features corresponding to each view;
inputting the enhanced pose features corresponding to each view into the multi-head self-attention layer;
calculating, in the multi-head self-attention layer, the dot-product attention between the joint features within each view, and outputting the spatial features of the current view and the cross-joint spatial correlation within the current view, together with the spatial features of the auxiliary view and the cross-joint spatial correlation within the auxiliary view;
using the cross-joint spatial correlation within the current view as the adaptive weights.

4. The method according to claim 3, characterized in that the process of weighting and extracting auxiliary view features using the adaptive weights to output spatially enhanced and fused multi-view features includes:
using the cross-joint spatial correlation within the current view as the query matrix and key matrix for cross-attention and the spatial features of the auxiliary view as the value matrix for cross-attention, extracting from the auxiliary view, through cross-attention, auxiliary cross-view related features aligned with the current view, and fusing the auxiliary cross-view related features with the spatial features of the current view to obtain the spatial enhancement features of the current view;
using the cross-joint spatial correlation within the auxiliary view as the query matrix and key matrix for cross-attention and the spatial features of the current view as the value matrix for cross-attention, extracting from the current view, through cross-attention, current cross-view related features aligned with the auxiliary view, and fusing the current cross-view related features with the spatial features of the auxiliary view to obtain the spatial enhancement features of the auxiliary view;
introducing positional encoding and using convolutional layers to adaptively fuse the spatial enhancement features of the current view and the auxiliary view across channels, and outputting the spatially enhanced and fused multi-view features.

5. The method according to claim 4, characterized in that the process of obtaining spatiotemporally normalized entangled features by performing double normalization along the channel and time dimensions includes:
establishing a residual connection between the spatially enhanced cross-view fusion features and the spatial features of the current view and of the auxiliary view to obtain the first initial enhanced view features and the second initial enhanced view features;
for the first initial enhanced view features and the second initial enhanced view features, performing instance normalization along the channel dimension of the features to obtain spatially normalized features, and performing layer normalization along the time dimension of the features to obtain temporally normalized features;
weighting and merging the spatially normalized features and the temporally normalized features using learnable weight parameters to output the spatiotemporally normalized entangled features.

6. The method according to claim 5, characterized in that the process of generating a spatiotemporal identifier encoding with the spatiotemporal identifier encoding module, and fusing the spatiotemporal identifier encoding with the spatiotemporally normalized entangled features to generate a query matrix and a key matrix for calculating correlation, includes:
initializing a learnable channel embedding matrix as a spatial identifier, and initializing a learnable temporal embedding matrix as a temporal identifier;
generating the spatiotemporal identifier encoding by performing matrix multiplication between the channel embedding matrix and the temporal embedding matrix;
concatenating the spatiotemporal identifier encoding with the spatiotemporally normalized entangled features along the feature dimension;
using a multilayer perceptron to perform nonlinear mapping on the concatenated features to output spatiotemporal fusion features;
projecting the spatiotemporal fusion features into the query matrix and the key matrix using a linear layer.

7. The method according to claim 6, characterized in that the process of using the spatiotemporally normalized entangled features without identifier encoding fusion as the value matrix, and then calculating the global spatiotemporal correlation based on the query matrix, key matrix, and value matrix to output spatiotemporal collaborative features that preserve the spatiotemporal entanglement structure, includes:
projecting the spatiotemporally normalized entangled features, which have not been fused with the identifier encoding, into the value matrix using a linear layer;
calculating the dot-product attention weights between the query matrix and the key matrix, and applying the attention weights to the value matrix to capture the correlation between channel features and spatial context at a specific time step and the correlation between channel features and temporal context at a specific spatial location;
fusing the attention output across views and channels using a feedforward neural network to output the spatiotemporal collaborative features.

8. The method according to claim 7, characterized in that, in the step of inputting the spatiotemporal collaborative features into the regression head for training, the overall regression loss function for training is defined over the predicted 3D coordinates of the j-th joint in the i-th frame, as regressed by the regression head, and the corresponding actual 3D coordinates, where T represents the number of video frames and J represents the number of joints.

9. An electronic device, characterized in that it includes a processor, a memory, and a program or instructions stored in the memory and executable on the processor, wherein the program or instructions, when executed by the processor, implement the steps of the 3D human pose estimation method as described in any one of claims 1-8.

10. A readable storage medium, characterized in that a program or instructions are stored on the readable storage medium, and the program or instructions, when executed by a processor, implement the steps of the 3D human pose estimation method as described in any one of claims 1-8.