Motion feature extraction method, human pose estimation method, human mesh reconstruction method, device, and medium
By dividing video sequences into segments and modeling them using adaptive pose pooling and cross-attention mechanisms, the problem of pose jumps caused by ignoring time dependencies in video 3D human pose estimation and mesh reconstruction is solved, achieving more continuous and accurate 3D pose and mesh reconstruction.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- PEKING UNIV SHENZHEN GRADUATE SCHOOL
- Filing Date
- 2026-05-22
- Publication Date
- 2026-06-19
Figure CN122244472A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of computer graphics technology, and in particular to a motion feature extraction method, a human pose estimation method, a human mesh reconstruction method, a device, and a medium.
Background Technology
[0002] With the development of computer vision and deep learning technologies, 3D human pose estimation and mesh reconstruction techniques have been widely applied in fields such as motion analysis, intelligent monitoring, virtual reality, digital human modeling, and motion capture. Compared to 3D pose estimation and mesh reconstruction methods based on single-frame images, video sequence-based methods can utilize information from the temporal dimension, effectively improving the stability and accuracy of estimation results in dynamic scenes, and thus gradually becoming the mainstream direction of research and application.
[0003] Existing methods for 3D human pose estimation and mesh reconstruction in video typically employ a staged processing strategy: first, a 2D human pose estimator is used to extract 2D keypoint information from each video frame; then, the 2D pose sequence of consecutive frames is input into a temporal model to complete the mapping from 2D to 3D human pose and even human mesh. However, existing methods usually only perform temporal modeling within a single video segment, ignoring the temporal dependencies between adjacent segments. When human movements involve rapid changes, complex rotations, or long-term continuous motion, problems such as pose jumps and discontinuous movements can easily occur at segment boundaries, affecting the continuity of the overall 3D pose and mesh reconstruction.
[0004] Therefore, the existing technology still needs improvement.
Summary of the Invention
[0005] The technical problem to be solved by this application is to provide a motion feature extraction method, a human pose estimation method, a human mesh reconstruction method, a device, and a medium that address the shortcomings of existing technologies.
[0006] To address the aforementioned technical problem, a first aspect of this application provides a motion feature extraction method, which specifically includes: obtaining a video sequence and dividing the video sequence into multiple video segments; performing two-dimensional human pose estimation on the video segments to obtain the two-dimensional human pose and first image features of the video segments; performing adaptive pose pooling on the first image features based on the two-dimensional human pose to obtain second image features, and determining temporal motion features based on the two-dimensional human pose; and using a cross-attention mechanism to determine, based on the temporal motion features and the second image features, the motion features corresponding to the edge temporal motion features among the temporal motion features, and integrating those motion features with the temporal motion features other than the neighborhood temporal motion features to obtain motion features with global temporal constraints.
[0007] In the motion feature extraction method, the step of performing temporal consistency modeling on the second image features and the temporal motion features to obtain motion features with global temporal constraints specifically includes: selecting edge temporal motion features from the temporal motion features based on the video segments; obtaining the neighborhood temporal motion feature sequence of the edge temporal motion features and the second image feature sequence corresponding to the neighborhood temporal motion feature sequence; and using a cross-attention mechanism to interact the neighborhood temporal motion feature sequence with its corresponding second image feature sequence to obtain the motion features corresponding to the edge temporal motion features.
[0008] In the motion feature extraction method, the step of using a cross-attention mechanism to interact the neighborhood temporal motion feature sequence with its corresponding second image feature sequence to obtain the motion features corresponding to the edge temporal motion features specifically includes: constructing a query vector based on the neighborhood temporal motion feature sequence, and constructing a value vector and a key vector based on the second image feature sequence corresponding to the neighborhood temporal motion feature sequence; and determining, based on the query vector, the value vector, and the key vector, the motion features corresponding to the edge temporal motion features using a cross-attention mechanism.
[0009] In the motion feature extraction method, determining the motion features corresponding to the edge temporal motion features based on the query vector, the value vector, and the key vector using a cross-attention mechanism specifically includes: obtaining the confidence level of the edge temporal motion features; determining motion weights based on the confidence level, and determining image weights based on the motion weights; and, based on the motion weights and the image weights, interacting the query vector, the value vector, and the key vector using an attention mechanism to obtain the motion features corresponding to the edge temporal motion features.
[0010] In the motion feature extraction method described above, the motion weights are positively correlated with the confidence level, and the image weights are negatively correlated with the confidence level.
[0011] In the motion feature extraction method, performing adaptive pose pooling on the first image features based on the two-dimensional human pose to obtain the second image features specifically includes: determining the pose offset corresponding to the first image features through pose offset perception; and, based on the two-dimensional human pose, sampling the first image features using the pose offset to obtain the second image features.
[0012] A second aspect of this application provides a human pose estimation method, which specifically includes: extracting motion features of a target video using the motion feature extraction method described above; and estimating the three-dimensional human pose based on the motion features.
[0013] A third aspect of this application provides a human mesh reconstruction method, which specifically includes: extracting motion features of a target video using the motion feature extraction method described above; fusing the motion features with the image features of the target video to obtain fused features; performing mesh reconstruction in the camera coordinate system based on the fused features to obtain a first human body mesh; performing mesh reconstruction in the world coordinate system based on the motion features to obtain a second human body mesh, and predicting the motion trajectory and motion speed based on the second human body mesh; and adjusting the first human body mesh based on the motion trajectory and motion speed to obtain the reconstructed human body mesh corresponding to the target video.
[0014] A fourth aspect of this application provides a computer-readable storage medium storing one or more programs that can be executed by one or more processors to implement the steps in any of the motion feature extraction methods described above.
[0015] A fifth aspect of this application provides a terminal device comprising: a processor and a memory; The memory stores a computer-readable program that can be executed by the processor; When the processor executes the computer-readable program, it implements the steps in any of the motion feature extraction methods described above.
[0016] Beneficial Effects: This application provides a motion feature extraction method, a human pose estimation method, a human mesh reconstruction method, a device, and a medium. The method includes dividing a video sequence into multiple video segments; performing two-dimensional human pose estimation on the video segments to obtain the two-dimensional human pose and first image features of the video segments; performing adaptive pose pooling on the first image features based on the two-dimensional human pose to obtain second image features, and determining temporal motion features based on the two-dimensional human pose; and performing temporal consistency modeling on the second image features and the temporal motion features to obtain motion features with global temporal constraints. By introducing temporal consistency modeling in the three-dimensional pose mapping stage, this application models human pose features both within and between video segments, explicitly captures the temporal dependencies of human pose, and alleviates the boundary discontinuity problem caused by video segmentation.
Brief Description of the Drawings
[0017] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the accompanying drawings described below show only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0018] Figure 1 is a flowchart of the motion feature extraction method provided in the embodiments of this application.
[0019] Figure 2 is a flowchart illustrating the principle of one embodiment of the motion feature extraction method provided in this application.
[0020] Figure 3 is a flowchart illustrating the principle of the adaptive pose pooling process provided in the embodiments of this application.
[0021] Figure 4 is a schematic block diagram of the terminal device provided in the embodiments of this application.
Detailed Implementation
[0022] This application provides a motion feature extraction method, a human pose estimation method, a human mesh reconstruction method, a device, and a medium. To make the objectives, technical solutions, and effects of this application clearer and more explicit, the following detailed description is provided with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only for explaining this application and are not intended to limit this application.
[0023] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this application means the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. It should be understood that when an element is described as “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and / or” as used herein includes all or any units and all combinations of one or more associated listed items.
[0024] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically defined as herein.
[0025] It should be understood that the magnitude of the sequence numbers of the steps in this embodiment does not imply the order of execution. The execution order of each process is determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application.
[0026] Research has shown that with the development of computer vision and deep learning technologies, 3D human pose estimation and mesh reconstruction techniques have been widely applied in fields such as motion analysis, intelligent monitoring, virtual reality, digital human modeling, and motion capture. Compared to 3D pose estimation and mesh reconstruction methods based on single-frame images, video sequence-based methods can utilize temporal information to effectively improve the stability and accuracy of estimation results in dynamic scenes, thus gradually becoming the mainstream direction of research and application.
[0027] Existing methods for 3D human pose estimation and mesh reconstruction in video typically employ a staged processing strategy: first, a 2D human pose estimator is used to extract 2D keypoint information from each video frame; then, the 2D pose sequence of consecutive frames is input into a temporal model to complete the mapping from 2D to 3D human pose and even human mesh. However, existing methods usually only perform temporal modeling within a single video segment, ignoring the temporal dependencies between adjacent segments. When human movements involve rapid changes, complex rotations, or long-term continuous motion, problems such as pose jumps and discontinuous movements can easily occur at segment boundaries, affecting the continuity of the overall 3D pose and mesh reconstruction.
[0028] Furthermore, most existing methods only use two-dimensional human keypoint coordinates as an intermediate representation, ignoring the high-resolution image features generated by the two-dimensional pose estimator when processing video frames. These high-resolution image features contain rich local appearance, human structure, and spatial semantic information, and their ineffective utilization will limit the accuracy of three-dimensional pose estimation and mesh reconstruction.
[0029] To address the aforementioned issues, in this embodiment, a video sequence is divided into multiple video segments; two-dimensional human pose estimation is performed on each video segment to obtain its two-dimensional human pose and first image features; adaptive pose pooling is applied to the first image features based on the two-dimensional human pose to obtain second image features, and temporal motion features are determined based on the two-dimensional human pose; temporal consistency modeling is performed on the second image features and the temporal motion features to obtain motion features with global temporal constraints. This application introduces temporal consistency modeling in the three-dimensional pose mapping stage, performing temporal consistency modeling on human pose features within and between video segments, explicitly modeling the temporal dependencies of human pose, and alleviating the boundary discontinuity problem caused by video segmentation.
[0030] The application content will be further explained below with reference to the accompanying drawings and the description of the embodiments.
[0031] This embodiment provides a motion feature extraction method. As shown in Figure 1 and Figure 2, the motion feature extraction method specifically includes: S10. Obtain the video sequence and divide the video sequence into multiple video segments.
[0032] Specifically, the video sequence can be a continuous sequence of images captured in real time by a camera, or a pre-recorded video file read from a storage device, which includes continuous frames of human movement.
[0033] It can be understood that video sequences often span long periods of time, and processing the entire sequence directly would increase computational complexity and reduce processing efficiency. Therefore, after the video sequence is obtained, it can be divided into multiple consecutive video segments. The segmentation can be based on factors such as the video frame rate, a preset duration, or the completeness of the action segments. For example, the video sequence can be divided into segments of 100 frames each, or the start and end frames of actions can be identified using motion detection algorithms and used as the basis for segmentation. Each segment serves as the basic unit for subsequent processing, facilitating local feature extraction and temporal modeling.
[0034] It should be noted that in practical applications, the division method can also be determined according to the actual application requirements. For example, the division method can be based on keyframes.
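The fixed-length division mentioned in the example above can be sketched as follows. This is a minimal illustration only: the function name `split_into_segments` and the parameter `segment_len` are hypothetical, and the application does not prescribe how a final, shorter segment is handled.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def split_into_segments(frames: Sequence[T], segment_len: int = 100) -> List[Sequence[T]]:
    """Divide a frame sequence into consecutive, non-overlapping segments.

    The last segment may be shorter than segment_len (an assumption of
    this sketch, not something stated in the application).
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    return [frames[i:i + segment_len] for i in range(0, len(frames), segment_len)]


# Integers stand in for video frames in this example.
segments = split_into_segments(list(range(250)), segment_len=100)
```

Action-based or keyframe-based division would replace the fixed stride with boundaries supplied by a detector, which is outside the scope of this sketch.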
[0035] S20. Perform two-dimensional human pose estimation on the video segment to obtain the two-dimensional human pose and first image features of the video segment.
[0036] Specifically, performing two-dimensional human pose estimation on the video segment refers to estimating the two-dimensional human pose for each frame of the video segment to obtain the two-dimensional human pose and first image features corresponding to each frame. This two-dimensional human pose estimation can be processed using a pre-trained two-dimensional human pose estimation model, which can be a deep learning-based network architecture such as HRNet (High-Resolution Network), SHN (Stacked Hourglass Network), or CPN (Cascaded Pyramid Network). When using a pre-trained two-dimensional human pose estimation model, the model not only outputs the two-dimensional keypoint coordinates of the human body in each frame (i.e., the two-dimensional human pose), but also outputs a high-resolution feature map of that frame at an intermediate layer or a specific feature extraction layer. The extracted high-resolution feature map constitutes the first image features. It can be understood that the first image features may include one high-resolution feature map or multiple high-resolution feature maps, and when multiple high-resolution feature maps are included, the image scales of the multiple high-resolution feature maps are different.
[0037] In this embodiment, the first image feature is a high-resolution feature map extracted during the two-dimensional human pose estimation process. This first image feature contains rich image detail information, such as the texture, contour, clothing folds, and spatial semantic information of various parts of the human body, as well as the background environment. This spatial semantic information is of great significance for improving the accuracy of subsequent three-dimensional pose estimation and mesh reconstruction. For example, when some two-dimensional key points in the two-dimensional human pose are inaccurately located due to occlusion or blurring, the local appearance information contained in the first image feature can provide additional constraints and supplements.
[0038] S30. Based on the two-dimensional human pose, adaptive pose pooling is performed on the first image features to obtain the second image features, and the temporal motion features are determined based on the two-dimensional human pose.
[0039] Specifically, adaptive pose pooling is used to perform regional feature aggregation on the features of the first image based on the positions of key points in the 2D human pose, thereby enhancing the perception of key parts of the human body. In scenarios involving complex actions, occlusion, or rapid movement, it can significantly improve the estimation accuracy of local pose and human structure. During adaptive pose pooling, the feature sampling area and weights can be adaptively adjusted according to the key positions and pose changes of the human body to extract local features related to key parts of the human body (such as joints and limbs).
[0040] In one embodiment, performing adaptive pose pooling on the first image features based on the two-dimensional human pose to obtain the second image features specifically includes: determining the pose offset corresponding to the first image features through pose offset perception; and, based on the two-dimensional human pose, sampling the first image features using the pose offset to obtain the second image features.
[0041] Specifically, the pose offset refers to the positional or directional deviation of the two-dimensional human pose estimation result relative to the actual human pose during movement. It reflects the positioning error of the two-dimensional keypoints on the image plane, or the prediction deviation of the pose estimation model with respect to changes in human pose. For example, when a person performs a rapid arm swing, motion blur or occlusion may cause the wrist keypoint coordinates output by the two-dimensional human pose estimation model to be offset by a few pixels from the actual wrist position in the image; this offset is the pose offset. Similarly, when the human torso twists, the direction of the line connecting keypoints such as the shoulder and hip obtained from two-dimensional pose estimation may deviate from the actual torso direction by a certain angle; this angular deviation can also be considered part of the pose offset. By accurately calculating the pose offset, the sampling area of the first image features can be dynamically adjusted during subsequent pose-aware sampling, so that image features related to key human body parts are captured more accurately.
[0042] As shown in Figure 3, the pose offset can be obtained as follows: first, an estimated offset and an offset weight are obtained from the first image features through a pose offset estimation model; then, a weighted offset is determined based on the estimated offset and the offset weight, and the weighted offset is fused with the two-dimensional human pose to obtain the pose offset.
[0043] Furthermore, after obtaining the pose offset, the pose offset is fused with the first image feature to obtain the second image feature. When fusing the pose offset with the first image feature, the first image feature can first be convolved by a convolutional layer, and then adaptive pooling can be performed on the convolved first image feature based on the pose offset to obtain the second image feature.
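The offset-guided sampling described above can be sketched as follows. This is a simplified illustration under several assumptions that are not stated in the application: the weighted offset is taken as the product of the offset weight and the estimated offset, it is fused with the keypoint by addition, and one feature vector is read per keypoint by bilinear interpolation. The convolution applied before pooling is omitted, and the names `bilinear_sample` and `pose_pool` are hypothetical.

```python
import math
from typing import List, Tuple

FeatureMap = List[List[List[float]]]  # indexed as [row][column][channel]


def bilinear_sample(fmap: FeatureMap, x: float, y: float) -> List[float]:
    """Read a feature vector at a sub-pixel location (x, y), clamped to the map."""
    h, w = len(fmap), len(fmap[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    ax, ay = x - x0, y - y0
    channels = len(fmap[0][0])
    return [
        (1 - ay) * ((1 - ax) * fmap[y0][x0][c] + ax * fmap[y0][x1][c])
        + ay * ((1 - ax) * fmap[y1][x0][c] + ax * fmap[y1][x1][c])
        for c in range(channels)
    ]


def pose_pool(
    fmap: FeatureMap,
    keypoints: List[Tuple[float, float]],
    offsets: List[Tuple[float, float]],
    offset_weights: List[float],
) -> List[List[float]]:
    """Sample one feature vector per keypoint at the offset-corrected location."""
    pooled = []
    for (kx, ky), (dx, dy), w in zip(keypoints, offsets, offset_weights):
        # Assumed fusion rule: shift the keypoint by the weighted offset.
        sx, sy = kx + w * dx, ky + w * dy
        pooled.append(bilinear_sample(fmap, sx, sy))
    return pooled
```

In a trained system the offsets and offset weights would come from the pose offset estimation model; here they are plain inputs so that the sampling step can be read in isolation.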
[0044] It should be noted that in video segmentation processing mode, intermediate frames have sufficient contextual information, while edge frames located at the beginning and end of segments are prone to missing pose information or pose extraction errors. To address this, the pose offset can be calculated only for the first image feature of the edge frames, and then adaptive pooling can be performed on the first image feature of the edge frames using the pose offset. For other image frames, adaptive pooling can be performed directly. The edge frames can be the beginning and end frames of a video segment, or several consecutive image frames before and after a segment. This approach of performing adaptive pooling based on pose offset only on edge frames not only compensates for missing or incorrect pose information in video edge regions but also reduces computational complexity.
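Selecting the edge frames of a segment, as described above, might look like the following sketch. The function name `edge_frame_indices` and the parameter `margin` (the number of frames treated as edge frames at each end) are hypothetical; the application leaves the exact count open.

```python
from typing import List


def edge_frame_indices(segment_len: int, margin: int = 1) -> List[int]:
    """Return the indices of the first and last `margin` frames of a segment.

    Indices are local to the segment. If the segment is shorter than
    2 * margin, every frame is returned once.
    """
    margin = max(0, min(margin, segment_len))
    head = range(0, margin)
    tail = range(max(margin, segment_len - margin), segment_len)
    return list(head) + list(tail)
```

With `margin=1` only the first and last frames receive offset-based pooling; a larger margin trades extra computation for a wider corrected region.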
[0045] Furthermore, after obtaining the second image features, the first and second image features can be fused to generate updated image features. These updated image features are then used as the first image features to re-execute adaptive pose pooling, thus achieving iterative processing of adaptive pose pooling. Specifically, the fusion of the first and second image features can be performed using a weighted method, that is, the first and second image features are weighted to obtain the updated image features.
[0046] S40. Using a cross-attention mechanism, determine the motion features corresponding to the edge temporal motion features in the temporal motion features based on the temporal motion features and the second image features, and integrate the motion features corresponding to the edge temporal motion features and other temporal motion features in the temporal motion features except for the neighborhood temporal motion features to obtain motion features with global temporal constraints.
[0047] Specifically, the motion features with global temporal constraints are obtained through temporal consistency modeling. Temporal consistency modeling uniformly models the two-dimensional human pose within and between video segments, explicitly capturing the temporal dependencies of human pose and alleviating the boundary discontinuity problem caused by video segmentation. During temporal consistency modeling, the image detail information contained in the second image features and the dynamic change patterns of the human body reflected by the temporal motion features are used to construct a temporal association mechanism across video segments. That is, a cross-attention mechanism is used to determine the motion features corresponding to the edge temporal motion features based on the temporal motion features and the second image features. Then, these motion features are integrated with the temporal motion features other than the neighborhood temporal motion features to obtain motion features with global temporal constraints.
[0048] For example, for the middle image frames of a video segment, their corresponding second image features and temporal motion features can be fused and encoded to capture short-term motion dependencies between consecutive frames within the segment. For the edge image frames of a video segment, an inter-segment attention mechanism can be introduced to associate and match the motion features of the starting image frame in the current video segment with the motion features of the ending image frame of the preceding segment, so that adjacent segments can maintain the continuity of action during pose transitions.
[0049] In one embodiment, determining the motion features corresponding to the edge temporal motion features in the temporal motion features based on the temporal motion features and the second image features using a cross-attention mechanism specifically includes: Edge temporal motion features are selected from the temporal motion features based on video segments; Obtain the neighborhood temporal motion feature sequence of the temporal motion feature and the corresponding second image feature sequence of the neighborhood temporal motion feature sequence; A cross-attention mechanism is used to interact the neighborhood temporal motion feature sequence with its corresponding second image feature sequence to obtain the motion features corresponding to the edge temporal motion features.
[0050] Specifically, edge temporal motion features refer to the temporal motion features corresponding to image frames located at the edges of a video segment, such as the temporal motion features of the starting and ending image frames of a video segment. The neighborhood temporal motion feature sequence is the sequence of temporal motion features in image frames that are temporally adjacent to the edge temporal motion features. For example, if the current processing involves the temporal motion features of the edge of the ending image frame of the Nth video segment, then its neighborhood temporal motion feature sequence could be the temporal motion features of the starting image frame of the (N+1)th video segment, the temporal motion features of several image frames after the starting image frame, and the temporal motion features of several image frames before the ending image frame in the Nth video segment. Similarly, if the current processing involves the edge temporal motion features of the starting image frame of the Nth video segment, then its neighborhood temporal motion feature sequence could be the temporal motion features of the ending image frame of the (N-1)th video segment, the temporal motion features of several image frames before the ending image frame, and the temporal motion features of several image frames after the starting image frame in the Nth video segment. The neighborhood temporal motion feature sequence corresponds to the second image feature sequence, which is a sequence of second image features obtained after adaptive pose pooling processing of the image frames from which the neighborhood temporal motion features originate.
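The neighborhood of an edge frame described above can be sketched with global frame indices, so that the window naturally spans the segment boundary. This is an illustrative assumption: the function name `neighborhood_indices` and the symmetric `radius` are hypothetical, and the application only says "several" frames on each side.

```python
from typing import List


def neighborhood_indices(edge_idx: int, num_frames: int, radius: int = 2) -> List[int]:
    """Global indices of frames within `radius` of an edge frame.

    The edge frame itself is excluded, and the window is clipped to the
    video, so the first and last frames have one-sided neighborhoods.
    """
    lo = max(0, edge_idx - radius)
    hi = min(num_frames - 1, edge_idx + radius)
    return [i for i in range(lo, hi + 1) if i != edge_idx]


# Ending frame of the first 100-frame segment in a 250-frame video:
# the neighborhood covers frames on both sides of the boundary.
neighbors = neighborhood_indices(99, num_frames=250, radius=2)
```

The same indices select both the neighborhood temporal motion features and the corresponding second image features, which keeps the two sequences aligned frame by frame.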
[0051] In this embodiment, a cross-attention mechanism is used to interact the neighborhood temporal motion feature sequence with its corresponding second image feature sequence, in order to establish a correlation between the edge temporal motion features and the image content and motion information of neighboring video segments. Specifically, the cross-attention mechanism enables the edge temporal motion features to focus on the components of the neighborhood temporal motion feature sequence that are relevant to motion coherence, while incorporating the image detail information (such as joint edges and body contours) contained in the neighborhood second image feature sequence to refine and enhance the edge temporal motion features.
[0052] In one embodiment, the step of using a cross-attention mechanism to interact the neighborhood temporal motion feature sequence with its corresponding second image feature sequence to obtain the motion features corresponding to the edge temporal motion features specifically includes: A query vector is constructed based on the neighborhood temporal motion feature sequence, and a value vector and a key vector are constructed based on the second image feature sequence corresponding to the neighborhood temporal motion feature sequence. Based on the query vector, the value vector, and the key vector, the motion features corresponding to the edge temporal motion features are determined using a cross-attention mechanism.
[0053] Specifically, the query vector can be constructed by first embedding each temporal motion feature in the neighborhood temporal motion feature sequence to obtain a fixed-dimensional vector representation, and then fusing the vector representations into a query vector through concatenation or weighted summation. This query vector aims to capture the overall dynamic trend of the neighborhood temporal motion. The value vector and key vector are constructed in the same way as the query vector. The key vector represents the feature attributes of each image frame in the neighborhood second image feature sequence, while the value vector provides the detailed feature information corresponding to these feature attributes. During the cross-attention calculation, the similarity between the query vector and each key vector is computed to obtain the attention weights. These weights reflect the importance of the different image frame features in the neighborhood second image feature sequence for correcting the edge temporal motion features. The value vectors are then weighted and summed using the attention weights to obtain the motion features corresponding to the edge temporal motion features, which fuse neighborhood image details and motion information. In this way, the edge temporal motion features can effectively draw on the information of neighboring segments, thereby better maintaining the continuity of actions in temporal consistency modeling.
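The query, key, and value interaction described above can be sketched with scaled dot-product attention. This is a minimal illustration, not the application's exact formulation: the learned embedding and projection layers are omitted, the query is formed by an equal-weight summation (a mean) of the neighborhood motion embeddings, and the scaling by the square root of the dimension is a common convention assumed here. The names `mean_pool`, `softmax`, and `cross_attention` are hypothetical.

```python
import math
from typing import List, Tuple

Vector = List[float]


def mean_pool(vectors: List[Vector]) -> Vector:
    """Fuse a sequence of embeddings into one vector by equal-weight summation."""
    n = len(vectors)
    return [sum(v[d] for v in vectors) / n for d in range(len(vectors[0]))]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(query: Vector, keys: List[Vector], values: List[Vector]) -> Tuple[Vector, List[float]]:
    """Attend from a motion-derived query to image-derived keys and values.

    Returns the weighted sum of the values and the attention weights.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    out = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return out, weights
```

Here `keys` and `values` would be derived from the second image features of the neighborhood frames, and the returned vector plays the role of the refined motion feature for the edge frame.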
[0054] Furthermore, after the interactively enhanced motion features corresponding to the edge temporal motion features are obtained, they are integrated with the other temporal motion features of the video segment, that is, those other than the neighborhood temporal motion features used for the interaction (such as the temporal motion features of intermediate image frames in the video segment). The integration can be simple concatenation, or weighted fusion or another method that combines the corrected motion features of the edge frames with the original motion features of the intermediate frames, ultimately forming motion features with global temporal constraints for the entire video segment. In this embodiment, the global temporal constraint means that the motion features include not only short-term motion dependencies within the segment but also, through edge frame processing, long-term motion correlations across segments. As a result, the motion features of the edge image frames no longer rely solely on incomplete context but also receive strong spatial constraints from the image frames, which significantly reduces pose jumps and motion breaks and keeps the output 3D pose sequence smooth and coherent over long time scales.
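The integration step can be sketched as follows. The sketch assumes, for illustration only, that the edge frames are the first and last frames of a segment and that all features share one dimension; the function name `integrate_segment` and the blending factor `alpha` are hypothetical. With `alpha=1.0` it corresponds to simple concatenation, and with smaller values to a weighted fusion of the corrected and original edge features.

```python
import numpy as np

def integrate_segment(segment_feats, corrected_first, corrected_last, alpha=1.0):
    """segment_feats: (T, d) temporal motion features of one segment.
    corrected_first / corrected_last: (d,) corrected edge features.
    alpha: weight of the corrected feature against the original edge feature."""
    middle = segment_feats[1:-1]       # intermediate frames are kept unchanged
    first = alpha * corrected_first + (1.0 - alpha) * segment_feats[0]
    last = alpha * corrected_last + (1.0 - alpha) * segment_feats[-1]
    # Reassemble the segment in temporal order.
    return np.concatenate([first[None, :], middle, last[None, :]], axis=0)

segment = np.arange(12, dtype=float).reshape(4, 3)
corrected_first = np.full(3, -1.0)
corrected_last = np.full(3, 100.0)
spliced = integrate_segment(segment, corrected_first, corrected_last)
blended = integrate_segment(segment, corrected_first, corrected_last, alpha=0.5)
```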
[0055] In one embodiment, a selective activation mechanism can be configured during temporal consistency modeling to control the temporal consistency modeling of edge image frames while omitting it for other image frames. Furthermore, during temporal consistency modeling of edge image frames, motion weights for temporal motion features and image weights for image features can be adaptively configured to ensure pose smoothness under different motion states.
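One simple way to realize such selective activation is a boolean mask that marks the edge image frames of every segment, so that the cross-attention correction runs only where the mask is set. The sketch below is an assumption-laden illustration: it treats the first and last `edge_width` frames of each fixed-length segment as edge frames, and the function name and parameters are hypothetical.

```python
import numpy as np

def edge_frame_mask(num_frames, segment_len, edge_width=1):
    """Return a boolean array that is True for edge frames of each segment."""
    mask = np.zeros(num_frames, dtype=bool)
    for start in range(0, num_frames, segment_len):
        end = min(start + segment_len, num_frames)
        mask[start:min(start + edge_width, end)] = True   # leading edge
        mask[max(end - edge_width, start):end] = True     # trailing edge
    return mask

mask = edge_frame_mask(num_frames=8, segment_len=4)
edge_indices = np.flatnonzero(mask)   # frames that would be corrected
```

For a sequence of 8 frames split into segments of 4, the mask flags frames 0, 3, 4, and 7, and the remaining frames skip temporal consistency modeling.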
[0056] Based on this, the step of determining the motion features corresponding to the edge temporal motion features using a cross-attention mechanism based on the query vector, the value vector, and the key vector specifically includes: obtaining the confidence level of the edge temporal motion features; determining motion weights based on the confidence level, and determining image weights based on the motion weights; and, based on the motion weights and the image weights, interacting the query vector, the value vector, and the key vector using an attention mechanism to obtain the motion features corresponding to the edge temporal motion features.
[0057] Specifically, the confidence level is the confidence level of the two-dimensional human pose corresponding to the edge temporal motion features. That is, when the two-dimensional human pose is estimated, its confidence level is output at the same time to reflect the reliability of the pose estimation result. The higher the confidence level, the more accurate the two-dimensional human pose, and vice versa.
[0058] Motion weights are positively correlated with confidence levels; that is, the higher the confidence level, the greater the motion weight, and the more emphasis the cross-attention interaction places on the dynamic information reflected by the temporal motion features. Image weights complement motion weights: the sum of the image weight and the motion weight is 1, so image weights are negatively correlated with the confidence level. When the motion weight decreases due to low confidence, the image weight increases accordingly, and the correction of the edge temporal motion features relies more on the image detail information in the neighborhood second image feature sequence. In other words, when the motion is intense or occlusion occurs, resulting in low temporal prediction accuracy (i.e., low confidence), the image weight of the image features is increased; when the motion is stable (i.e., high confidence), the original temporal motion features are maintained.
[0059] For example, when the confidence level of the edge temporal motion features is 0.8, the motion weight can be set to 0.8 and the image weight to 0.2. In this case, the interaction mainly relies on the dynamic trend of the temporal motion features. If the confidence level drops to 0.3, the motion weight is adjusted to 0.3 and the image weight is increased to 0.7. In this case, details such as joint edges and body contours in the neighborhood images dominate the correction process. When the cross-attention mechanism is used for interaction, the motion weight is applied to the query vector (derived from the neighborhood temporal motion feature sequence), and the image weight is applied to the key vector and value vector (derived from the neighborhood second image feature sequence). By adjusting the contribution ratio of information from the two sources, the correction of the edge temporal motion features adapts dynamically to their own confidence level: at high confidence, motion coherence is reinforced, and at low confidence, image details are used to compensate for pose errors. Thus, smooth pose transitions and temporal consistency can be maintained across various motion states.
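The numerical example above can be reproduced with a short sketch. The text does not give the exact formula by which the two weights enter the attention computation, so the sketch makes an explicit assumption: the motion weight equals the clipped confidence, the image weight is its complement, and the corrected edge feature is a convex combination of the original edge motion feature and the image feature produced by cross-attention. The function names and the requirement that both features share one dimension are hypothetical.

```python
import numpy as np

def modality_weights(confidence):
    """Motion weight follows the 2D pose confidence; the two weights sum to 1."""
    motion_w = float(np.clip(confidence, 0.0, 1.0))
    image_w = 1.0 - motion_w
    return motion_w, image_w

def blend_edge_feature(edge_motion_feat, attended_image_feat, confidence):
    # Assumed fusion rule: convex combination of the two sources.
    motion_w, image_w = modality_weights(confidence)
    return motion_w * edge_motion_feat + image_w * attended_image_feat

# Toy features chosen so that each output coordinate shows one weight.
edge_motion = np.array([1.0, 0.0])
attended_image = np.array([0.0, 1.0])
high_conf = blend_edge_feature(edge_motion, attended_image, 0.8)
low_conf = blend_edge_feature(edge_motion, attended_image, 0.3)
```

With a confidence of 0.8 the result stays close to the motion feature, and with 0.3 the attended image feature dominates, matching the qualitative behavior described for stable and unreliable pose estimates.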
[0060] In summary, this embodiment provides a motion feature extraction method. The method includes: dividing a video sequence into multiple video segments; performing two-dimensional human pose estimation on the video segments to obtain the two-dimensional human pose and first image features of the video segments; performing adaptive pose pooling on the first image features based on the two-dimensional human pose to obtain second image features, and determining temporal motion features based on the two-dimensional human pose; and performing temporal consistency modeling on the second image features and the temporal motion features to obtain motion features with global temporal constraints. By introducing temporal consistency modeling of human pose features within and between video segments at the three-dimensional pose mapping stage, this application explicitly models the dependence of human pose in the time dimension and alleviates the boundary discontinuity problem caused by segment-wise video processing.
[0061] Based on the above motion feature extraction method, this embodiment provides a human pose estimation method, wherein the human pose estimation method specifically includes: extracting motion features of the target video using the motion feature extraction method described above; and estimating the three-dimensional human pose based on the motion features.
[0062] Based on the above motion feature extraction method, this embodiment provides a human body network reconstruction method, wherein the human body network reconstruction method specifically includes: extracting motion features of the target video using the motion feature extraction method described above; fusing the motion features with the image features of the target video to obtain fused features; performing mesh reconstruction in the camera coordinate system based on the fused features to obtain a first human body mesh; performing mesh reconstruction in the world coordinate system based on the motion features to obtain a second human body mesh, and predicting the motion trajectory and motion speed based on the second human body mesh; and adjusting the first human body mesh based on the motion trajectory and motion speed to obtain the human body reconstruction network corresponding to the target video.
[0063] Based on the above motion feature extraction method, this embodiment provides a computer-readable storage medium storing one or more programs, which can be executed by one or more processors to implement the steps in the motion feature extraction method as described in the above embodiment.
[0064] Based on the above motion feature extraction method, this application also provides a terminal device. As shown in Figure 4, the terminal device includes at least one processor 20, a display screen 21, and a memory 22, and may also include a communications interface 23 and a bus 24. The processor 20, display screen 21, memory 22, and communications interface 23 can communicate with each other via the bus 24. The display screen 21 is configured to display a preset user guide interface in the initial setup mode. The communications interface 23 can transmit information. The processor 20 can invoke logical instructions in the memory 22 to execute the methods described in the above embodiments.
[0065] Furthermore, the logical instructions in the aforementioned memory 22 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium.
[0066] The memory 22, as a computer-readable storage medium, can be configured to store software programs, computer-executable programs, such as program instructions or modules corresponding to the methods in the embodiments of this disclosure. The processor 20 executes functional applications and data processing by running the software programs, instructions, or modules stored in the memory 22, thereby implementing the methods in the above embodiments.
[0067] The memory 22 may include a program storage area and a data storage area. The program storage area may store the operating system and application programs required for at least one function; the data storage area may store data created based on the use of the terminal device. Furthermore, the memory 22 may include high-speed random access memory (RAM) and non-volatile memory. Examples include various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks, as well as transient storage media.
[0068] Furthermore, the specific process by which the processor loads and executes the multiple instructions in the aforementioned storage medium and terminal device has been described in detail in the above method and will not be repeated here.
[0069] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.
Claims
1. A motion feature extraction method characterized by, The motion feature extraction method specifically includes: Obtain a video sequence and divide the video sequence into multiple video segments; Two-dimensional human pose estimation is performed on the video segment to obtain the two-dimensional human pose and first image features of the video segment; Based on the two-dimensional human pose, the first image features are subjected to adaptive pose pooling to obtain the second image features, and the temporal motion features are determined based on the two-dimensional human pose. The motion features corresponding to the edge temporal motion features in the temporal motion features are determined based on the temporal motion features and the second image features using a cross-attention mechanism. The motion features corresponding to the edge temporal motion features and other temporal motion features other than the neighborhood temporal motion features are integrated to obtain motion features with global temporal constraints.
2. The motion feature extraction method of claim 1, wherein, The step of using the cross-attention mechanism to determine the motion features corresponding to the edge temporal motion features in the temporal motion features based on the temporal motion features and the second image features specifically includes: Edge temporal motion features are selected from the temporal motion features based on video segments; Obtain the neighborhood temporal motion feature sequence of the temporal motion feature and the corresponding second image feature sequence of the neighborhood temporal motion feature sequence; A cross-attention mechanism is used to interact the neighborhood temporal motion feature sequence with its corresponding second image feature sequence to obtain the motion features corresponding to the edge temporal motion features.
3. The motion feature extraction method of claim 2, wherein, The step of using a cross-attention mechanism to interact the neighborhood temporal motion feature sequence with its corresponding second image feature sequence to obtain the motion features corresponding to the edge temporal motion features specifically includes: A query vector is constructed based on the neighborhood temporal motion feature sequence, and a value vector and a key vector are constructed based on the second image feature sequence corresponding to the neighborhood temporal motion feature sequence. Based on the query vector, the value vector, and the key vector, the motion features corresponding to the edge temporal motion features are determined using a cross-attention mechanism.
4. The motion feature extraction method of claim 3, wherein, The step of determining the motion features corresponding to the edge temporal motion features using a cross-attention mechanism based on the query vector, the value vector, and the key vector specifically includes: Obtain the confidence level of the edge temporal motion features; Motion weights are determined based on the confidence level, and image weights are determined based on the motion weights. Based on the motion weights and the image weights, the query vector, value vector, and key vector are interacted using an attention mechanism to obtain the motion features corresponding to the edge temporal motion features.
5. The motion feature extraction method of claim 4, wherein, The motion weights are positively correlated with the confidence level, while the image weights are negatively correlated with the confidence level.
6. The motion feature extraction method of claim 1, wherein, The step of performing adaptive pose pooling processing on the first image features based on the two-dimensional human pose to obtain the second image features specifically includes: The pose offset corresponding to the first image features is determined by pose offset perception; Based on the two-dimensional human pose, the first image features are sampled using the pose offset to obtain the second image features.
7. A human pose estimation method, characterized in that, The human pose estimation method specifically includes: Motion features of the target video are extracted using the motion feature extraction method as described in any one of claims 1-6; The three-dimensional human pose is estimated based on the motion features.
8. A method for reconstructing a human body network, characterized in that, The human body network reconstruction method specifically includes: Motion features of the target video are extracted using the motion feature extraction method as described in any one of claims 1-6; The motion features are fused with the image features of the target video to obtain fused features; Based on the fusion features, mesh reconstruction is performed in the camera coordinate system to obtain the first human body mesh; Based on the motion characteristics, a mesh is reconstructed in the world coordinate system to obtain a second human body mesh, and the motion trajectory and motion speed are predicted based on the second human body mesh; The first human body mesh is adjusted based on the motion trajectory and motion speed to obtain the human body reconstruction network corresponding to the target video.
9. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores one or more programs, which can be executed by one or more processors to implement the steps in the motion feature extraction method as described in any one of claims 1-6.
10. A terminal device, characterized in that it comprises: a processor and a memory; the memory stores a computer-readable program that can be executed by the processor; when the processor executes the computer-readable program, it implements the steps of the motion feature extraction method as described in any one of claims 1-6.