Pose estimation method

By acquiring and adjusting the 3D pose and adjacency matrix of video frames, and combining the trajectory information of key points, the problem of insufficient accuracy in 3D pose estimation is solved, and higher-precision pose recognition is achieved.

CN119723655B · Active · Publication Date: 2026-06-12 · JILIN UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
JILIN UNIVERSITY
Filing Date
2024-11-19
Publication Date
2026-06-12

AI Technical Summary

Technical Problem

Existing technologies suffer from insufficient accuracy in 3D pose estimation, especially due to the ambiguity between 2D and 3D, occlusion between humans and the environment, and the complexity of human posture, resulting in low recognition accuracy.

Method used

By acquiring the initial 3D pose of the target object in multiple video frames, determining the trajectory information of the joints, adjusting the adjacency matrix, and combining the trajectory similarity and the initial pose, the pose features of the target object are determined, and finally the 3D pose of the target is estimated.

Benefits of technology

It improves the accuracy of 3D pose recognition by supplementing the motion characteristics and connection relationships of joints, thereby enhancing the precision of pose estimation.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses a pose estimation method, and belongs to the technical field of computers. A plurality of first initial three-dimensional poses of a target object in a plurality of first video frames and a plurality of second initial three-dimensional poses of the target object in a plurality of second video frames are acquired; based on initial three-dimensional positions of a plurality of joint nodes of the target object in the plurality of second initial three-dimensional poses, trajectory information of each joint node in the plurality of second video frames is determined; based on the trajectory information of each joint node in the plurality of second video frames, an initial adjacency matrix of the target object is adjusted to obtain a target adjacency matrix of the target object; based on the target adjacency matrix, trajectory similarity between each joint node and the plurality of first initial three-dimensional poses, a pose feature of the target object in a third video frame is determined, the third video frame being a middle frame of the plurality of first video frames; and based on the pose feature of the target object in the third video frame, a target three-dimensional pose of the target object in the third video frame is determined.

Description

Technical Field

[0001] This application relates to the field of computer technology, and in particular to a pose estimation method.

Background Technology

[0002] With the development of computer technology, artificial intelligence technology has made rapid progress. In order to improve the efficiency of image or video processing, computer vision technology has emerged.

[0003] 3D Human Pose Estimation (3D HPE) is a fundamental computer vision task that aims to identify the pose of a target object based on an input image or video (i.e., a sequence of images). Due to the ambiguity between 2D and 3D, occlusion between humans and their environment, and the complexity of human poses, 3D pose estimation is an extremely challenging task.

[0004] Therefore, improving the accuracy of 3D pose recognition is a hot research topic.

Summary of the Invention

[0005] This application provides a pose estimation method that can improve the accuracy of 3D pose recognition. The technical solution is as follows:

[0006] In one aspect, a pose estimation method is provided, the method comprising:

[0007] Multiple first initial three-dimensional poses of the target object in multiple first video frames and multiple second initial three-dimensional poses of the target object in multiple second video frames are obtained. The multiple first video frames and the multiple second video frames are obtained by sampling the video frame set of the target object at intervals in time order, and the sampling frequency of the multiple second video frames is lower than that of the multiple first video frames. One initial three-dimensional pose corresponds to one video frame.

[0008] Based on the initial three-dimensional positions of multiple joints of the target object in the plurality of second initial three-dimensional poses, the trajectory information of each joint in the plurality of second video frames is determined, and the trajectory information is used to represent the motion of the joints.

[0009] Based on the trajectory information of each of the joints in the plurality of second video frames, the initial adjacency matrix of the target object is adjusted to obtain the target adjacency matrix of the target object. The initial adjacency matrix is used to represent the connection relationship of the plurality of joints connected by the skeleton.

[0010] Based on the target adjacency matrix, the trajectory similarity between each of the joints, and the plurality of first initial three-dimensional poses, the pose features of the target object in the third video frame are determined, the third video frame being the intermediate frame of the plurality of first video frames.

[0011] Based on the pose characteristics of the target object in the third video frame, the target three-dimensional pose of the target object in the third video frame is determined.

[0012] In another aspect, a pose estimation device is provided, the device comprising:

[0013] The acquisition module is used to acquire multiple first initial three-dimensional poses of the target object in multiple first video frames and multiple second initial three-dimensional poses in multiple second video frames. The multiple first video frames and the multiple second video frames are obtained by sampling from the video frame set of the target object in time order at intervals, and the sampling frequency of the multiple second video frames is less than that of the multiple first video frames. One initial three-dimensional pose corresponds to one video frame.

[0014] The trajectory information determination module is used to determine the trajectory information of each joint in the multiple second video frames based on the initial three-dimensional positions of multiple joints of the target object in the multiple second initial three-dimensional poses. The trajectory information is used to represent the motion of the joint.

[0015] An adjustment module is used to adjust the initial adjacency matrix of the target object based on the trajectory information of each of the joints in the plurality of second video frames, to obtain the target adjacency matrix of the target object. The initial adjacency matrix is used to represent the connection relationship between the plurality of joints through the skeleton.

[0016] The pose feature determination module is used to determine the pose features of the target object in the third video frame based on the target adjacency matrix, the trajectory similarity between each of the joints, and the plurality of first initial three-dimensional poses, wherein the third video frame is the intermediate frame of the plurality of first video frames;

[0017] The target 3D pose determination module is used to determine the target 3D pose of the target object in the third video frame based on the pose features of the target object in the third video frame.

[0018] In one possible implementation, the trajectory information determination module is used to determine the displacement of the plurality of joints in each of the second video frames based on the initial three-dimensional positions of the plurality of joints of the target object in the plurality of second initial three-dimensional poses, wherein the displacement is used to represent the position change of the joints in adjacent video frames; determine the velocity information and direction information of each joint in each of the second video frames based on the displacement of the plurality of joints in each of the second video frames; and concatenate the velocity information and direction information of each joint in each of the second video frames to obtain the trajectory information of each joint in the plurality of second video frames.
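The displacement, velocity, direction, and concatenation steps described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `trajectory_info`, the nested-list data layout, and the choice to treat one sampling interval as the unit of time are all assumptions made for the example.

```python
import math

def trajectory_info(positions):
    """positions[t][j] is the (x, y, z) position of joint j in sampled frame t.

    Returns info[t][j] = [speed, dx, dy, dz] for t = 0 .. T-2, where the
    displacement is taken between adjacent sampled frames t and t + 1.
    (Hypothetical layout; the patent does not prescribe one.)
    """
    info = []
    for prev, curr in zip(positions, positions[1:]):
        frame_info = []
        for p, c in zip(prev, curr):
            # Displacement: position change of the joint between adjacent frames.
            disp = [c[k] - p[k] for k in range(3)]
            # Speed: displacement magnitude per sampling interval (assumed unit time).
            speed = math.sqrt(sum(d * d for d in disp))
            # Direction: unit vector; a stationary joint gets a zero vector.
            direction = [d / speed for d in disp] if speed > 0 else [0.0, 0.0, 0.0]
            # Concatenate speed and direction into one trajectory vector.
            frame_info.append([speed] + direction)
        info.append(frame_info)
    return info

# Two sampled frames, two joints: joint 0 moves, joint 1 stays still.
poses = [
    [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)],
    [(3.0, 4.0, 0.0), (1.0, 1.0, 1.0)],
]
traj = trajectory_info(poses)
```

With T sampled frames the sketch yields T-1 trajectory vectors per joint, because each vector is derived from a pair of adjacent frames.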

[0019] In one possible implementation, the adjustment module is configured to determine the trajectory similarity between the joints based on the trajectory information of each joint in the plurality of second video frames; and adjust the initial adjacency matrix of the target object based on the trajectory similarity between the joints to obtain the target adjacency matrix of the target object.
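One way to picture the adjustment is to add connections between joints whose motion is similar, even if no bone links them. The sketch below is an assumption-laden illustration: the text does not specify the similarity measure or the adjustment rule, so cosine similarity, the `threshold` parameter, and the function names are hypothetical stand-ins.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def adjust_adjacency(initial_adj, joint_features, threshold=0.9):
    """Return a copy of initial_adj with extra edges between joints whose
    trajectory features are similar (assumed rule, not taken from the patent)."""
    n = len(initial_adj)
    target = [row[:] for row in initial_adj]
    for i in range(n):
        for j in range(n):
            if i != j and cosine_similarity(joint_features[i], joint_features[j]) >= threshold:
                target[i][j] = 1
    return target

# Three joints connected as a chain 0-1-2 by the skeleton.
initial_adj = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
# Joints 0 and 2 move in the same direction; joint 1 moves orthogonally.
features = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
target_adj = adjust_adjacency(initial_adj, features)
```

In this toy case joints 0 and 2 are not connected by a bone, but their matching motion adds an edge between them, which is the kind of supplementary connection the adjusted matrix is meant to capture.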

[0020] In one possible implementation, the adjustment module is configured to extract features from the trajectory information of each of the joint points in the plurality of second video frames to obtain the spatiotemporal trajectory features of the plurality of joint points; perform convolution, linear transformation and nonlinear transformation on the spatiotemporal trajectory features to obtain the motion features of each of the joint points; and determine the trajectory similarity between the joint points based on the motion features of each of the joint points.

[0021] In one possible implementation, the adjustment module is configured to perform graph convolution on the trajectory information of each of the joint points in the plurality of second video frames to obtain the spatial trajectory features of each of the joint points in each of the second video frames; concatenate the spatial trajectory features of each of the joint points in each of the second video frames with the position embedding features of each of the second video frames in the plurality of second video frames to obtain reference joint point features of each of the joint points; and encode the reference joint point features of each of the joint points based on an attention mechanism to obtain the spatiotemporal trajectory features of the plurality of joint points.

[0022] In one possible implementation, the pose feature determination module is used to extract features from the plurality of first initial three-dimensional poses to obtain initial three-dimensional pose features of each first initial three-dimensional pose; and to determine the pose features of the target object in the third video frame based on the target adjacency matrix, the initial three-dimensional pose features of each first initial three-dimensional pose, and the trajectory similarity between each of the joints.

[0023] In one possible implementation, the pose feature determination module is used to embed and encode each of the first initial three-dimensional poses to obtain pose embedding features of each of the first initial three-dimensional poses; and to perform graph convolution on the pose embedding features of each of the first initial three-dimensional poses based on the initial adjacency matrix to obtain initial three-dimensional pose features of each of the first initial three-dimensional poses.

[0024] In one possible implementation, the pose feature determination module is configured to fuse the target adjacency matrix, the initial three-dimensional pose features of each of the first initial three-dimensional poses, and the trajectory similarity between each of the joints to obtain fused pose features of each of the first initial three-dimensional poses; concatenate the fused pose features of each of the first initial three-dimensional poses with the position embedding features of each of the first video frames in the plurality of first video frames to obtain concatenated pose features of each of the first initial three-dimensional poses; and encode the concatenated pose features of each of the first initial three-dimensional poses based on an attention mechanism to obtain the pose features of the target object in the third video frame.

[0025] In one possible implementation, the target 3D pose determination module is used to perform convolution, linear transformation, and nonlinear transformation on the pose features of the target object in the third video frame to obtain the positions of multiple joints of the target object in the third video frame, and the positions of the multiple joints in the third video frame are used to represent the target 3D pose.

[0026] In one possible implementation, the acquisition module is used to perform two-dimensional pose estimation on the plurality of first video frames and the plurality of second video frames to obtain the first two-dimensional pose of the target object in each of the first video frames and the second two-dimensional pose in each of the second video frames.

[0027] The first two-dimensional pose of each first video frame and the second two-dimensional pose of each second video frame are lifted from two dimensions to three dimensions to obtain multiple first initial three-dimensional poses of the target object in multiple first video frames and multiple second initial three-dimensional poses in multiple second video frames.

[0028] In another aspect, a computer device is provided, the computer device including one or more processors and one or more memories, the one or more memories storing at least one computer program, the computer program being loaded and executed by the one or more processors to implement the pose estimation method.

[0029] In another aspect, a computer-readable storage medium is provided, wherein at least one computer program is stored in the computer-readable storage medium, the computer program being loaded and executed by a processor to implement the pose estimation method.

[0030] In another aspect, a computer program product or computer program is provided, which includes program code stored in a computer-readable storage medium. A processor of a computer device reads the program code from the computer-readable storage medium and executes the program code, causing the computer device to perform the above-described pose estimation method.

Attached Figure Description

[0031] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments will be briefly introduced below. Obviously, the accompanying drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0032] Figure 1 is a schematic diagram of the implementation environment of a pose estimation method provided in an embodiment of this application;

[0033] Figure 2 is a flowchart of a pose estimation method provided in an embodiment of this application;

[0034] Figure 3 is a flowchart of another pose estimation method provided in an embodiment of this application;

[0035] Figure 4 is a flowchart of another pose estimation method provided in an embodiment of this application;

[0036] Figure 5 is a schematic diagram of a joint trajectory provided in an embodiment of this application;

[0037] Figure 6 is a schematic diagram illustrating the extraction of concatenated pose features according to an embodiment of this application;

[0038] Figure 7 is a flowchart of another pose estimation method provided in an embodiment of this application;

[0039] Figure 8 is a schematic diagram of the structure of a pose estimation device provided in an embodiment of this application;

[0040] Figure 9 is a schematic diagram of the structure of a terminal provided in an embodiment of this application;

[0041] Figure 10 is a schematic diagram of the structure of a server provided in an embodiment of this application.

Detailed Implementation

[0042] To make the objectives, technical solutions, and advantages of this application clearer, the embodiments of this application will be described in further detail below with reference to the accompanying drawings.

[0043] In this application, the terms "first," "second," etc., are used to distinguish identical or similar items with essentially the same function. It should be understood that there is no logical or temporal dependency between "first," "second," and "nth," nor are there any restrictions on quantity or execution order.

[0044] In order to illustrate the technical solutions provided in the embodiments of this application, the terms involved in the embodiments of this application will be introduced below.

[0045] Artificial intelligence (AI) is the theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain better results.

[0046] Machine learning (ML) is a multidisciplinary field involving probability theory, statistics, approximation theory, convex analysis, and algorithm complexity theory. It studies how computers can simulate or implement human learning behavior to acquire new knowledge or skills, and how they can reorganize existing knowledge structures to continuously improve their own performance. Machine learning is the core of artificial intelligence and the fundamental way to endow computers with intelligence; its applications span all areas of artificial intelligence.

[0047] Pose estimation: Pose estimation is an important research area in computer vision. Its main task is to detect and locate the position and orientation of various joints and key parts of the human body from a given image or video. Pose estimation can be divided into two main categories: two-dimensional human pose estimation (2D HPE) and three-dimensional human pose estimation (3D HPE). 2D HPE mainly focuses on the coordinate positions of human joints in 2D space, while 3D HPE is dedicated to locating the coordinate positions of human key points in 3D space. This positional information can reflect the motion state of the human body in 3D space.

[0048] Attention mechanism: The attention mechanism is a widely used technique in deep learning that allows models to selectively focus on certain parts of the input data while ignoring others. This mechanism mimics the workings of the human visual and cognitive systems, enabling models to automatically learn and selectively focus on information more critical to the current task objective, thereby improving model performance and generalization ability.

[0049] Multi-head attention mechanism: This is an extension of the attention mechanism widely used in Transformer models. It obtains the attention distribution of different subspaces of the input sequence by running multiple independent attention mechanisms in parallel, thereby capturing more comprehensive potential semantic associations in the sequence.

[0050] Graph Convolutional Network (GCN) is a deep learning model specifically designed for processing graph-structured data. It extends the traditional Convolutional Neural Network (CNN) to graph data, enabling feature extraction and representation learning of nodes by defining convolution operations on the graph, thus effectively processing graph-structured data.
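As an illustration of a convolution defined on a graph, the sketch below implements one layer of a commonly used formulation, H' = ReLU(D^(-1/2) (A + I) D^(-1/2) H W), in plain Python. The text does not state which graph convolution variant is used, so this particular normalization, the helper names, and the toy inputs are assumptions.

```python
import math

def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]

def normalized_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adj)
    a_hat = [[adj[i][j] + (1 if i == j else 0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    return [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]

def gcn_layer(adj, features, weights):
    """One graph convolution: aggregate neighbor features, project, apply ReLU."""
    propagated = matmul(matmul(normalized_adjacency(adj), features), weights)
    return [[max(0.0, x) for x in row] for row in propagated]

# Two connected joints, each with a single scalar feature.
adj = [[0, 1], [1, 0]]
features = [[1.0], [3.0]]
weights = [[1.0]]  # identity projection, chosen only to keep the arithmetic visible
out = gcn_layer(adj, features, weights)
```

Each joint's output mixes its own feature with those of its neighbors, so in this two-joint example both outputs become the average of the two inputs.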

[0051] Normalization: Mapping sequences of values with different ranges to the interval (0, 1) to facilitate data processing. In some cases, normalized values can be directly expressed as probabilities.
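A familiar example of a normalization whose outputs lie in (0, 1) and can be read as probabilities is the softmax function. The sketch below is illustrative only; the text does not say which normalization the method uses.

```python
import math

def softmax(values):
    """Map arbitrary real values to positive weights in (0, 1) that sum to 1."""
    shift = max(values)  # subtract the maximum for numerical stability
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([1.0, 2.0, 3.0])
```

The outputs keep the ordering of the inputs, so the largest input receives the largest share of the total.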

[0052] Embedded coding, mathematically speaking, represents a correspondence, that is, mapping data in space X to space Y using a function F. This function F is injective, and the mapping result preserves the structure. An injective function means that the mapped data uniquely corresponds to the original data, and preserving the structure means that the size relationship between the original and mapped data is the same. For example, if there are data X1 and X2 before mapping, after mapping we get Y1 corresponding to X1 and Y2 corresponding to X2. If the original data X1 > X2, then correspondingly, the mapped data Y1 > Y2. For words, this means mapping words to another space to facilitate subsequent machine learning and processing.

[0053] Attention weights represent the importance of a piece of data during training or prediction. Importance indicates the magnitude of the influence of input data on output data. Data with high importance corresponds to higher attention weights, while data with low importance corresponds to lower attention weights. The importance of data varies in different scenarios, and training the model to assign attention weights is essentially the process of determining data importance.
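The relationship between attention weights and the importance of each input can be illustrated with single-head scaled dot-product attention, the building block of the multi-head mechanism. This is a generic sketch, not the specific encoder of this application; the function names and toy vectors are assumptions.

```python
import math

def softmax(values):
    """Normalize scores into positive weights that sum to 1."""
    shift = max(values)
    exps = [math.exp(v - shift) for v in values]
    total = sum(exps)
    return [e / total for e in exps]

def attention(query, keys, values):
    """Scaled dot-product attention for one query vector.

    Returns the attention weights and the weighted sum of the value vectors.
    """
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    output = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
    return weights, output

query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]       # the first key matches the query
values = [[10.0, 0.0], [0.0, 10.0]]
weights, output = attention(query, keys, values)
```

The key that aligns with the query receives the larger weight, so its value contributes more to the output. A multi-head variant would run several such computations in parallel on different learned projections and combine the results.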

[0054] Figure 1 is a schematic diagram illustrating the implementation environment of a pose estimation method provided in an embodiment of this application. Referring to Figure 1, the implementation environment may include terminal 110 and server 140.

[0055] Terminal 110 is connected to server 140 via a wireless or wired network. Optionally, terminal 110 may be a smartphone, tablet, laptop, desktop computer, etc., but is not limited to these. Terminal 110 has an application that supports pose recognition installed and running.

[0056] Server 140 is a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server that provides basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, Content Delivery Network (CDN), and big data and artificial intelligence platforms. Server 140 provides background services for applications running on terminal 110.

[0057] Those skilled in the art will recognize that the number of terminals described above can be more or less. For example, there may be only one terminal, or there may be dozens or hundreds of terminals, or even more, in which case the above implementation environment may also include other terminals.

[0058] Human pose estimation can be applied to many downstream tasks. For example, motion-sensing game consoles can use visual information captured by cameras to perform real-time human pose estimation, thereby enabling convenient and quick human-computer interaction, providing convenience for better motion-sensing games, and giving players a strong sense of immersion. The metaverse built by Virtual Reality (VR) and Augmented Reality (AR) technologies uses professional motion capture systems to detect human poses and generate virtual character images that resemble real people, making social interaction in the virtual world possible. Epilepsy patients in hospital beds can be monitored in real time through cameras. The system recognizes the patient's human pose based on real-time images transmitted from the camera and issues an alarm in the event of a sudden seizure, reminding doctors to provide timely treatment.

[0059] The aforementioned downstream tasks all require relatively accurate pose estimation results. The technical solution provided in this application can improve the accuracy of pose estimation, thereby improving the execution effect of downstream tasks.

[0060] Figure 2 is a flowchart of a pose estimation method provided in an embodiment of this application. Referring to Figure 2, and taking the server as the executing entity as an example, the method includes the following steps.

[0061] 201. The server obtains multiple first initial three-dimensional poses of the target object in multiple first video frames and multiple second initial three-dimensional poses in multiple second video frames. The multiple first video frames and the multiple second video frames are obtained by sampling from the video frame set of the target object in time order at intervals, and the sampling frequency of the multiple second video frames is less than that of the multiple first video frames. One initial three-dimensional pose corresponds to one video frame.

[0062] The target object is the object whose pose is to be estimated, typically a person. The video frame set includes multiple video frames of the target object, and the video frame set can be considered as a single video of the target object. Multiple first video frames and multiple second video frames are obtained by sampling from this video frame set. The sampling frequency of the multiple second video frames is lower than that of the multiple first video frames, meaning that the time interval between the sampling times of any two adjacent second video frames is greater than the time interval between the sampling times of any two adjacent first video frames. The first initial 3D pose is obtained by performing initial 3D pose estimation on the first video frames, and the second initial 3D pose is obtained by performing initial 3D pose estimation on the second video frames. In some embodiments, the initial 3D pose estimation can employ 3D pose estimation methods from related technologies. One initial 3D pose corresponds to one video frame, meaning one first initial 3D pose corresponds to one first video frame, and one second initial 3D pose corresponds to one second video frame.

[0063] 202. Based on the initial three-dimensional positions of multiple joints of the target object in the multiple second initial three-dimensional poses, the server determines the trajectory information of each joint in the multiple second video frames. This trajectory information is used to represent the motion of the joints.

[0064] The target object includes multiple joints. In this embodiment, pose estimation is used to determine the accurate three-dimensional positions of the multiple joints. The positions of the joints in the second initial three-dimensional pose are also referred to as the initial three-dimensional positions of the joints in the second initial three-dimensional pose. For any joint among the multiple joints, the trajectory information includes the motion of the joint in the multiple second video frames, that is, the positional changes of the joint in the multiple second video frames.

[0065] 203. Based on the trajectory information of each joint in the multiple second video frames, the server adjusts the initial adjacency matrix of the target object to obtain the target adjacency matrix of the target object. The initial adjacency matrix is used to represent the connection relationship between the multiple joints through the skeleton.

[0066] In addition to joints, the target object has a skeleton that connects the joints. The initial adjacency matrix records which joints are connected to each other by the skeleton.

[0067] 204. Based on the target adjacency matrix, the trajectory similarity between each joint, and the multiple first initial 3D poses, the server determines the pose features of the target object in the third video frame, which is the intermediate frame of the multiple first video frames.
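An adjacency matrix of this kind can be built directly from a list of bones. The sketch below uses a hypothetical five-joint skeleton; the joint numbering and the bone list are assumptions made for illustration and do not come from the application.

```python
def skeleton_adjacency(num_joints, bones):
    """Build a symmetric 0/1 adjacency matrix from (joint_a, joint_b) bone pairs."""
    adj = [[0] * num_joints for _ in range(num_joints)]
    for a, b in bones:
        adj[a][b] = 1
        adj[b][a] = 1  # a bone connects both joints, so the matrix is symmetric
    return adj

# Hypothetical joints: 0 hip, 1 spine, 2 head, 3 left hand, 4 right hand.
bones = [(0, 1), (1, 2), (1, 3), (1, 4)]
initial_adj = skeleton_adjacency(5, bones)
```

Entry (i, j) is 1 only when a bone links joints i and j, so joints that often move together but are not physically connected, such as the two hands here, have no entry in the initial matrix.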

[0068] In this context, the intermediate frame is the video frame located in the middle position. For example, if there are 5 video frames, the intermediate frame refers to the 3rd video frame. In this embodiment, the final pose estimation obtains the target's 3D pose from the intermediate frame; the other video frames are used to provide additional information to assist in determining the target's 3D pose from the intermediate frame. Pose features can be viewed as an abstract expression of pose, and the target's 3D pose can be determined using pose features.
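The choice of intermediate frame can be expressed as a simple index computation. The helper below is a hypothetical sketch; the text gives only an odd-length example, so the behavior for an even number of frames is an assumption.

```python
def middle_frame_index(num_frames):
    """Zero-based index of the middle frame.

    For an odd count this is the exact center. For an even count there is no
    single center; this sketch arbitrarily picks the later of the two
    central frames (an assumption, not stated in the text).
    """
    if num_frames <= 0:
        raise ValueError("num_frames must be positive")
    return num_frames // 2

frames = ["f1", "f2", "f3", "f4", "f5"]
middle = frames[middle_frame_index(len(frames))]
```

For the five-frame example in the text, the zero-based index is 2, which selects the 3rd video frame.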

[0069] 205. The server determines the target three-dimensional pose of the target object in the third video frame based on the pose characteristics of the target object in the third video frame.

[0070] The technical solution provided in this application obtains multiple first initial 3D poses of a target object in multiple first video frames and multiple second initial 3D poses in multiple second video frames. Based on the initial 3D positions of multiple joints of the target object in the multiple second initial 3D poses, the trajectory information of each joint in the multiple second video frames is determined, thereby extracting the motion trajectory of the joints. Based on the trajectory information of each joint in the multiple second video frames, the initial adjacency matrix of the target object is adjusted to obtain the target adjacency matrix, thereby supplementing the connection relationship between multiple joints. Based on the target adjacency matrix, the trajectory information of each joint, and the multiple first initial 3D poses, the pose features of the target object in the intermediate frames of the multiple first video frames, that is, the pose features of the third video frame, are determined. Based on the pose features of the target object in the third video frame, the target 3D pose of the target object in the intermediate frames is determined, thereby achieving high-precision pose estimation of the target object.

[0071] Steps 201-205 above are a brief description of the technical solutions provided in the embodiments of this application. The technical solutions provided in the embodiments of this application will be described more clearly below with reference to some examples. Referring to Figure 3, and taking the server as the executing entity as an example, the method includes the following steps.

[0072] 301. The server obtains multiple first video frames and multiple second video frames of the target object.

[0073] The target object is the object whose pose is to be estimated, typically a person. Multiple first video frames and multiple second video frames are video frames arranged chronologically, and the first and second video frames originate from the same source.

[0074] In one possible implementation, the server samples the video frames of the target object in chronological order at intervals to obtain the plurality of first video frames and the plurality of second video frames, wherein the sampling frequency of the plurality of second video frames is less than that of the plurality of first video frames.

[0075] The video frame set includes multiple video frames of the target object, and the video frame set can be considered as a single video of the target object. The sampling frequency of these multiple second video frames is lower than that of the multiple first video frames, meaning that the time interval between the sampling times of any two adjacent second video frames is greater than the time interval between the sampling times of any two adjacent first video frames. For example, if the sampling frequency of the first video frames is one frame every 5 seconds, and the sampling frequency of the second video frames is one frame every 10 seconds, then the number of first video frames sampled within the same sampling time is greater than the number of second video frames.
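The two sampling rates can be sketched as two strides over the same frame sequence. The function name `sample_frames`, the frame count, and the stride values below are assumptions chosen to mirror the 2:1 ratio of the example above; the text does not prescribe how the sampled frames are aligned.

```python
def sample_frames(num_frames, stride):
    """Return the indices of frames taken every `stride` frames, in time order."""
    return list(range(0, num_frames, stride))

# A video of 20 frames: the first set is sampled twice as often as the second.
first_indices = sample_frames(20, 2)
second_indices = sample_frames(20, 4)
```

Over the same span, the denser sampling yields 10 first video frames and the sparser sampling yields 5 second video frames, so adjacent second frames are farther apart in time than adjacent first frames.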

[0076] In this implementation, multiple first video frames and multiple second video frames of the target object are obtained by sampling the video frame set in chronological order. The pose of the target object can then be estimated by using the multiple first video frames and multiple second video frames.

[0077] 302. The server performs two-dimensional pose estimation on the multiple first video frames and the multiple second video frames to obtain the first two-dimensional pose of the target object in each first video frame and the second two-dimensional pose in each second video frame.

[0078] Two-dimensional pose estimation is used to determine the two-dimensional positions of multiple joints of the target object, that is, the planar positions of multiple joints in the video frame.

[0079] In one possible implementation, the server inputs multiple first video frames and multiple second video frames into a two-dimensional pose estimator, and performs two-dimensional pose estimation on the multiple first video frames and multiple second video frames through the two-dimensional pose estimator to obtain the first two-dimensional pose of the target object in each first video frame and the second two-dimensional pose in each second video frame.

[0080] For example, for any one of multiple first video frames, the server inputs the first video frame into a two-dimensional pose estimator, extracts features from the first video frame using the two-dimensional pose estimator, and obtains the first video frame features. The server then uses the two-dimensional pose estimator to perform joint point identification based on the first video frame features, obtaining multiple joints in the first video frame and their two-dimensional positions within the first video frame, thus obtaining the first two-dimensional pose of the first video frame. Similarly, for any one of multiple second video frames, the server inputs the second video frame into a two-dimensional pose estimator, extracts features from the second video frame using the two-dimensional pose estimator, and obtains the second video frame features. The server then uses the two-dimensional pose estimator to perform joint point identification based on the second video frame features, obtaining multiple joints in the second video frame and their two-dimensional positions within the second video frame, thus obtaining the second two-dimensional pose of the second video frame.

[0081] In some embodiments, the two-dimensional pose estimator is a two-dimensional pose estimation model from related technologies, such as a cascaded pyramid network; the embodiments of this application do not limit this.

[0082] 303. The server lifts the first two-dimensional pose of each first video frame and the second two-dimensional pose of each second video frame from two dimensions to three dimensions to obtain multiple first initial three-dimensional poses of the target object in the multiple first video frames and multiple second initial three-dimensional poses in the multiple second video frames, with one initial three-dimensional pose corresponding to one video frame.

[0083] The initial 3D pose estimation can employ 3D pose estimation methods from related technologies. One initial 3D pose corresponds to one video frame, meaning a first initial 3D pose corresponds to one first video frame, and a second initial 3D pose corresponds to one second video frame.

[0084] In one possible implementation, the server inputs the first two-dimensional pose of each first video frame and the second two-dimensional pose of each second video frame into a pose lifter. The pose lifter lifts the first two-dimensional pose of each first video frame and the second two-dimensional pose of each second video frame to three dimensions to obtain multiple first initial three-dimensional poses of the target object in the multiple first video frames and multiple second initial three-dimensional poses in the multiple second video frames.

[0085] The pose lifter is a pose lifter from related technologies, such as a Stride Transformer; the embodiments of this application do not limit this.

[0086] It should be noted that, in the embodiments of this application, the first initial 3D pose in each first video frame and the second initial 3D pose in each second video frame can both be regarded as 3D poses obtained by the 3D pose estimation methods provided in related technologies. Related technologies typically use joint positions to model the human body's pose space; however, this method inevitably ignores some motion characteristics, such as the speed and direction of joint movement. The lack of representation of some characteristics is detrimental to the full modeling of the motion process, thus affecting the final effect of human pose estimation. The technical solution provided in the embodiments of this application can improve the accuracy of pose estimation based on the first initial 3D pose and the second initial 3D pose.

[0087] That is, referring to Figure 4, which illustrates the technical concept of an embodiment of this application taking a human as an example: the two-dimensional human pose is estimated using a two-dimensional human pose estimation model from related technologies, and a two-dimensional-to-three-dimensional lifting model from related technologies then lifts the two-dimensional human pose into an initial three-dimensional pose. The technical solution provided in this application is then used to further process the initial three-dimensional pose to obtain the target three-dimensional pose.

[0088] 304. Based on the initial three-dimensional positions of multiple joints of the target object in the multiple second initial three-dimensional poses, the server determines the trajectory information of each joint in the multiple second video frames. This trajectory information is used to represent the motion of the joints.

[0089] The target object includes multiple joints. In this embodiment, pose estimation is used to determine the accurate three-dimensional positions of the multiple joints. The positions of the joints in the second initial three-dimensional pose are also referred to as the initial three-dimensional positions of the joints in the second initial three-dimensional pose. For any joint among the multiple joints, the trajectory information includes the motion of the joint in the multiple second video frames, that is, the positional changes of the joint in the multiple second video frames.

[0090] In one possible implementation, the server determines the displacement of multiple joints in each second video frame based on the initial 3D positions of multiple joints of the target object in the multiple second initial 3D poses. This displacement represents the positional change of the joints in adjacent video frames. Based on the displacement of the multiple joints in each second video frame, the server determines the velocity and orientation information of each joint in each second video frame. The server then concatenates the velocity and orientation information of each joint in each second video frame to obtain the trajectory information of each joint in the multiple second video frames.

[0091] Here, the velocity information of a joint represents how fast the joint moves, and the direction information of a joint (the orientation information mentioned above) represents the direction in which the joint moves. The velocity information and direction information of the joints are collectively referred to as the trajectory information of the joints, or the motion information of the joints.

[0092] To provide a clearer explanation of the above embodiments, the following description will be divided into several parts.

[0093] The first part involves the server determining the displacement of multiple joints in each second video frame based on the initial three-dimensional positions of multiple joints of the target object in the multiple second initial three-dimensional poses.

[0094] In one possible implementation, for any one of a plurality of second video frames, the server determines the displacement of the plurality of joints in the second video frame using the following formula (1).

[0095]

[0096] Where $\tau_i$ represents the displacement of the multiple joints in second video frame $i$, $J$ represents the number of joints, $x'_i$ represents the initial three-dimensional positions of the multiple joints in second video frame $i$, $x'_{i+1}$ represents the initial three-dimensional positions of the multiple joints in second video frame $i+1$, and $T$ represents the number of second video frames. As can be seen from formula (1), the displacement of the multiple joints in the last of the multiple second video frames is 0.
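The displacement computation can be sketched in NumPy as below. The sketch assumes that formula (1) is a forward difference between adjacent second video frames, with the last frame set to zero as stated above; the function name `joint_displacements` and the array layout `(T, J, 3)` are assumptions made for illustration.

```python
import numpy as np


def joint_displacements(poses):
    """Forward-difference displacement of every joint.

    poses: array of shape (T, J, 3) holding the initial 3D positions of
           J joints in T second video frames (layout is an assumption).
    Returns an array of the same shape; the last frame is all zeros.
    """
    poses = np.asarray(poses, dtype=float)
    tau = np.zeros_like(poses)
    # Assumed reading of formula (1): tau_i = x'_{i+1} - x'_i for i < T.
    tau[:-1] = poses[1:] - poses[:-1]
    return tau
```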

[0097] The second part involves the server determining the velocity and orientation information of each joint in each second video frame based on the displacement of the multiple joints in each second video frame.

[0098] In one possible implementation, for any one of the multiple second video frames, the server decomposes the displacements of multiple joints in the second video frame to obtain the velocity and orientation information of the multiple joints in the second video frame.

[0099] For example, for any second video frame among multiple second video frames, the server obtains the velocity and orientation information of multiple joints in the second video frame using the following formulas (2) and (3).

[0100]

[0101] Where $v_i$ represents the velocity information of the multiple joints in second video frame $i$; $\tau_i^x$, $\tau_i^y$, and $\tau_i^z$ represent the components of the displacement $\tau_i$ of the multiple joints in second video frame $i$ along the x-axis, y-axis, and z-axis, respectively; and $d_i$ represents the orientation information of the multiple joints in second video frame $i$.
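One plausible way to decompose a displacement into velocity and orientation is sketched below. It assumes that the velocity is the Euclidean norm of the three displacement components and that the orientation is the corresponding unit vector; these choices, the function name `decompose_displacement`, and the zero orientation assigned to motionless joints are assumptions, not a statement of formulas (2) and (3).

```python
import numpy as np


def decompose_displacement(tau, eps=1e-8):
    """Split displacements of shape (T, J, 3) into speed and direction.

    Returns (speed, direction) with shapes (T, J, 1) and (T, J, 3).
    """
    tau = np.asarray(tau, dtype=float)
    # Assumed velocity: sqrt(tau_x^2 + tau_y^2 + tau_z^2) per joint.
    speed = np.linalg.norm(tau, axis=-1, keepdims=True)
    # Assumed orientation: unit vector along the displacement; joints that
    # do not move get a zero vector to avoid division by zero.
    direction = np.where(speed > eps, tau / np.maximum(speed, eps), 0.0)
    return speed, direction
```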

[0102] The third part involves the server concatenating the velocity and orientation information of each joint in each second video frame to obtain the trajectory information of each joint in the multiple second video frames.

[0103] In one possible implementation, the server obtains the trajectory information of each joint point in the plurality of second video frames using the following formula (4).

[0104]

[0105] Where $T$ represents the trajectory information of each joint in the multiple second video frames, and $t_i$ represents the trajectory representation of second video frame $i$; the trajectory information includes multiple trajectory representations.
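The concatenation step can be illustrated as follows. The sketch assumes one scalar velocity and one three-component orientation per joint and frame, so that each trajectory representation has four channels; the function name `build_trajectory` and this channel layout are assumptions for illustration.

```python
import numpy as np


def build_trajectory(speed, direction):
    """Concatenate velocity and orientation along the channel axis.

    speed:     array of shape (T, J, 1)
    direction: array of shape (T, J, 3)
    Returns trajectory information of shape (T, J, 4), where slice [i]
    plays the role of the trajectory representation of frame i.
    """
    speed = np.asarray(speed, dtype=float)
    direction = np.asarray(direction, dtype=float)
    return np.concatenate([speed, direction], axis=-1)
```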

[0106] The content described in step 304 above explains the reason for the difference in sampling frequency when sampling the first and second video frames in step 301. With a small sampling interval, there is almost no difference between the preceding and following video frames, and the movement differences of the joints are negligible. Therefore, reducing the sampling frequency and increasing the sampling interval between the second video frames when acquiring them better reflects the movement of the joints.

[0107] In related technologies, a 3D human pose is typically represented as the 3D coordinates of each joint on the x, y, and z axes of a spatial coordinate system. Because this representation focuses on the pose itself, it is also called a pose-space representation; the initial 3D poses in the above steps are in pose space. However, this representation lacks an explicit description of some motion characteristics, such as the velocity and direction of joint motion. To find a more suitable representation of the target object's motion information, the dynamic trajectories of joints, which carry motion context information, are considered. Referring to Figure 5 and taking a person as an example, Figure 5 marks the person's right wrist joint (upper part of Figure 5) and left ankle joint (lower part of Figure 5), and shows the trajectories of the two joints during a running motion by plotting the coordinate displacement of the joints at different times. In step 304 above, the continuous motion displacement of each joint is calculated using the time difference method, and the velocity and direction of joint motion are then obtained by decomposing the displacement. This representation is also known as a trajectory-space representation.

[0108] 305. Based on the trajectory information of each joint in the multiple second video frames, the server adjusts the initial adjacency matrix of the target object to obtain the target adjacency matrix of the target object. The initial adjacency matrix is used to represent the connection relationship between the multiple joints through the skeleton.

[0109] That is, in addition to joints, the target object also has a skeleton. The skeleton connects the joints, and the initial adjacency matrix represents which joints are connected by the skeleton.

[0110] In one possible implementation, the server determines the trajectory similarity between each joint point based on the trajectory information of each joint point in the plurality of second video frames. Based on the trajectory similarity between the joint points, the server adjusts the initial adjacency matrix of the target object to obtain the target adjacency matrix of the target object.

[0111] In this context, similar motion patterns mean similar motion trajectories. The initial adjacency matrix is adjusted to supplement joint connection information beyond the skeletal connections. In other words, although some joints are not explicitly connected by bones, their implicit motion information is closely related. For example, during walking, the left hand joint and the right hand joint, and likewise the left ankle joint and the right ankle joint, have extremely similar motion patterns. Therefore, the connection between the left hand joint and the right hand joint and the connection between the left ankle joint and the right ankle joint can be added to the initial adjacency matrix to obtain the target adjacency matrix.

[0112] To provide a clearer explanation of the above embodiments, the following description is divided into several parts.

[0113] Part 1: The server determines the trajectory similarity between various key points based on the trajectory information of each key point in the multiple second video frames.

[0114] In one possible implementation, the server extracts features from the trajectory information of each joint point in the multiple second video frames to obtain the spatiotemporal trajectory features of the multiple joint points. The server then performs convolution, linear transformation, and nonlinear transformation on the spatiotemporal trajectory features to obtain the motion features of each joint point. Based on the motion features of each joint point, the server determines the trajectory similarity between the joint points.

[0115] Since there is a certain error between the initial 3D pose and the actual 3D pose, the trajectory information calculated from the inaccurate initial 3D pose is very likely to show abrupt increases or decreases. Therefore, to make the trajectory information smoother, it is first fine-tuned, that is, the aforementioned feature extraction is performed to obtain the spatiotemporal trajectory features. The purpose of determining the trajectory similarity between joints is to find joints with similar motion, that is, at least one pair of joints whose trajectory similarity reaches a similarity threshold, so that connections between those joints can be added. The similarity threshold is set by the technician according to the actual situation, and the embodiments of this application do not limit it.

[0116] To provide a clearer explanation of the above embodiments, the following description will be divided into several parts.

[0117] A. The server extracts features from the trajectory information of each key point in the multiple second video frames to obtain the spatiotemporal trajectory features of the multiple key points.

[0118] In one possible implementation, the server performs graph convolution on the trajectory information of each keypoint in the plurality of second video frames to obtain the spatial trajectory features of each keypoint in each second video frame. The server concatenates the spatial trajectory features of each keypoint in each second video frame with the position embedding features of each second video frame in the plurality of second video frames to obtain reference keypoint features for each keypoint. The server encodes the reference keypoint features of each keypoint based on an attention mechanism to obtain the spatiotemporal trajectory features of the plurality of keypoints.

[0119] For example, the server performs a linear transformation on the trajectory information of each joint point in the multiple second video frames using the following formula (5) to obtain the high-dimensional trajectory features of each joint point in the multiple second video frames. The linear transformation is implemented through a fully connected layer. The server then performs graph convolution on the high-dimensional trajectory features using the following formula (5) to obtain the spatial trajectory features of each joint point in each second video frame.

[0120]

[0121] In formula (5), the output of layer $l$ is the spatial trajectory features of each keypoint in second video frame $i$ after $l$ layers of graph convolution, and the output of the last layer is the final spatial trajectory features. The input to the first layer is the high-dimensional trajectory features of each keypoint in second video frame $i$. Here $l$ is the index of the graph convolution layer and $L_1$ is the total number of graph convolution layers; $\tilde{A}$ is the sum of the initial adjacency matrix $A$ of the multiple joints and the identity matrix $I$; $\tilde{D}$ is the degree matrix of $\tilde{A}$; $W^{(l)}$ is the trainable weight matrix of the $l$-th graph convolution layer; and $\sigma(\cdot)$ is the non-linear activation function.
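A single graph convolution layer of the kind described by these symbols can be sketched as below. The sketch assumes the widely used symmetric normalization $\tilde{D}^{-1/2}\tilde{A}\tilde{D}^{-1/2}$ and a ReLU activation; both are assumptions, as are the function names `normalized_adjacency` and `gcn_layer`.

```python
import numpy as np


def normalized_adjacency(adj):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a (J, J) adjacency matrix."""
    a_tilde = np.asarray(adj, dtype=float) + np.eye(len(adj))
    degree = a_tilde.sum(axis=1)          # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(degree)    # degree >= 1 due to self-loops
    return a_tilde * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def gcn_layer(features, adj, weight):
    """One graph convolution layer with an assumed ReLU activation.

    features: (J, C_in) per-joint features of one video frame
    weight:   (C_in, C_out) trainable weight matrix of this layer
    """
    aggregated = normalized_adjacency(adj) @ features @ weight
    return np.maximum(aggregated, 0.0)
```

Stacking several such layers, each with its own weight matrix, gives the multi-layer form in which the output of the last layer is taken as the final feature.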

[0122] The server uses the following formula (6) to concatenate the spatial trajectory features of each joint point in each second video frame with the position embedding features of each second video frame in the multiple second video frames to obtain the reference joint point features of each joint point.

[0123]

[0124] In formula (6), the left-hand side represents the reference joint features, the two concatenated terms are the spatial trajectory features of each joint in second video frame $i$ and the position embedding feature of second video frame $i$, and Concat{·} represents the concatenation function.

[0125] The server encodes the reference joint features of each joint point based on the attention mechanism using the following formulas (7)-(10) to obtain the spatiotemporal trajectory features of the multiple joint points.

[0126]

[0127]

[0128] In formulas (7)-(10), the first initial trajectory features are those produced by the multi-head self-attention mechanism, at layer 0 and at layer $l$ respectively; the second initial trajectory features are those obtained after the linear and non-linear transformations; and the final output is the spatiotemporal trajectory features of the multiple key points. MSA(·) represents the multi-head self-attention mechanism, MLP(·) represents the multilayer perceptron, and LN(·) represents layer normalization.
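The attention-based encoding can be illustrated with a simplified encoder layer. To stay short, the sketch uses single-head attention instead of the multi-head mechanism, applies normalization before each sub-block, and uses ReLU inside the perceptron; all of these, together with the function names and the `params` dictionary keys, are assumptions rather than the exact form of formulas (7)-(10).

```python
import numpy as np


def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and unit variance (no learned scale)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)


def softmax(x):
    """Numerically stable softmax over the last axis."""
    shifted = x - x.max(axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=-1, keepdims=True)


def self_attention(x, wq, wk, wv):
    """Single-head scaled dot-product self-attention over a (N, C) sequence."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def encoder_layer(x, params):
    """Attention sub-block and MLP sub-block, each with a residual connection."""
    x = x + self_attention(layer_norm(x), params["wq"], params["wk"], params["wv"])
    hidden = np.maximum(layer_norm(x) @ params["w1"], 0.0)
    return x + hidden @ params["w2"]
```

The residual connections keep the output the same shape as the input, so several such layers can be stacked before a final normalization.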

[0129] B. The server performs convolution, linear transformation, and nonlinear transformation on the spatiotemporal trajectory features to obtain the motion features of each joint.

[0130] In one possible implementation, the server convolves the spatiotemporal trajectory features using the following formula (11) and performs linear and nonlinear transformations using the following formula (12).

[0131]

[0132]

[0133] Where conv represents the convolution operation, the output of formula (11) is the spatiotemporal trajectory features after convolution, and $M$ represents the motion features of each joint.

[0134] C. The server determines the trajectory similarity between each joint based on the motion characteristics of each joint.

[0135] In one possible implementation, the server determines the trajectory similarity between each joint using the following formula (13).

[0136]

[0137] Where $S_{ij}$ represents the similarity between the motion features of joint $i$ and joint $j$, that is, the trajectory similarity between joint $i$ and joint $j$; $m_i$ represents the motion feature of joint $i$; $m_j$ represents the motion feature of joint $j$; and cos<·,·> computes the cosine similarity between two motion features.
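Pairwise cosine similarity between joint motion features can be computed in one matrix product, as in the following sketch. The function name `trajectory_similarity` and the `(J, C)` feature layout are assumptions for illustration.

```python
import numpy as np


def trajectory_similarity(motion, eps=1e-8):
    """Cosine similarity between every pair of joint motion features.

    motion: array of shape (J, C), one motion feature vector per joint.
    Returns a symmetric (J, J) matrix whose entry [i, j] is cos<m_i, m_j>.
    """
    motion = np.asarray(motion, dtype=float)
    norms = np.linalg.norm(motion, axis=1, keepdims=True)
    unit = motion / np.maximum(norms, eps)  # guard against zero vectors
    return unit @ unit.T
```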

[0138] The second part involves the server adjusting the initial adjacency matrix of the target object based on the trajectory similarity between each key point, thereby obtaining the target adjacency matrix of the target object.

[0139] In one possible implementation, the server fuses the trajectory similarity between each key point with the initial adjacency matrix to obtain the target adjacency matrix of the target object.

[0140] For example, the server obtains the target adjacency matrix using the following formula (14).

[0141]

[0142] In formula (14), the left-hand side represents the target adjacency matrix, $A$ represents the initial adjacency matrix, and $S$ represents the trajectory similarity between the key points.
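One possible fusion rule is sketched below. It is an assumption and not a statement of formula (14): similarities that reach a threshold are kept as extra edge weights, and the result is the element-wise maximum of the skeletal adjacency and those extra edges. The function name `fuse_adjacency` and the default threshold 0.9 are likewise hypothetical.

```python
import numpy as np


def fuse_adjacency(adj, sim, threshold=0.9):
    """Add motion-based edges to a skeletal adjacency matrix.

    adj: (J, J) initial adjacency matrix (1 where bones connect joints)
    sim: (J, J) trajectory similarity between joints
    Assumed rule: keep similarities >= threshold as new edge weights and
    take the element-wise maximum with the skeletal edges.
    """
    adj = np.asarray(adj, dtype=float)
    sim = np.asarray(sim, dtype=float)
    extra = np.where(sim >= threshold, sim, 0.0)
    np.fill_diagonal(extra, 0.0)  # self-similarity is not a new edge
    return np.maximum(adj, extra)
```

With this rule, two joints that are not connected by a bone but move alike, such as the two ankles during walking, receive a non-zero entry in the resulting matrix.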

[0143] 306. The server extracts features from the multiple first initial 3D poses to obtain the initial 3D pose features of each first initial 3D pose.

[0144] In one possible implementation, the server performs embedding encoding on each first initial 3D pose to obtain pose embedding features for each first initial 3D pose. Based on the initial adjacency matrix, the server performs graph convolution on the pose embedding features of each first initial 3D pose to obtain initial 3D pose features for each first initial 3D pose.

[0145] The purpose of embedding and encoding the first initial three-dimensional pose is to transform the first initial three-dimensional pose into high-dimensional features.

[0146] For example, the server uses the following formula (15) to perform graph convolution on the pose embedding features of each first initial three-dimensional pose based on the initial adjacency matrix to obtain the initial three-dimensional pose features of each first initial three-dimensional pose.

[0147]

[0148] In formula (15), the left-hand side represents the initial three-dimensional pose features corresponding to first video frame $i$; the input is the pose embedding features of the first initial three-dimensional pose corresponding to first video frame $i$; $W$ represents the trainable weight matrix; $\tilde{A}$ represents the sum of the initial adjacency matrix $A$ of the multiple joints and the identity matrix $I$; and $\tilde{D}$ represents the degree matrix of $\tilde{A}$.

[0149] It should be noted that the above formula (15) is illustrated by using a single-layer graph convolution to process the pose embedding features of the first initial three-dimensional pose as an example. In other possible implementations, in order to enhance the expressive power of the initial three-dimensional pose features, the pose embedding features of the first initial three-dimensional pose can also be processed by multi-layer graph convolution. The implementation method is shown in the following formula (16).

[0150]

[0151] In formula (16), the output of layer $l$ is the initial three-dimensional pose feature corresponding to first video frame $i$ after $l$ layers of graph convolution, and the feature output by the last graph convolution layer is used as the initial three-dimensional pose feature corresponding to first video frame $i$. $W^{(l)}$ represents the trainable weight matrix of the $l$-th graph convolution layer, and L1 represents the number of first video frames.

[0152] 307. Based on the target adjacency matrix, the initial three-dimensional pose features of each first initial three-dimensional pose, and the trajectory similarity between each joint, the server determines the pose features of the target object in the third video frame, which is the intermediate frame of the multiple first video frames.

[0153] In this context, the intermediate frame is the video frame located in the middle position. For example, if there are 5 video frames, the intermediate frame refers to the 3rd video frame. In this embodiment, the final pose estimation obtains the target's 3D pose from the intermediate frame; the other video frames are used to provide additional information to assist in determining the target's 3D pose from the intermediate frame. Pose features can be viewed as an abstract expression of pose, and the target's 3D pose can be determined using pose features.
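Selecting the intermediate frame can be sketched as follows. The function name `middle_frame` is hypothetical, and the behaviour for an even number of frames (taking the later of the two central frames) is an assumption, since the text only gives an odd-length example.

```python
def middle_frame(frames):
    """Return the frame at the middle position of a chronological list.

    For an odd number of frames this is the exact centre; for an even
    number, this sketch assumes the later of the two central frames.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    return frames[len(frames) // 2]


# With 5 first video frames, the intermediate frame is the 3rd one.
third_video_frame = middle_frame(["f1", "f2", "f3", "f4", "f5"])
```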

[0154] In one possible implementation, the server fuses the target adjacency matrix, the initial 3D pose features of each first initial 3D pose, and the trajectory similarity between each keypoint to obtain fused pose features of each first initial 3D pose. The server concatenates the fused pose features of each first initial 3D pose with the position embedding features of each first video frame in the plurality of first video frames to obtain concatenated pose features of each first initial 3D pose. The server encodes the concatenated pose features of each first initial 3D pose based on an attention mechanism to obtain the pose features of the target object in the third video frame.

[0155] To provide a clearer explanation of the above embodiments, the following description will be divided into several parts.

[0156] The first part involves the server fusing the target adjacency matrix, the initial 3D pose features of each first initial 3D pose, and the trajectory similarity between the joints to obtain the fused pose features of each first initial 3D pose.

[0157] In one possible approach, the server obtains the fused pose features of each first initial three-dimensional pose using the following formula (17).

[0158]

[0159] Where $X_M$ represents the fused pose features, $X_S$ represents the multiple first initial three-dimensional poses, $S$ represents the trajectory similarity between the joints, and the remaining two terms in formula (17) are the target adjacency matrix and the weight matrix.

[0160] The second part involves the server concatenating the fused pose features of each first initial 3D pose with the position embedding features of each first video frame within the multiple first video frames to obtain the concatenated pose features of each first initial 3D pose.

[0161] In one possible implementation, the server obtains the concatenated pose features of each first initial three-dimensional pose using the following formula (18).

[0162]

[0163] In formula (18), the left-hand side represents the concatenated pose features, the two concatenated terms are the fused pose feature of each first initial 3D pose and the position embedding feature of the corresponding first video frame, and $f$ represents the number of first video frames.

[0164] The third part involves the server encoding the concatenated pose features of each first initial 3D pose based on an attention mechanism to obtain the pose features of the target object in the third video frame.

[0165] The Transformer network is good at modeling long-range dependencies in an input sequence, which suits optimization over long temporal sequences. In the embodiments of this application, a Transformer encoder is used to fine-tune the concatenated pose features along the temporal dimension, thereby enhancing the expressive power of the features.

[0166] In one possible implementation, the server encodes the stitched pose features of each first initial three-dimensional pose based on the attention mechanism using the following formulas (19)-(22) to obtain the pose features of the target object in the third video frame.

[0167]

[0168] Where $X_T$ represents the pose features of the target object in the third video frame. The first initial pose features are those produced by the attention mechanism, at layer 0 and at layer $l$ respectively, and the second initial pose features are those obtained after the linear and non-linear transformations.

[0169] The process described in steps 305-307 above can be represented by Figure 6.

[0170] 308. Based on the pose characteristics of the target object in the third video frame, the server determines the target object's 3D pose in the third video frame.

[0171] In one possible implementation, the server performs convolution, linear transformation, and nonlinear transformation on the pose features of the target object in the third video frame to obtain the positions of multiple joints of the target object in the third video frame. The positions of the multiple joints in the third video frame are used to represent the three-dimensional pose of the target.

[0172] The technical solution provided in this application follows the "sequence-to-frame (seq2frame)" paradigm, which estimates the target 3D pose in the intermediate frame by combining information from the preceding and following frames, and therefore requires considering the information of the whole sequence. Accordingly, the pose features are convolved to reduce their dimensionality, and a multilayer perceptron (linear transformation and non-linear transformation) then regresses the target 3D pose in the intermediate frame.

[0173] For example, the server performs convolution on the pose features using the following formula (23), and performs linear and nonlinear transformations and regressions using the following formula (24).

[0174]

[0175] In formulas (23) and (24), the output of formula (23) is the pose features after convolution, that is, the pose features after dimensionality reduction, and the output of formula (24) is the positions of the multiple joints in the third video frame, that is, the target three-dimensional pose.

[0176] The technical solutions described in steps 304-308 above can be implemented using a pose recognition model; see Figure 7. The pose recognition model 700 includes a spatial fine-tuning module 701, a trajectory space module 702, a motion graph convolution module 703, a temporal fine-tuning module 704, and a regression head 705.

[0177] Specifically, the spatial fine-tuning module 701 executes step 306 above, the trajectory space module 702 executes step 304 above, the motion graph convolution module 703 executes the fusing of pose features in steps 305 and 307 above, the temporal fine-tuning module 704 executes the obtaining of pose features in steps 30 and 307 above, and the regression head 705 executes step 308 above.

[0178] In some embodiments, the pose recognition model is obtained through supervised training, and the loss function for training the pose recognition model is given in the following formula (25).

[0179]

[0180] Where Loss represents the loss function, $J$ represents the number of training samples, $y_i$ represents the labeled three-dimensional pose, and the other term in formula (25) represents the estimated three-dimensional pose.
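A supervised loss of this general kind can be sketched as below. The sketch assumes that the loss is the mean Euclidean distance between estimated and labeled joint positions, a common choice for 3D pose regression; this specific form and the function name `mean_joint_error` are assumptions and may differ from formula (25).

```python
import numpy as np


def mean_joint_error(estimated, labeled):
    """Mean Euclidean distance between estimated and labeled 3D positions.

    Both inputs have shape (..., 3); the distance is averaged over all
    leading dimensions (for example, samples and joints).
    """
    estimated = np.asarray(estimated, dtype=float)
    labeled = np.asarray(labeled, dtype=float)
    distances = np.linalg.norm(estimated - labeled, axis=-1)
    return float(distances.mean())
```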

[0181] All of the above-mentioned optional technical solutions can be combined in any way to form the optional embodiments of this application, and will not be described in detail here.

[0182] The technical solution provided in this application obtains multiple first initial 3D poses of a target object in multiple first video frames and multiple second initial 3D poses in multiple second video frames. Based on the initial 3D positions of multiple joints of the target object in the multiple second initial 3D poses, the trajectory information of each joint in the multiple second video frames is determined, thereby extracting the motion trajectory of the joints. Based on the trajectory information of each joint in the multiple second video frames, the initial adjacency matrix of the target object is adjusted to obtain the target adjacency matrix, thereby supplementing the connection relationship between multiple joints. Based on the target adjacency matrix, the trajectory information of each joint, and the multiple first initial 3D poses, the pose features of the target object in the intermediate frames of the multiple first video frames, that is, the pose features of the third video frame, are determined. Based on the pose features of the target object in the third video frame, the target 3D pose of the target object in the intermediate frames is determined, thereby achieving high-precision pose estimation of the target object.

[0183] Figure 8 is a schematic structural diagram of a pose estimation device provided in an embodiment of this application. Referring to Figure 8, the device includes: an acquisition module 801, a trajectory information determination module 802, an adjustment module 803, a pose feature extraction module 804, and a target three-dimensional pose determination module 805.

[0184] The acquisition module 801 is used to acquire multiple first initial three-dimensional poses of the target object in multiple first video frames and multiple second initial three-dimensional poses in multiple second video frames. The multiple first video frames and the multiple second video frames are obtained by sampling from the video frame set of the target object in time order at intervals, and the sampling frequency of the multiple second video frames is less than that of the multiple first video frames. One initial three-dimensional pose corresponds to one video frame.
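The interval sampling described above can be sketched as follows. The intervals 3 and 9, the 27-frame set, and the helper name `sample_at_interval` are hypothetical values chosen for illustration; the application only requires that the second video frames be sampled at a lower frequency than the first video frames.

```python
def sample_at_interval(frames, interval):
    """Keep every `interval`-th frame, preserving time order."""
    if interval < 1:
        raise ValueError("interval must be at least 1")
    return frames[::interval]

# Stand-in for a time-ordered video frame set (frame indices only).
frames = list(range(27))

# Assumed intervals: the second set uses a larger interval, so it is
# sampled at a lower frequency than the first set.
first_frames = sample_at_interval(frames, 3)
second_frames = sample_at_interval(frames, 9)

# The third video frame is the middle frame of the first video frames.
third_frame = first_frames[len(first_frames) // 2]
```

With these assumed values, the first set holds nine frames, the second set holds three, and the third video frame is frame 12.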

[0185] The trajectory information determination module 802 is used to determine the trajectory information of each joint in the multiple second video frames based on the initial three-dimensional positions of multiple joints of the target object in the multiple second initial three-dimensional poses. The trajectory information is used to represent the motion of the joint.

[0186] The adjustment module 803 is used to adjust the initial adjacency matrix of the target object based on the trajectory information of each joint in the multiple second video frames, so as to obtain the target adjacency matrix of the target object. The initial adjacency matrix is used to represent the connection relationship of the multiple joints connected by the skeleton.

[0187] The pose feature extraction module 804 is used to determine the pose features of the target object in the third video frame based on the target adjacency matrix, the trajectory similarity between the joints, and the multiple first initial three-dimensional poses, wherein the third video frame is the intermediate frame of the multiple first video frames.

[0188] The target 3D pose determination module 805 is used to determine the target 3D pose of the target object in the third video frame based on the pose features of the target object in the third video frame.

[0189] In one possible implementation, the trajectory information determination module 802 is used to determine the displacement of the multiple joints in each of the second video frames based on the initial three-dimensional positions of the multiple joints of the target object in the multiple second initial three-dimensional poses. This displacement represents the positional change of the joints in adjacent video frames. Based on the displacement of the multiple joints in each of the second video frames, the velocity and orientation information of each joint in each of the second video frames are determined. The velocity and orientation information of each joint in each of the second video frames are then concatenated to obtain the trajectory information of each joint in the multiple second video frames.
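A minimal sketch of this computation is given below. It assumes that velocity is the displacement magnitude divided by the time step `dt`, that orientation is the unit vector of the displacement, and that the two are concatenated per joint and per frame. These definitions and the name `trajectory_info` are assumptions for illustration; the application does not fix them in this passage.

```python
import math

def trajectory_info(positions, dt=1.0):
    """Build per-joint trajectory information from sampled 3D positions.

    positions[t][j] is the (x, y, z) position of joint j in sampled frame t.
    Returns info[t][j] = [speed, ux, uy, uz] for each pair of adjacent
    frames, where (ux, uy, uz) is the unit direction of the displacement.
    """
    info = []
    for prev, curr in zip(positions, positions[1:]):
        frame = []
        for p, c in zip(prev, curr):
            disp = [b - a for a, b in zip(p, c)]           # displacement
            norm = math.sqrt(sum(d * d for d in disp))
            # A joint that does not move has no defined direction.
            direction = [d / norm for d in disp] if norm > 0 else [0.0, 0.0, 0.0]
            frame.append([norm / dt] + direction)          # concatenation
        info.append(frame)
    return info

# Two sampled frames, two joints: joint 0 moves, joint 1 stays still.
positions = [
    [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
    [(3.0, 4.0, 0.0), (1.0, 0.0, 0.0)],
]
info = trajectory_info(positions)
```

Here joint 0 moves by (3, 4, 0), giving a speed of 5 and a direction of (0.6, 0.8, 0), while the stationary joint 1 yields all zeros.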

[0190] In one possible implementation, the adjustment module 803 is used to determine the trajectory similarity between each joint point based on the trajectory information of each joint point in the plurality of second video frames. Based on the trajectory similarity between the joint points, the initial adjacency matrix of the target object is adjusted to obtain the target adjacency matrix of the target object.
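One way to picture this adjustment is sketched below. It assumes that trajectory similarity is the cosine similarity between per-joint motion feature vectors, and that joints not connected by the skeleton receive an edge when their similarity reaches a threshold. The cosine measure, the threshold of 0.9, and the names `cosine` and `adjust_adjacency` are assumptions for illustration and are not specified by the application.

```python
import math

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either is all zeros."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0 or nv == 0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def adjust_adjacency(initial, motion, threshold=0.9):
    """Add similarity-weighted edges between unconnected joints.

    initial: skeleton adjacency matrix (with self-loops).
    motion: one motion feature vector per joint.
    The input matrix is left unchanged; a new matrix is returned.
    """
    n = len(initial)
    target = [row[:] for row in initial]
    for i in range(n):
        for j in range(n):
            if i != j and initial[i][j] == 0:
                s = cosine(motion[i], motion[j])
                if s >= threshold:
                    target[i][j] = s
    return target

# A three-joint chain 0-1-2; joints 0 and 2 move in the same direction.
initial = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
motion = [[1.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
target = adjust_adjacency(initial, motion)
```

In this example joints 0 and 2 are not linked by the skeleton but move identically, so the target adjacency matrix gains an edge between them while the skeleton edges are kept.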

[0191] In one possible implementation, the adjustment module 803 is used to extract features from the trajectory information of each joint point in the plurality of second video frames to obtain the spatiotemporal trajectory features of the plurality of joint points. Convolution, linear transformation, and nonlinear transformation are performed on the spatiotemporal trajectory features to obtain the motion features of each joint point. Based on the motion features of each joint point, the trajectory similarity between the joint points is determined.

[0192] In one possible implementation, the adjustment module 803 is used to perform graph convolution on the trajectory information of each joint point in the plurality of second video frames to obtain the spatial trajectory features of each joint point in each of the second video frames. The spatial trajectory features of each joint point in each of the second video frames are concatenated with the position embedding features of each second video frame in the plurality of second video frames to obtain reference joint point features for each joint point. The reference joint point features of each joint point are then encoded based on an attention mechanism to obtain the spatiotemporal trajectory features of the plurality of joint points.
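The concatenation and attention-based encoding steps can be sketched as follows for a single joint. The sketch uses scaled dot-product self-attention with identity query, key, and value projections and one-hot position embeddings. Both simplifications, and the names `concat_position` and `self_attention`, are assumptions for illustration; a trained model would use learned projections and embeddings.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def concat_position(features, embeddings):
    """Concatenate each frame's feature with its position embedding."""
    return [f + e for f, e in zip(features, embeddings)]

def self_attention(tokens):
    """Scaled dot-product self-attention with identity projections."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        weights = softmax(scores)
        out.append([sum(w * v[c] for w, v in zip(weights, tokens)) for c in range(d)])
    return out

# Spatial trajectory features of one joint in three second video frames.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Hypothetical one-hot position embeddings, one per frame.
embeddings = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]

reference = concat_position(features, embeddings)
encoded = self_attention(reference)
```

Each encoded vector is a weighted average of the reference features of all frames, which is how the temporal context of a joint enters its spatiotemporal trajectory feature.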

[0193] In one possible implementation, the pose feature extraction module 804 is used to extract features from the plurality of first initial 3D poses to obtain initial 3D pose features for each of the first initial 3D poses. Based on the target adjacency matrix, the initial 3D pose features of each of the first initial 3D poses, and the trajectory similarity between the joints, the pose features of the target object in the third video frame are determined.

[0194] In one possible implementation, the pose feature extraction module 804 is used to embed and encode each of the first initial 3D poses to obtain pose embedding features of each of the first initial 3D poses. Based on the initial adjacency matrix, graph convolution is performed on the pose embedding features of each of the first initial 3D poses to obtain initial 3D pose features of each of the first initial 3D poses.
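A minimal sketch of one graph-convolution layer over the skeleton is shown below. It assumes row normalization of the adjacency matrix and a ReLU nonlinearity, that is, ReLU(D^-1 A X W). The normalization, the activation, the identity weight matrix, and the name `graph_conv` are assumptions for illustration; the application does not specify the layer's exact form.

```python
def graph_conv(adjacency, features, weight):
    """One graph-convolution layer: ReLU(D^-1 A X W), row-normalized.

    adjacency: n x n matrix (self-loops included).
    features: n x d_in matrix, one embedding per joint.
    weight: d_in x d_out matrix.
    """
    n = len(adjacency)
    d_in, d_out = len(weight), len(weight[0])
    out = []
    for i in range(n):
        degree = sum(adjacency[i]) or 1  # avoid dividing by zero
        # Average the features of joint i's neighbors.
        agg = [sum(adjacency[i][j] * features[j][c] for j in range(n)) / degree
               for c in range(d_in)]
        # Linear transformation followed by ReLU.
        out.append([max(0.0, sum(agg[c] * weight[c][k] for c in range(d_in)))
                    for k in range(d_out)])
    return out

# Three-joint chain 0-1-2 with self-loops.
adjacency = [[1, 1, 0], [1, 1, 1], [0, 1, 1]]
# Hypothetical pose embedding features, one row per joint.
embedded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Identity weights keep the example easy to check by hand.
weight = [[1.0, 0.0], [0.0, 1.0]]
pose_features = graph_conv(adjacency, embedded, weight)
```

With identity weights, each joint's output is simply the average of its own embedding and those of its skeleton neighbors, for example (0.5, 0.5) for joint 0.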

[0195] In one possible implementation, the pose feature extraction module 804 is used to fuse the target adjacency matrix, the initial 3D pose features of each of the first initial 3D poses, and the trajectory similarity between the joints to obtain fused pose features of each of the first initial 3D poses. The fused pose features of each of the first initial 3D poses are then concatenated with the position embedding features of each of the first video frames in the plurality of first video frames to obtain concatenated pose features of each of the first initial 3D poses. Finally, the concatenated pose features of each of the first initial 3D poses are encoded based on an attention mechanism to obtain the pose features of the target object in the third video frame.

[0196] In one possible implementation, the target 3D pose determination module 805 is used to perform convolution, linear transformation and nonlinear transformation on the pose features of the target object in the third video frame to obtain the positions of multiple joints of the target object in the third video frame. The positions of the multiple joints in the third video frame are used to represent the 3D pose of the target.
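The linear and nonlinear transformations of such a regression step can be sketched as a small two-layer mapping from one joint's pose feature to its three coordinates. The convolution stage is omitted, and the layer sizes, the tanh nonlinearity, the weight values, and the names `linear` and `regress_joint` are assumptions for illustration rather than details from the application.

```python
import math

def linear(x, weight, bias):
    """y = W x + b, with `weight` given as a list of output rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def regress_joint(feature, w1, b1, w2, b2):
    """Map one joint's pose feature to an (x, y, z) position."""
    hidden = [math.tanh(h) for h in linear(feature, w1, b1)]  # nonlinear step
    return linear(hidden, w2, b2)                             # output layer

# Hypothetical parameters: 2-d feature -> 2-d hidden -> 3-d position.
w1 = [[1.0, 0.0], [0.0, 1.0]]
b1 = [0.0, 0.0]
w2 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
b2 = [0.1, 0.2, 0.3]

position = regress_joint([0.0, 0.0], w1, b1, w2, b2)
```

For a zero feature the hidden activations are zero, so the predicted position equals the output bias (0.1, 0.2, 0.3). Applying the same mapping to every joint gives the full set of joint positions that represents the pose.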

[0197] In one possible implementation, the acquisition module 801 is used to perform two-dimensional pose estimation on the plurality of first video frames and the plurality of second video frames to obtain the first two-dimensional pose of the target object in each of the first video frames and the second two-dimensional pose in each of the second video frames.

[0198] The first two-dimensional pose of each first video frame and the second two-dimensional pose of each second video frame are lifted to three dimensions to obtain multiple first initial three-dimensional poses of the target object in multiple first video frames and multiple second initial three-dimensional poses in multiple second video frames.

[0199] It should be noted that the pose estimation device provided in the above embodiments is illustrated only by the division into the above functional modules. In practical applications, the above functions can be assigned to different functional modules as needed; that is, the internal structure of the computer device can be divided into different functional modules to complete all or part of the functions described above. In addition, the pose estimation device and the pose estimation method provided in the above embodiments belong to the same concept; for the specific implementation process of the device, refer to the method embodiments, which will not be repeated here.

[0200] The technical solution provided in this application acquires multiple first initial 3D poses of a target object in multiple first video frames and multiple second initial 3D poses in multiple second video frames. Based on the initial 3D positions of multiple joints of the target object in the multiple second initial 3D poses, the trajectory information of each joint in the multiple second video frames is determined, thereby extracting the motion trajectories of the joints. Based on this trajectory information, the initial adjacency matrix of the target object is adjusted to obtain the target adjacency matrix, thereby supplementing the connection relationships between the joints. Based on the target adjacency matrix, the trajectory information of each joint, and the multiple first initial 3D poses, the pose features of the target object in the intermediate frame of the multiple first video frames, that is, the third video frame, are determined. Based on these pose features, the target 3D pose of the target object in the intermediate frame is determined, thereby achieving high-precision pose estimation of the target object.

[0201] This application provides a computer device for performing the above-described method. This computer device can be implemented as a terminal or a server. The structure of the terminal will be described below:

[0202] Figure 9 is a schematic structural diagram of a terminal provided in an embodiment of this application. The terminal 900 can be a smartphone, tablet computer, laptop computer, or desktop computer. The terminal 900 may also be referred to as user equipment, portable terminal, laptop terminal, desktop terminal, or other names.

[0203] Typically, terminal 900 includes one or more processors 901 and one or more memories 902.

[0204] Processor 901 may include one or more processing cores, such as a quad-core processor or an octa-core processor. Processor 901 may be implemented in at least one hardware form selected from a DSP (Digital Signal Processor), an FPGA (Field-Programmable Gate Array), and a PLA (Programmable Logic Array). Processor 901 may also include a main processor and a coprocessor. The main processor, also known as a CPU (Central Processing Unit), processes data in the awake state; the coprocessor is a low-power processor that processes data in the standby state. In some embodiments, processor 901 may integrate a GPU (Graphics Processing Unit), which is responsible for rendering and drawing the content to be displayed on the screen. In some embodiments, processor 901 may also include an AI (Artificial Intelligence) processor, which handles computational operations related to machine learning.

[0205] The memory 902 may include one or more computer-readable storage media, which may be non-transitory. The memory 902 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash memory devices. In some embodiments, the non-transitory computer-readable storage media in the memory 902 are used to store at least one computer program, which is executed by the processor 901 to implement the pose estimation method provided in the method embodiments of this application.

[0206] In some embodiments, the terminal 900 may also optionally include a peripheral device interface 903 and at least one peripheral device. The processor 901, memory 902, and peripheral device interface 903 can be connected via a bus or signal line. Each peripheral device can be connected to the peripheral device interface 903 via a bus, signal line, or circuit board. Specifically, the peripheral device includes at least one of the following: a radio frequency circuit 904, a display screen 905, a camera assembly 906, an audio circuit 907, and a power supply 908.

[0207] Peripheral device interface 903 can be used to connect at least one I / O (Input / Output) related peripheral device to processor 901 and memory 902. In some embodiments, processor 901, memory 902 and peripheral device interface 903 are integrated on the same chip or circuit board; in some other embodiments, any one or two of processor 901, memory 902 and peripheral device interface 903 can be implemented on separate chips or circuit boards, which is not limited in this embodiment.

[0208] The radio frequency (RF) circuit 904 is used to receive and transmit RF signals, also known as electromagnetic signals. The RF circuit 904 communicates with communication networks and other communication devices via electromagnetic signals. The RF circuit 904 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals back into electrical signals. Optionally, the RF circuit 904 includes an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a user identity module card, and the like.

[0209] Display screen 905 is used to display a user interface (UI). This UI may include graphics, text, icons, video, and any combination thereof. When display screen 905 is a touch screen, it also has the ability to collect touch signals on or above its surface. These touch signals can be input as control signals to processor 901 for processing. In this case, display screen 905 can also be used to provide virtual buttons and / or a virtual keyboard, also known as soft buttons and / or a soft keyboard.

[0210] The camera assembly 906 is used to capture images or videos. Optionally, the camera assembly 906 includes a front-facing camera and a rear-facing camera. Typically, the front-facing camera is located on the front panel of the terminal, and the rear-facing camera is located on the back of the terminal.

[0211] The audio circuit 907 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, and convert the sound waves into electrical signals that are input to the processor 901 for processing, or input to the radio frequency circuit 904 to realize voice communication.

[0212] The power supply 908 is used to supply power to the various components in the terminal 900. The power supply 908 can be AC power, DC power, a disposable battery, or a rechargeable battery.

[0213] In some embodiments, the terminal 900 further includes one or more sensors 909. The one or more sensors 909 include, but are not limited to, an accelerometer 910, a gyroscope 911, a pressure sensor 912, an optical sensor 913, and a proximity sensor 914.

[0214] Accelerometer 910 can detect the magnitude of acceleration on the three coordinate axes of a coordinate system established with terminal 900.

[0215] The gyroscope 911 can detect the orientation and rotation angle of the terminal 900. The gyroscope 911 can work in conjunction with the accelerometer 910 to capture the user's 3D movements of the terminal 900.

[0216] The pressure sensor 912 can be installed on the side bezel of the terminal 900 and / or on the lower layer of the display screen 905. When the pressure sensor 912 is installed on the side bezel of the terminal 900, it can detect the user's grip signal on the terminal 900, and the processor 901 can perform left / right hand recognition or quick operations based on the grip signal collected by the pressure sensor 912. When the pressure sensor 912 is installed on the lower layer of the display screen 905, the processor 901 can control the operable controls on the UI based on the user's pressure operation on the display screen 905.

[0217] The optical sensor 913 is used to collect ambient light intensity. In one embodiment, the processor 901 can control the display brightness of the display screen 905 based on the ambient light intensity collected by the optical sensor 913.

[0218] The proximity sensor 914 is used to detect the distance between the user and the front of the terminal 900.

[0219] Those skilled in the art will understand that the structure shown in Figure 9 does not constitute a limitation on the terminal 900, which may include more or fewer components than shown, combine certain components, or use a different component arrangement.

[0220] The aforementioned computer equipment can also be implemented as a server. The structure of a server is described below:

[0221] Figure 10 is a schematic structural diagram of a server provided in an embodiment of this application. The server 1000 can vary significantly with configuration or performance. It may include one or more processors (Central Processing Units, CPUs) 1001 and one or more memories 1002. The one or more memories 1002 store at least one computer program, which is loaded and executed by the one or more processors 1001 to implement the methods provided in the various method embodiments described above. The server 1000 may also have wired or wireless network interfaces, a keyboard, and input / output interfaces for input and output. The server 1000 may also include other components for implementing device functions, which will not be elaborated upon here.

[0222] In an exemplary embodiment, a computer-readable storage medium is also provided, such as a memory including a computer program that can be executed by a processor to perform the pose estimation method in the above embodiments. For example, the computer-readable storage medium may be a read-only memory (ROM), a random access memory (RAM), a compact disc read-only memory (CD-ROM), a magnetic tape, a floppy disk, or an optical data storage device.

[0223] In an exemplary embodiment, a computer program product or computer program is also provided, which includes program code stored in a computer-readable storage medium. A processor of a computer device reads the program code from the computer-readable storage medium and executes the program code, causing the computer device to perform the pose estimation method described above.

[0224] In some embodiments, the computer program involved in the present application embodiments may be deployed and executed on a computer device, or executed on multiple computer devices located in one location, or executed on multiple computer devices distributed in multiple locations and interconnected through a communication network. Multiple computer devices distributed in multiple locations and interconnected through a communication network may constitute a blockchain system.

[0225] Those skilled in the art will understand that all or part of the steps of the above embodiments can be implemented by hardware or by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as a read-only memory, a disk, or an optical disk.

[0226] The above are merely optional embodiments of this application and are not intended to limit this application. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of this application should be included within the protection scope of this application.

Claims

1. A pose estimation method, characterized in that the method includes: acquiring multiple first initial three-dimensional poses of a target object in multiple first video frames and multiple second initial three-dimensional poses of the target object in multiple second video frames, wherein the multiple first video frames and the multiple second video frames are obtained by sampling at intervals, in chronological order, from the video frame set of the target object, the sampling frequency of the multiple second video frames is lower than that of the multiple first video frames, one initial three-dimensional pose corresponds to one video frame, and the multiple first video frames and the multiple second video frames are video frames arranged in chronological order; determining, based on the initial three-dimensional positions of multiple joints of the target object in the multiple second initial three-dimensional poses, the trajectory information of each joint in the multiple second video frames, the trajectory information being used to represent the motion of the joints; adjusting, based on the trajectory information of each joint in the multiple second video frames, the initial adjacency matrix of the target object to obtain the target adjacency matrix of the target object, the initial adjacency matrix being used to represent the connection relationship of the multiple joints connected by the skeleton; determining, based on the target adjacency matrix, the trajectory similarity between the joints, and the multiple first initial three-dimensional poses, the pose features of the target object in a third video frame, the third video frame being the intermediate frame of the multiple first video frames; and determining, based on the pose features of the target object in the third video frame, the target three-dimensional pose of the target object in the third video frame; wherein determining the pose features of the target object in the third video frame based on the target adjacency matrix, the initial three-dimensional pose features of each first initial three-dimensional pose, and the trajectory similarity between the joints includes: fusing the target adjacency matrix, the initial three-dimensional pose features of each first initial three-dimensional pose, and the trajectory similarity between the joints to obtain the fused pose features of each first initial three-dimensional pose; concatenating the fused pose features of each first initial three-dimensional pose with the position embedding features of each first video frame in the multiple first video frames to obtain the concatenated pose features of each first initial three-dimensional pose; and encoding the concatenated pose features of each first initial three-dimensional pose based on an attention mechanism to obtain the pose features of the target object in the third video frame.

2. The method according to claim 1, characterized in that determining the trajectory information of each joint in the multiple second video frames based on the initial three-dimensional positions of multiple joints of the target object in the multiple second initial three-dimensional poses includes: determining, based on the initial three-dimensional positions of the multiple joints of the target object in the multiple second initial three-dimensional poses, the displacements of the multiple joints in each second video frame, the displacements being used to represent the positional changes of the joints between adjacent video frames; determining, based on the displacements of the multiple joints in each second video frame, the velocity information and orientation information of each joint in each second video frame; and concatenating the velocity information and orientation information of each joint in each second video frame to obtain the trajectory information of each joint in the multiple second video frames.

3. The method according to claim 1, characterized in that adjusting the initial adjacency matrix of the target object based on the trajectory information of each joint in the multiple second video frames to obtain the target adjacency matrix of the target object includes: determining, based on the trajectory information of each joint in the multiple second video frames, the trajectory similarity between the joints; and adjusting, based on the trajectory similarity between the joints, the initial adjacency matrix of the target object to obtain the target adjacency matrix of the target object.

4. The method according to claim 3, characterized in that determining the trajectory similarity between the joints based on the trajectory information of each joint in the multiple second video frames includes: performing feature extraction on the trajectory information of each joint in the multiple second video frames to obtain the spatiotemporal trajectory features of the multiple joints; performing convolution, linear transformation, and nonlinear transformation on the spatiotemporal trajectory features to obtain the motion features of each joint; and determining, based on the motion features of each joint, the trajectory similarity between the joints.

5. The method according to claim 4, characterized in that performing feature extraction on the trajectory information of each joint in the multiple second video frames to obtain the spatiotemporal trajectory features of the multiple joints includes: performing graph convolution on the trajectory information of each joint in the multiple second video frames to obtain the spatial trajectory features of each joint in each second video frame; concatenating the spatial trajectory features of each joint in each second video frame with the position embedding features of each second video frame in the multiple second video frames to obtain the reference joint features of each joint; and encoding the reference joint features of each joint based on an attention mechanism to obtain the spatiotemporal trajectory features of the multiple joints.

6. The method according to claim 1, characterized in that determining the pose features of the target object in the third video frame based on the target adjacency matrix, the trajectory similarity between the joints, and the multiple first initial three-dimensional poses includes: performing feature extraction on the multiple first initial three-dimensional poses to obtain the initial three-dimensional pose features of each first initial three-dimensional pose; and determining, based on the target adjacency matrix, the initial three-dimensional pose features of each first initial three-dimensional pose, and the trajectory similarity between the joints, the pose features of the target object in the third video frame.

7. The method according to claim 6, characterized in that performing feature extraction on the multiple first initial three-dimensional poses to obtain the initial three-dimensional pose features of each first initial three-dimensional pose includes: performing embedding encoding on each first initial three-dimensional pose to obtain the pose embedding features of each first initial three-dimensional pose; and performing, based on the initial adjacency matrix, graph convolution on the pose embedding features of each first initial three-dimensional pose to obtain the initial three-dimensional pose features of each first initial three-dimensional pose.

8. The method according to claim 1, characterized in that determining the target three-dimensional pose of the target object in the third video frame based on the pose features of the target object in the third video frame includes: performing convolution, linear transformation, and nonlinear transformation on the pose features of the target object in the third video frame to obtain the positions of multiple joints of the target object in the third video frame, the positions of the multiple joints in the third video frame being used to represent the target three-dimensional pose.

9. The method according to claim 1, characterized in that acquiring the multiple first initial three-dimensional poses of the target object in the multiple first video frames and the multiple second initial three-dimensional poses in the multiple second video frames includes: performing two-dimensional pose estimation on the multiple first video frames and the multiple second video frames to obtain the first two-dimensional pose of the target object in each first video frame and the second two-dimensional pose in each second video frame; and lifting the first two-dimensional pose of each first video frame and the second two-dimensional pose of each second video frame to three dimensions to obtain the multiple first initial three-dimensional poses of the target object in the multiple first video frames and the multiple second initial three-dimensional poses in the multiple second video frames.