Human pose estimation methods, devices, terminals, and media based on multimodal perception

CN122290218A · Pending · Publication Date: 2026-06-26 · PEKING UNIV SHENZHEN GRADUATE SCHOOL

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: PEKING UNIV SHENZHEN GRADUATE SCHOOL
Filing Date: 2026-05-27
Publication Date: 2026-06-26

AI Technical Summary

Technical Problem

Existing 3D human pose estimation models lack generalization performance and robustness when faced with cross-domain data or complex motion tasks, mainly due to the use of random selection in the prompt retrieval process, which leads to mismatch in context logic and imbalance in motion patterns.

Method used

The standard dataset is clustered to determine each standard data class and its cluster center, and a context cue pool is constructed from them. A context cue is then generated by matching the embedding of the human motion sequence against the cluster centers, and this cue is used as prior information for in-context learning to determine the 3D human pose.

Benefits of technology

It improves the generalization performance and robustness of the 3D human pose estimation model in cross-domain data and complex motion tasks, ensures the relevance and balance of contextual cues, and enhances the model's task guidance capability.



Abstract

This invention discloses a method, device, terminal, and medium for human pose estimation based on multimodal perception, relating to the field of computer vision. The method extracts a two-dimensional skeleton sequence based on the human motion sequence corresponding to the target object; clusters a standard dataset to determine each standard data class and its corresponding cluster center; determines contextual cues based on the human motion sequence, each standard data class, and each cluster center; and uses the contextual cues as prior information, employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence to determine the corresponding three-dimensional human pose. Because this invention determines contextual cues based on each cluster center, each standard data class, and the human motion sequence after clustering the standard dataset, it effectively solves the problem of insufficient generalization performance and robustness of existing technologies' three-dimensional human pose estimation models when facing cross-domain data or complex motion tasks due to the random selection method used in the cue retrieval stage.

Description

Technical Field

[0001] This invention relates to the field of computer vision, and more particularly to a method, device, terminal, and medium for human pose estimation based on multimodal perception.

Background Technology

[0002] Data processing for 3D human pose estimation is modally diverse, typically including a sequence of skeletal poses consisting of the 3D coordinates of multiple joints, as well as a 3D human body mesh defined by a skinned multi-person linear model, containing vertex, face, pose, and shape parameters.

[0003] To achieve unified processing of multimodal data, the In-Context Learning paradigm has been introduced in recent years. This paradigm combines domain-specific input-target pairs (i.e., hints) with query samples, enabling models to perform task-specific reasoning using example templates without explicit fine-tuning.

[0004] However, existing context learning frameworks often use simple random selection in the prompt retrieval stage, which can easily lead to over- or under-representation of certain motion patterns, or a mismatch between the selected prompts and the query samples in terms of contextual logic. This results in insufficient generalization performance and robustness of 3D human pose estimation models when faced with cross-domain data or complex motion tasks.

[0005] Therefore, existing technologies still need improvement and development.

Summary of the Invention

[0006] The technical problem to be solved by the present invention is to provide a human pose estimation method, device, terminal and medium based on multimodal perception, which addresses the above-mentioned defects of the prior art. The aim is to solve the problem that the existing technology uses a random selection method in the prompt retrieval stage, which makes the generalization performance and robustness of the three-dimensional human pose estimation model insufficient when facing cross-domain data or complex motion tasks.

[0007] The technical solution adopted by this invention to solve the problem is as follows. In a first aspect, embodiments of the present invention provide a human pose estimation method based on multimodal perception, wherein the method includes: obtaining the human motion sequence corresponding to the target object, and extracting a two-dimensional skeleton sequence based on the human motion sequence; clustering the standard dataset to determine each standard data class and its corresponding cluster center; determining contextual cues based on the human motion sequence, the standard data classes, and the cluster centers; and, using the contextual cues as prior information, employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence to determine the three-dimensional human pose corresponding to the target object.

[0008] In one implementation, extracting a two-dimensional skeleton sequence based on the human motion sequence includes: extracting, using a human keypoint detection algorithm, the two-dimensional coordinate information of each joint point in each frame of the human motion sequence; and determining the two-dimensional skeleton sequence corresponding to the human motion sequence based on the preset human joint topology and the two-dimensional coordinate information of each joint point in each frame.

[0009] In one implementation, determining contextual cues based on the human motion sequence, each of the standard data classes, and each of the cluster centers includes: selecting, based on each cluster center, standard data from the corresponding standard data class to construct a prompt pool; performing embedding mapping on the human motion sequence to determine the query embedding; and obtaining contextual cues from the prompt pool based on the query embedding.

[0010] In one implementation, obtaining contextual cues from the prompt pool based on the query embedding includes: calculating the average joint distance between the query embedding and each standard data item in the prompt pool; and using the standard data item corresponding to the minimum average joint distance as the contextual cue.

[0011] In one implementation, determining contextual cues based on the human motion sequence, each of the standard data classes, and the corresponding cluster centers includes: mapping and fusing the human motion sequence and each cluster center to determine the spatiotemporal embedding vector corresponding to each cluster center; and performing weighted fusion of the spatiotemporal embedding vectors based on the human motion sequence to determine the contextual cues.

[0012] In one implementation, mapping and fusing the human motion sequence and each of the cluster centers to determine the spatiotemporal embedding vector corresponding to each of the cluster centers includes: performing temporal encoding and spatial encoding on the human motion sequence to determine the spatiotemporal position code; performing high-dimensional mapping on each cluster center to determine each initial embedding vector; and introducing the spatiotemporal position code into each of the initial embedding vectors to determine the corresponding spatiotemporal embedding vector.

[0013] In one implementation, performing weighted fusion of the spatiotemporal embedding vectors based on the human motion sequence to determine the contextual cues includes: performing high-dimensional mapping on the human motion sequence to determine the query embedding; calculating the multiplicative similarity between the query embedding and each of the spatiotemporal embedding vectors, and determining the weight of each of the spatiotemporal embedding vectors based on the multiplicative similarity; and performing weighted fusion of the spatiotemporal embedding vectors to determine the contextual cues.

[0014] In a second aspect, embodiments of the present invention also provide a human pose estimation device based on multimodal perception, wherein the device includes: a two-dimensional skeleton sequence extraction module, used to obtain the human motion sequence corresponding to the target object and extract a two-dimensional skeleton sequence based on the human motion sequence; a standard dataset clustering module, used to cluster the standard dataset and determine each standard data class and its corresponding cluster center; a contextual cue determination module, used to determine contextual cues based on the human motion sequence, each of the standard data classes, and each of the cluster centers; and a three-dimensional human pose determination module, used to determine the three-dimensional human pose corresponding to the target object by using the contextual cues as prior information and employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence.

[0015] In a third aspect, embodiments of the present invention also provide a terminal, the terminal including a memory and one or more processors; the memory stores one or more programs; the programs include instructions for executing the human pose estimation method based on multimodal perception as described above; the processor is used to execute the programs.

[0016] In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing a plurality of instructions, wherein the instructions are adapted to be loaded and executed by a processor to implement any of the above-described human pose estimation methods based on multimodal perception.

[0017] The beneficial effects of this invention are as follows: This invention extracts a two-dimensional skeleton sequence based on the human motion sequence corresponding to the target object; clusters a standard dataset to determine each standard data class and its corresponding cluster center; determines contextual cues based on the human motion sequence, each standard data class, and each cluster center; and uses the contextual cues as prior information, employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence to determine the corresponding three-dimensional human pose. Because this invention determines contextual cues based on each cluster center, each standard data class, and the human motion sequence after clustering the standard dataset, it effectively solves the problem of insufficient generalization performance and robustness of the three-dimensional human pose estimation model when facing cross-domain data or complex motion tasks, due to the random selection method used in the cue retrieval stage of existing technologies.

Description of the Drawings

[0018] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments recorded in the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0019] Figure 1 is a flowchart of the human pose estimation method based on multimodal perception provided in an embodiment of the present invention.

[0020] Figure 2 is a schematic diagram of the prompt retrieval process based on random sampling provided in an embodiment of the present invention.

[0021] Figure 3 is a schematic diagram of the parameter-free prompt retrieval process provided in an embodiment of the present invention.

[0022] Figure 4 is a schematic diagram of the trainable parameterized prompt retrieval process provided in an embodiment of the present invention.

[0023] Figure 5 is a schematic diagram of the internal modules of the human pose estimation device based on multimodal perception provided in an embodiment of the present invention.

[0024] Figure 6 is a schematic diagram of the terminal provided in an embodiment of the present invention.

Detailed Implementation

[0025] This invention discloses a method, device, terminal, and medium for human pose estimation based on multimodal perception. To make the objectives, technical solutions, and effects of this invention clearer and more explicit, the invention is further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are only for explaining the invention and are not intended to limit the invention.

[0026] Those skilled in the art will understand that, unless specifically stated otherwise, the singular forms “a,” “an,” and “the” used herein may also include the plural forms. It should be further understood that the term “comprising” as used in this specification means the presence of the stated features, integers, steps, operations, elements, and / or components, but does not exclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and / or groups thereof. It should be understood that when an element is described as “connected” or “coupled” to another element, it can be directly connected or coupled to the other element, or there may be intermediate elements. Furthermore, “connected” or “coupled” as used herein can include wireless connections or wireless coupling. The term “and / or” as used herein includes all or any units and all combinations of one or more associated listed items.

[0027] It will be understood by those skilled in the art that, unless otherwise defined, all terms used herein (including technical and scientific terms) have the same meaning as commonly understood by one of ordinary skill in the art to which this invention pertains. It should also be understood that terms such as those defined in general dictionaries should be understood to have the same meaning as in the context of the prior art, and should not be interpreted in an idealized or overly formal sense unless specifically so defined herein.

[0028] Data processing for 3D human pose estimation is modally diverse, typically including a sequence of skeletal poses consisting of the 3D coordinates of multiple joints, as well as a 3D human body mesh defined by a skinned multi-person linear model, containing vertex, face, pose, and shape parameters.

[0029] To achieve unified processing of multimodal data, the In-Context Learning paradigm has been introduced in recent years. This paradigm combines domain-specific input-target pairs (i.e., hints) with query samples, enabling models to perform task-specific reasoning using example templates without explicit fine-tuning.

[0030] However, existing context learning frameworks often employ simple random selection in the prompt retrieval stage, which can easily lead to over- or under-representation of certain motion patterns, or a mismatch between the selected prompts and the query samples in terms of contextual logic. This results in insufficient generalization performance and robustness of 3D human pose estimation models when facing cross-domain data or complex motion tasks. Specifically, although random selection is simple to operate, it introduces contextual bias when processing human motion data with complex spatial and temporal structure. First, the random sampling mechanism cannot guarantee a balanced distribution of motion patterns in the sampling pool, leading to over-representation of common movements and under-representation of rare movements. This imbalanced contextual representation limits the model's ability to understand diverse human motion patterns. Second, randomly sampled contextual prompts are limited in their task guidance capability, because random selection cannot ensure that the retrieved prompts are logically relevant to the current query sample. For example, when the query sample is a "sit down" action and the system randomly retrieves a prompt for a "walking" action, a significant context gap occurs, forcing the model to reason from mismatched information and weakening its ability to generate accurate output. This lack of targeted guidance is particularly pronounced when dealing with unfamiliar new tasks.

[0031] To address the aforementioned shortcomings of existing technologies, this invention provides a human pose estimation method based on multimodal perception. The method extracts a two-dimensional skeleton sequence based on the human motion sequence corresponding to the target object; clusters a standard dataset to determine each standard data class and its corresponding cluster center; determines contextual cues based on the human motion sequence, each standard data class, and each cluster center; and uses the contextual cues as prior information, employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence to determine the corresponding three-dimensional human pose. Because this invention determines contextual cues based on each cluster center, each standard data class, and the human motion sequence after clustering the standard dataset, it effectively solves the problem of insufficient generalization performance and robustness of the three-dimensional human pose estimation model when facing cross-domain data or complex motion tasks due to the random selection method used in the cue retrieval stage of existing technologies.

[0032] Exemplary method. As shown in Figure 1, the method includes: Step S100: Obtain the human motion sequence corresponding to the target object, and extract a two-dimensional skeleton sequence based on the human motion sequence.

[0033] A continuous motion video stream of the target object is acquired using visual acquisition devices (such as monocular cameras, binocular cameras, etc.), and this motion video stream is the human motion sequence. In offline scenarios, the stored motion video file of the target object is directly read as the human motion sequence. After acquiring the human motion sequence, it can be preprocessed, including frame sampling to remove blurry, severely occluded, or invalid frames without a target object, resulting in a regularized human motion sequence. The preprocessed human motion sequence is represented as a tensor composed of multiple frames, where each frame contains multiple keypoints and their coordinates in three-dimensional space. Feature extraction is performed on the preprocessed human motion sequence to obtain a two-dimensional skeleton sequence, which is used as input data for subsequent human three-dimensional pose estimation.

[0034] In one implementation, extracting a two-dimensional skeleton sequence based on the human motion sequence includes: Step S101: Extract the two-dimensional coordinate information of each joint point in the corresponding frame based on the human motion sequence using the human key point detection algorithm; Step S102: Determine the two-dimensional skeleton sequence corresponding to the human motion sequence based on the preset human joint topology and the two-dimensional coordinate information of each joint point in each frame.

[0035] A pre-trained human keypoint detection algorithm is selected to extract human appearance features from the human motion sequence, locate each preset keypoint of the target object (such as the head, neck, shoulders, elbows, wrists, hips, knees, and ankles), and output the two-dimensional coordinate information of each keypoint, comprising a horizontal and a vertical coordinate. Confidence filtering is then applied to the output to remove keypoints whose confidence scores fall below a preset threshold. The removed low-confidence keypoints are filled in by interpolation (for example, interpolating from adjacent keypoint coordinates) to obtain a complete set of two-dimensional keypoint coordinates.
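The confidence filtering and interpolation described above could be sketched as follows. This is only an illustration under stated assumptions: the function name `fill_low_confidence`, the threshold value, and the choice of linear interpolation over time (using the nearest confident frames of the same joint) are not specified by the source, which mentions only interpolation from adjacent keypoint coordinates.

```python
import numpy as np

def fill_low_confidence(kpts, conf, threshold=0.3):
    """Replace low-confidence 2D keypoints by temporal linear interpolation.

    kpts: (T, J, 2) array of 2D coordinates; conf: (T, J) detector confidences.
    `threshold` is a hypothetical preset value, not taken from the source.
    """
    kpts = np.asarray(kpts, dtype=float)
    conf = np.asarray(conf, dtype=float)
    filled = kpts.copy()
    frames = np.arange(kpts.shape[0])
    for j in range(kpts.shape[1]):
        good = conf[:, j] >= threshold
        if not good.any():
            continue  # no reliable observation for this joint; leave it unchanged
        for c in range(2):
            # Interpolate (and clamp at the ends) from confident frames only.
            filled[:, j, c] = np.interp(frames, frames[good], kpts[good, j, c])
    return filled
```

Frames before the first or after the last confident detection take the nearest confident value, which is how `np.interp` handles points outside its sample range.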

[0036] Based on the pre-defined topological relationships of human joints (such as shoulder connecting to elbow, elbow connecting to wrist, hip connecting to knee, etc.), the completed two-dimensional joint coordinates are associated according to the connection rules to form topologically structured two-dimensional skeleton data for a single frame. The steps of single-frame detection and skeleton construction are repeated for each frame in the human motion sequence to obtain the corresponding two-dimensional skeleton data for each frame. The per-frame two-dimensional skeleton data is then arranged according to the temporal order of the frames in the human motion sequence to obtain the two-dimensional skeleton sequence corresponding to the human motion sequence.

[0037] As shown in Figure 1, the method further includes: Step S200: Cluster the standard dataset to determine each standard data class and its corresponding cluster center.

[0038] Existing context learning frameworks often employ simple random selection in the prompt retrieval stage. As shown in Figure 2, the randomly extracted training data tends to concentrate in certain motion patterns, which can easily lead to over-representation of some motion patterns, under-representation of motion patterns with limited data, or a mismatch between the selected prompts and the query samples in terms of contextual logic. This results in insufficient generalization performance and robustness of the 3D human pose estimation model when facing cross-domain data or complex motion tasks. In contrast, this method clusters the standard dataset to obtain standard data classes, each grouping similar human poses, together with the cluster center corresponding to each class. The standard dataset can be a publicly available dataset selected based on the actual application scenario. Specifically, the K-means clustering algorithm is used to divide the large amount of data in the standard dataset into multiple clusters, thus obtaining multiple standard data classes and their corresponding cluster centers.
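As a rough illustration of the K-means step, the sketch below implements Lloyd's algorithm in NumPy. It assumes each standard data item has already been flattened into a fixed-length vector (for example, a flattened pose sequence); the function name `kmeans`, the random initialization, and the iteration limit are illustrative choices, not details from the source.

```python
import numpy as np

def kmeans(data, k, iters=50, seed=0):
    """Minimal Lloyd's algorithm. data: (N, D). Returns (labels, centers)."""
    data = np.asarray(data, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize centers with k distinct samples chosen at random.
    centers = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(iters):
        dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        new_centers = centers.copy()
        for c in range(k):
            members = data[labels == c]
            if len(members):  # keep the old center if a cluster is empty
                new_centers[c] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    # Final assignment so that labels match the returned centers.
    dists = np.linalg.norm(data[:, None, :] - centers[None, :, :], axis=-1)
    return dists.argmin(axis=1), centers
```

Each resulting label group plays the role of a standard data class, and each row of `centers` is the corresponding cluster center.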

[0039] When determining contextual hints in the subsequent process, standard data that is more similar to the query embedding can be selected as contextual hints based on standard data classes and cluster centers. This avoids problems such as over-representation, under-representation, or logical mismatch in the selected contextual hints, thereby improving the generalization and robustness of the 3D human pose estimation model.

[0040] As shown in Figure 1, the method further includes: Step S300: Determine contextual prompts based on the human motion sequence, each of the standard data classes, and each of the cluster centers.

[0041] For prompt retrieval, this method employs a decoupled prompt retrieval framework. The framework consists of two phases: prompt retrieval and task completion. Prompt retrieval is in turn decoupled into two steps: prompt sampling and prompt acquisition. The goal of prompt sampling is to extract a representative subset of samples from a large-scale standard dataset (or training set) to construct a prompt pool, thereby addressing the issue of balance in context representation. In this embodiment, after clustering the standard data in the standard dataset to obtain the standard data classes and cluster centers, a prompt pool can be constructed based on these classes and centers, thus resolving the issue of sample balance within the prompt pool.

[0042] Meanwhile, the query embedding is determined based on the human motion sequence. The query embedding is matched with each standard data class and each cluster center. The most relevant examples are selected from the prompt pool as context prompts to enhance the guidance ability of context prompts for specific tasks and effectively overcome the performance bottleneck caused by the random selection of prompts in traditional methods.

[0043] In one implementation, determining contextual cues based on the human motion sequence, each of the standard data classes, and each of the cluster centers includes: Step S300a1: Based on each cluster center, select standard data from the corresponding standard data class to construct a prompt pool; Step S300a2: Perform embedding mapping on the human motion sequence to determine the query embedding; Step S300a3: Obtain contextual cues from the prompt pool based on the query embedding.

[0044] To address different application needs, the decoupled prompt retrieval framework comprises two implementation paths. The first is a parameter-free prompt retrieval method. During the sampling phase, the distance or similarity between each standard data point in a standard data class and the corresponding cluster center is calculated, and the standard data point with the smallest distance or the highest similarity is selected. A prompt pool is constructed from the standard data points selected from each standard data class to ensure the coverage and balanced distribution of motion patterns.
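The parameter-free sampling step can be sketched as picking, for each cluster, the sample closest to its center. The sketch assumes Euclidean distance and one representative per class; the name `build_prompt_pool` and the return of sample indices are illustrative, and the source equally allows a similarity measure instead of a distance.

```python
import numpy as np

def build_prompt_pool(data, labels, centers):
    """Return the index of the sample nearest to each cluster center.

    data: (N, D) standard data, labels: (N,) class index per sample,
    centers: (K, D) cluster centers. Empty classes are skipped.
    """
    data = np.asarray(data, dtype=float)
    labels = np.asarray(labels)
    pool = []
    for c, center in enumerate(np.asarray(centers, dtype=float)):
        members = np.flatnonzero(labels == c)
        if members.size == 0:
            continue
        dists = np.linalg.norm(data[members] - center, axis=1)
        pool.append(int(members[dists.argmin()]))
    return pool
```

Because every class contributes a representative, common and rare motion patterns are both present in the pool, which is the balance property the paragraph describes.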

[0045] In the acquisition phase, an encoder such as a Transformer or a spatiotemporal graph convolutional network encodes the preprocessed human motion sequence, capturing the spatiotemporal features of the motion and the topological dependencies between joints, and compresses the high-dimensional temporal data into a fixed-dimensional raw embedding vector. L2 normalization is then applied to this vector to obtain a low-dimensional, normalized query embedding that is used directly for retrieval and matching. Finally, based on the query embedding, the most similar standard data in the prompt pool is retrieved as the contextual cue, achieving context alignment during inference.

[0046] Specifically, as shown in Figure 3, the training data of the standard dataset is first clustered, and standard data is selected from each cluster to construct a prompt pool. The query embedding determined from the human motion sequence is then used to search the prompt pool, and the contextual cue corresponding to the query embedding is obtained from it.

[0047] In one implementation, obtaining contextual cues from the prompt pool based on the query embedding includes: Step S300a31: Calculate the average joint distance between the query embedding and each standard data item in the prompt pool; Step S300a32: Use the standard data item corresponding to the smallest average joint distance as the contextual cue.

[0048] In this embodiment, during the contextual cue acquisition phase, the average joint distance between the query embedding and each standard data item in the prompt pool is calculated. Specifically, the Euclidean distance between each joint coordinate of the query embedding and the corresponding joint coordinate of each standard data item is calculated. The average of these Euclidean distances is then taken to obtain the average joint distance between the query embedding and that standard data item. The standard data item with the smallest average joint distance is selected as the contextual cue, ensuring that the obtained contextual cue is logically relevant to the current query embedding. This prevents the contextual cues from lacking targeted guidance when the model processes new tasks.
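The average joint distance retrieval can be illustrated with a short NumPy sketch. It assumes the query and every pool entry are joint-coordinate arrays of identical shape `(T, J, C)` (frames, joints, coordinate dimensions); the function names are hypothetical.

```python
import numpy as np

def mean_joint_distance(query, candidate):
    """Mean Euclidean distance over all joints and frames.

    query, candidate: arrays of shape (T, J, C) with matching shapes.
    """
    diff = np.asarray(query, dtype=float) - np.asarray(candidate, dtype=float)
    return float(np.linalg.norm(diff, axis=-1).mean())

def retrieve_prompt(query, pool):
    """Return (index, distance) of the pool entry closest to the query."""
    dists = [mean_joint_distance(query, item) for item in pool]
    best = int(np.argmin(dists))
    return best, dists[best]
```

A call such as `retrieve_prompt(query, pool)` returns the position of the selected contextual cue in the pool together with its distance, so the caller can also inspect how close the best match is.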

[0049] In one implementation, determining contextual cues based on the human motion sequence, each of the standard data classes, and the corresponding cluster centers includes: Step S300b1: Map and fuse the human motion sequence and each cluster center to determine the spatiotemporal embedding vector corresponding to each cluster center; Step S300b2: Based on the human motion sequence, perform weighted fusion of each of the spatiotemporal embedding vectors to determine the contextual prompts.

[0050] Most existing methods are optimized on specific datasets, so their performance degrades significantly when the task environment or data distribution changes. To address this, the decoupled prompt retrieval framework also provides a trainable parameterized prompt retrieval method, which aims to further optimize task guidance accuracy through end-to-end learning. In this approach, the sampling step is performed by a learnable function. As shown in Figure 4, the learnable function maps the human motion sequence and the cluster centers, extracts and fuses spatiotemporal features to obtain the spatiotemporal embedding vector corresponding to each cluster center, and constructs a prompt embedding pool from these vectors. The prompt retrieval model selects spatiotemporal embedding vectors from the prompt embedding pool according to the query input (or query embedding). The selected spatiotemporal embedding vectors are then weighted by their relevance to the human motion sequence and fused, dynamically generating the contextual cue that best fits the human motion sequence. This significantly improves the model's generalization performance on complex, unseen tasks.

[0051] In one implementation, mapping and fusing the human motion sequence and each of the cluster centers to determine the spatiotemporal embedding vector corresponding to each of the cluster centers includes: Step S300b11: Perform temporal encoding and spatial encoding on the human motion sequence to determine the spatiotemporal position code; Step S300b12: Perform high-dimensional mapping on each cluster center to determine each initial embedding vector; Step S300b13: Introduce the spatiotemporal position code into each of the initial embedding vectors to determine the corresponding spatiotemporal embedding vector.

[0052] Human motion sequences include spatial information of human posture and temporal information of posture changes over time. The core of extracting spatiotemporal position codes from human motion sequences is to construct and fuse position identifiers in the temporal and spatial dimensions. Temporally, based on frame-level temporal indexing, sine / cosine coding (with strong generalization) or learnable coding is used to assign temporal position features to each frame. Spatially, based on the topological structure of human joints, the spatial position of each joint in the body is characterized through joint index embedding, joint hierarchy, or physical distance coding. Finally, the temporal and spatial position codes are added together to obtain the spatiotemporal position code, which is then integrated into the motion features. Further optimization of the code can be achieved by introducing dynamic information such as joint velocity and relative distance between frames to perceive the temporal sequence of motion and the spatial topological patterns of the joints.
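One possible form of the spatiotemporal position code is sketched below: a sine / cosine code over the frame index plus a per-joint embedding table, added together. The embedding dimension, the 10000 base frequency, and the randomly initialized `joint_table` (standing in for a learnable joint index embedding) are assumptions for illustration; the joint hierarchy, physical distance, and velocity variants mentioned above are not covered.

```python
import numpy as np

def sinusoidal_encoding(length, dim):
    """Sine / cosine position code of shape (length, dim); dim must be even."""
    pos = np.arange(length)[:, None]                              # (length, 1)
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)  # (dim / 2,)
    enc = np.zeros((length, dim))
    enc[:, 0::2] = np.sin(pos * freq)
    enc[:, 1::2] = np.cos(pos * freq)
    return enc

def spatiotemporal_encoding(num_frames, num_joints, dim, joint_table=None, seed=0):
    """Add temporal and spatial codes to get a (T, J, dim) position code."""
    temporal = sinusoidal_encoding(num_frames, dim)               # (T, dim)
    if joint_table is None:
        # Stand-in for a learnable joint index embedding (hypothetical init).
        rng = np.random.default_rng(seed)
        joint_table = rng.normal(scale=0.02, size=(num_joints, dim))
    # Broadcasting: every frame code is added to every joint code.
    return temporal[:, None, :] + joint_table[None, :, :]
```

In a trained system the joint table would be a parameter updated by backpropagation; here it is fixed only so that the sketch runs on its own.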

[0053] High-dimensional mapping of cluster centers involves projecting the cluster centers in the low-dimensional feature space onto a higher-dimensional feature space through linear transformation or nonlinear mapping networks such as multilayer perceptrons. This yields initial embedding vectors for each cluster center, enhancing feature representation and semantic information while preserving the original cluster structure and class discriminability. After layer normalization and L2 normalization, the mapped high-dimensional cluster centers become more stable and discriminative, thus better adapting to the high-dimensional feature space requirements of downstream tasks such as retrieval and generation.
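The mapping described above might look like the following sketch, which uses a single linear layer followed by layer normalization (without learnable scale and shift) and L2 normalization. The dimensions, the random weights, and the function name `map_cluster_centers` are illustrative assumptions; the source also allows a nonlinear network such as a multilayer perceptron in place of the linear layer.

```python
import numpy as np

def map_cluster_centers(centers, weight, bias, eps=1e-5):
    """Project (K, d_in) cluster centers to (K, d_out) unit-length embeddings."""
    hidden = np.asarray(centers, dtype=float) @ weight + bias   # linear projection
    mean = hidden.mean(axis=-1, keepdims=True)
    var = hidden.var(axis=-1, keepdims=True)
    hidden = (hidden - mean) / np.sqrt(var + eps)               # layer normalization
    norm = np.linalg.norm(hidden, axis=-1, keepdims=True)
    return hidden / np.maximum(norm, eps)                       # L2 normalization

# Hypothetical sizes: 3 cluster centers of dimension 4 mapped to dimension 16.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 4))
weight = rng.normal(scale=0.1, size=(4, 16))
initial_embeddings = map_cluster_centers(centers, weight, np.zeros(16))
```

After this step every initial embedding vector has unit length, so dot products between embeddings behave like cosine similarities.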

[0054] Introducing the spatiotemporal position code into each high-dimensional initial embedding vector captures the dynamic features of the human motion sequence. A cue embedding pool built from the resulting spatiotemporal embedding vectors ensures a balanced distribution of motion patterns and provides targeted guidance for 3D human pose estimation on the human motion sequence.

[0055] In one implementation, performing weighted fusion of the spatiotemporal embedding vectors based on the human motion sequence to determine the contextual cue includes: Step S300b21: perform high-dimensional mapping on the human motion sequence to determine the query embedding; Step S300b22: calculate the multiplicative similarity between the query embedding and each of the spatiotemporal embedding vectors, and determine the weight of each of the spatiotemporal embedding vectors based on the multiplicative similarity; Step S300b23: perform weighted fusion of the spatiotemporal embedding vectors to determine the contextual cue.

[0056] During the contextual cue acquisition phase, the human motion sequence is first mapped to a high-dimensional space to obtain the query embedding. The query embedding is then multiplied element-wise with each spatiotemporal embedding vector in the cue embedding pool, yielding a new vector composed of the element-wise products. This new vector is aggregated by summation, averaging, or taking a norm to obtain a scalar value, which represents the multiplicative similarity between the query embedding and that spatiotemporal embedding vector. Using this multiplicative similarity as the weight of the corresponding spatiotemporal embedding vector, a weighted fusion of the spatiotemporal embedding vectors in the cue embedding pool is performed to obtain the contextual cue. This attention-based fusion method dynamically generates a virtual cue fitted to the query, improving the model's generalization performance on complex and unseen tasks.
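By way of non-limiting illustration, the multiplicative similarity and weighted fusion described above can be sketched as follows. This is a minimal sketch under stated assumptions: summation is chosen as the aggregation (which makes the similarity a dot product), and the softmax normalization of the weights is an added assumption suggested by the "attention-based" wording rather than something the text specifies. All names and the toy vectors are hypothetical.

```python
import math

def multiplicative_similarity(query, key):
    """Element-wise product aggregated by summation, i.e. a dot product."""
    return sum(q * k for q, k in zip(query, key))

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def fuse_contextual_cue(query, pool):
    """Weight each pool vector by its similarity to the query and sum them."""
    scores = [multiplicative_similarity(query, vec) for vec in pool]
    weights = softmax(scores)  # assumption: softmax turns similarities into weights
    dim = len(pool[0])
    cue = [sum(w * vec[d] for w, vec in zip(weights, pool)) for d in range(dim)]
    return cue, weights

# Toy example: the query matches the first pool vector exactly.
query = [1.0, 0.0, 0.0]
pool = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
cue, weights = fuse_contextual_cue(query, pool)
```

In this toy case the first pool vector receives the largest weight and the other two receive equal, smaller weights, so the fused cue leans toward the best-matching motion pattern while still blending in the rest of the pool.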

[0057] As shown in Figure 1, the method further includes: Step S400: Using the contextual cues as prior information, a human pose estimation model is used to perform contextual learning based on the two-dimensional skeleton sequence to determine the three-dimensional human pose corresponding to the target object.

[0058] Specifically, the input 2D skeleton sequence (containing multi-frame joint coordinates) undergoes preprocessing, including root joint centering, temporal resampling, and coordinate normalization. At the same time, the contextual prior information is integrated. The preprocessed 2D skeleton sequence is then input into the human pose estimation model. The model first extracts the spatial topological features of the 2D joints in each frame using a convolutional neural network or graph convolution. It then uses a long short-term memory network or a Transformer encoder to capture inter-frame temporal dependencies and incorporates the predefined contextual priors to complete contextual learning. Subsequently, the model maps the 2D features to 3D space through a decoupled 3D regression branch (combining geometric constraints and motion priors), correcting the depth ambiguity caused by 2D projection. Multiplicative similarity is used to verify the degree of matching between the predicted 3D joints and the prior motion embeddings, ultimately yielding the 3D skeleton sequence. Each keypoint in the 3D skeleton sequence is represented by three spatial coordinates. Finally, the 3D skeleton sequence obtained from the regression is post-processed (for example, by smoothing filtering and physical plausibility checks), and the Skinned Multi-Person Linear (SMPL) model is adopted to describe the geometric features of the human body through pose parameters and shape parameters. Linear blend skinning is used to ensure the smoothness and consistency of joint movements within the physical structure.
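By way of non-limiting illustration, the three preprocessing operations named above (root joint centering, temporal resampling, and coordinate normalization) can be sketched as follows. This is a minimal sketch under stated assumptions: joint 0 is taken as the root, resampling uses linear interpolation between neighboring frames, and normalization divides by the largest absolute coordinate in the sequence. The text does not fix any of these choices, and all names are hypothetical.

```python
def root_center(frames, root=0):
    """Subtract the root joint's position from every joint in each frame."""
    return [[(x - f[root][0], y - f[root][1]) for x, y in f] for f in frames]

def resample(frames, target_len):
    """Linearly interpolate the sequence to a fixed number of frames."""
    if target_len == 1 or len(frames) == 1:
        return [list(frames[0]) for _ in range(target_len)]
    out = []
    for i in range(target_len):
        pos = i * (len(frames) - 1) / (target_len - 1)
        lo = int(pos)
        hi = min(lo + 1, len(frames) - 1)
        a = pos - lo  # interpolation weight toward the later frame
        out.append([
            ((1 - a) * p[0] + a * q[0], (1 - a) * p[1] + a * q[1])
            for p, q in zip(frames[lo], frames[hi])
        ])
    return out

def normalize(frames, eps=1e-8):
    """Scale coordinates so the largest absolute value in the sequence is 1."""
    scale = max(max(abs(c) for pt in f for c in pt) for f in frames)
    scale = max(scale, eps)
    return [[(x / scale, y / scale) for x, y in f] for f in frames]

# Toy sequence: 2 frames, 2 joints, pixel coordinates (hypothetical values).
raw = [[(10, 10), (12, 14)], [(20, 10), (22, 18)]]
prepared = normalize(resample(root_center(raw), target_len=3))
```

After these steps the root joint sits at the origin in every frame, the sequence has a fixed length, and all coordinates lie in [-1, 1], which removes global translation and scale before the sequence reaches the pose estimation model.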

[0059] Based on the above embodiments, the present invention also provides a human pose estimation device based on multimodal perception. As shown in Figure 5, the device includes: a two-dimensional skeleton sequence extraction module 01, used to obtain the human motion sequence corresponding to the target object and extract a two-dimensional skeleton sequence based on the human motion sequence; a standard dataset clustering module 02, used to cluster the standard dataset and determine each standard data class and its corresponding cluster center; a contextual cue determination module 03, used to determine contextual cues based on the human motion sequence, each of the standard data classes, and each of the cluster centers; and a three-dimensional human pose determination module 04, used to determine the three-dimensional human pose of the target object by using the contextual cues as prior information and employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence.
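By way of non-limiting illustration, the data flow between the four modules can be sketched as follows. This is a minimal sketch of the wiring only: the class name, method names, and the four stub functions are hypothetical placeholders, and the stubs do not implement keypoint detection, clustering, cue determination, or pose estimation.

```python
class PoseEstimationDevice:
    """Wires the four modules together in the order described in the text."""

    def __init__(self, extract_skeleton, cluster_dataset, determine_cue, estimate_pose):
        self.extract_skeleton = extract_skeleton  # module 01
        self.cluster_dataset = cluster_dataset    # module 02
        self.determine_cue = determine_cue        # module 03
        self.estimate_pose = estimate_pose        # module 04

    def run(self, motion_sequence, standard_dataset):
        skeleton_2d = self.extract_skeleton(motion_sequence)
        classes, centers = self.cluster_dataset(standard_dataset)
        cue = self.determine_cue(motion_sequence, classes, centers)
        return self.estimate_pose(skeleton_2d, cue)

# Placeholder stand-ins (assumptions, not real implementations).
def stub_extract(motion_sequence):
    return motion_sequence  # pretend the input already holds 2D joints

def stub_cluster(dataset):
    return [dataset], [dataset[0]]  # one class, its first sample as the center

def stub_cue(motion_sequence, classes, centers):
    return centers[0]  # a real module would match the query against the centers

def stub_estimate(skeleton_2d, cue):
    # A real model would use the cue as a prior; this stub only appends zero depth.
    return [[(x, y, 0.0) for x, y in frame] for frame in skeleton_2d]

device = PoseEstimationDevice(stub_extract, stub_cluster, stub_cue, stub_estimate)
motion = [[(0.0, 0.0), (1.0, 2.0)]]
dataset = [[[(0.0, 0.0), (1.0, 1.0)]]]
pose_3d = device.run(motion, dataset)
```

Keeping each module behind a callable makes it possible to swap in either of the two cue-determination variants without changing the rest of the pipeline.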

[0060] Based on the above embodiments, the present invention also provides a terminal, a schematic block diagram of which may be as shown in Figure 6. The terminal includes a processor, memory, network interface, and display screen connected via a system bus. The processor provides computing and control capabilities. The memory includes non-volatile storage media and internal memory. The non-volatile storage media stores the operating system and computer programs. The internal memory provides an environment for the operation of the operating system and computer programs in the non-volatile storage media. The network interface is used to communicate with external terminals via a network connection. When the computer program is executed by the processor, it implements a human pose estimation method based on multimodal perception. The display screen can be a liquid crystal display (LCD) or an e-ink display.

[0061] Those skilled in the art will understand that the block diagram shown in Figure 6 is merely a partial structural diagram related to the present invention and does not constitute a limitation on the terminal to which the present invention is applied. A specific terminal may include more or fewer components than those shown in the figure, or combine certain components, or have different component arrangements.

[0062] In one implementation, the terminal's memory stores one or more programs, and these programs are configured to be executed by one or more processors, and the programs contain instructions for performing a multimodal perception-based human pose estimation method.

[0063] Those skilled in the art will understand that all or part of the processes in the methods of the above embodiments can be implemented by a computer program instructing related hardware. The computer program can be stored in a non-volatile computer-readable storage medium. When executed, the computer program can include the processes of the embodiments of the above methods. Any references to memory, storage, databases, or other media used in the embodiments provided by this invention can include non-volatile and/or volatile memory. Non-volatile memory can include read-only memory (ROM), programmable ROM (PROM), electrically programmable ROM (EPROM), electrically erasable programmable ROM (EEPROM), or flash memory. Volatile memory can include random access memory (RAM) or external cache memory. By way of illustration and not limitation, RAM is available in various forms, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), Synchlink DRAM (SLDRAM), Rambus direct RAM (RDRAM), direct Rambus dynamic RAM (DRDRAM), and Rambus dynamic RAM (RDRAM).

[0064] In summary, this invention discloses a method, device, terminal, and medium for human pose estimation based on multimodal perception. The method extracts a two-dimensional skeleton sequence based on the human motion sequence corresponding to the target object; clusters a standard dataset to determine each standard data class and its corresponding cluster center; determines contextual cues based on the human motion sequence, each standard data class, and each cluster center; and uses the contextual cues as prior information, employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence to determine the corresponding three-dimensional human pose. Because this invention determines contextual cues based on each cluster center, each standard data class, and the human motion sequence after clustering the standard dataset, it effectively solves the problem of insufficient generalization performance and robustness of existing technologies that use random selection in the cue retrieval stage when facing cross-domain data or complex motion tasks.

[0065] It should be understood that the application of the present invention is not limited to the examples above. Those skilled in the art can make improvements or modifications based on the above description, and all such improvements and modifications should fall within the protection scope of the appended claims.

Claims

1. A human pose estimation method based on multimodal perception, characterized in that the method includes: obtaining the human motion sequence corresponding to the target object, and extracting a two-dimensional skeleton sequence based on the human motion sequence; clustering the standard dataset to determine each standard data class and its corresponding cluster center; determining contextual cues based on the human motion sequence, the standard data classes, and the cluster centers; and, using the contextual cues as prior information, employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence to determine the three-dimensional human pose corresponding to the target object; wherein determining the contextual cues based on the human motion sequence, the standard data classes, and the cluster centers includes: based on each cluster center, selecting standard data from the corresponding standard data class to construct a cue pool; embedding and mapping the human motion sequence to determine the query embedding; and obtaining the contextual cues from the cue pool according to the query embedding; or, alternatively, mapping and fusing the human motion sequence and each of the cluster centers to determine the spatiotemporal embedding vector corresponding to each cluster center; and performing weighted fusion of the spatiotemporal embedding vectors based on the human motion sequence to determine the contextual cues.

2. The human pose estimation method based on multimodal perception according to claim 1, characterized in that extracting a two-dimensional skeleton sequence based on the human motion sequence includes: extracting, based on the human motion sequence and using a human keypoint detection algorithm, the two-dimensional coordinate information of each joint point in the corresponding frame; and determining the two-dimensional skeleton sequence corresponding to the human motion sequence based on the preset human joint topology and the two-dimensional coordinate information of each joint point in each frame.

3. The human pose estimation method based on multimodal perception according to claim 1, characterized in that, based on each cluster center, selecting standard data from the corresponding standard data class to construct a cue pool includes: calculating the average joint distance between the query embedding and each standard data item in the cue pool; and using the standard data corresponding to the minimum average joint distance as the contextual cue.

4. The human pose estimation method based on multimodal perception according to claim 1, characterized in that mapping and fusing the human motion sequence and each cluster center to determine the spatiotemporal embedding vector corresponding to each cluster center includes: performing temporal encoding and spatial encoding on the human motion sequence to determine the spatiotemporal position code; performing high-dimensional mapping on each cluster center to determine each initial embedding vector; and introducing the spatiotemporal position code into each of the initial embedding vectors to determine the corresponding spatiotemporal embedding vector.

5. The human pose estimation method based on multimodal perception according to claim 1, characterized in that performing weighted fusion of the spatiotemporal embedding vectors based on the human motion sequence to determine the contextual cues includes: performing high-dimensional mapping on the human motion sequence to determine the query embedding; calculating the multiplicative similarity between the query embedding and each of the spatiotemporal embedding vectors, and determining the weight of each of the spatiotemporal embedding vectors based on the multiplicative similarity; and performing weighted fusion of the spatiotemporal embedding vectors to determine the contextual cues; wherein determining the contextual cues based on the human motion sequence, the standard data classes, and the cluster centers includes: based on each cluster center, selecting standard data from the corresponding standard data class to construct a cue pool; embedding and mapping the human motion sequence to determine the query embedding; and obtaining the contextual cues from the cue pool according to the query embedding; or, alternatively, mapping and fusing the human motion sequence and each of the cluster centers to determine the spatiotemporal embedding vector corresponding to each cluster center; and performing weighted fusion of the spatiotemporal embedding vectors based on the human motion sequence to determine the contextual cues.

6. A human pose estimation device based on multimodal perception, characterized in that the device includes: a two-dimensional skeleton sequence extraction module, used to obtain the human motion sequence corresponding to the target object and extract a two-dimensional skeleton sequence based on the human motion sequence; a standard dataset clustering module, used to cluster the standard dataset and determine each standard data class and its corresponding cluster center; a contextual cue determination module, used to determine contextual cues based on the human motion sequence, each of the standard data classes, and each of the cluster centers; and a three-dimensional human pose determination module, used to determine the three-dimensional human pose corresponding to the target object by using the contextual cues as prior information and employing a human pose estimation model to perform contextual learning based on the two-dimensional skeleton sequence.

7. A terminal, characterized in that the terminal includes a memory and one or more processors; the memory stores one or more programs; the programs contain instructions for executing the human pose estimation method based on multimodal perception as described in any one of claims 1-5; and the processors are used to execute the programs.

8. A computer-readable storage medium storing a plurality of instructions, characterized in that the instructions are loaded and executed by a processor to implement the steps of the human pose estimation method based on multimodal perception as described in any one of claims 1-5.