A motion synthesis method jointly driven by text and trajectory

By using a text-trajectory joint-driven approach, and leveraging text-trajectory feature encoding fusion and a dual Transformer generation module, a human pose sequence that conforms to both text semantics and trajectory constraints is generated. This solves the problem of balancing action semantics and motion trajectory in existing technologies, and achieves precise control.

CN122199759A · Pending · Publication Date: 2026-06-12 · ZHEJIANG UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: ZHEJIANG UNIV
Filing Date: 2026-03-16
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing motion synthesis technologies struggle to simultaneously address the dual constraints of action semantics and motion trajectory. Text-driven approaches cannot accurately control motion trajectories, while trajectory-driven approaches cannot clearly define action types, thus failing to meet practical application needs.

Method used

A joint text- and trajectory-driven approach is adopted, in which a text-trajectory feature encoding fusion module performs cross-modal encoding fusion. Combined with a dual Transformer generation module and a 2D residual quantization variational autoencoder, it generates a human pose sequence that conforms to the text semantics and the trajectory constraints.

Benefits of technology

It achieves precise control of both action semantics and motion path, generating human pose sequences that conform to the text semantics and accurately match the trajectory constraints, thus solving the core problem that single-modal driving cannot satisfy both constraints at once.

Generated by Eureka AI based on patent content.


Abstract

The application discloses a motion synthesis method jointly driven by text and trajectory, relates to the technical field of motion synthesis, and addresses the problem that existing text-only single-modal motion synthesis methods cannot satisfy both action semantic constraints and precise motion path control. First, training samples containing text descriptions, human 3D motion data and key joint trajectories are acquired; text semantic features and trajectory spatiotemporal features are extracted after preprocessing, and cross-modal alignment and fusion yield text-trajectory fusion features. Closely matching reference motion features are then retrieved through bidirectional momentum cross-modal retrieval. Multi-source feature fusion is performed by a spatiotemporal attention module, and a motion quantization feature sequence is generated with a dual Transformer architecture consisting of a mask Transformer and a residual Transformer. Model training is completed through 2D residual quantization encoding and decoding and a classifier-free guidance mechanism, realizing end-to-end motion synthesis from the two driving signals. The method exploits the complementary strengths of the two modalities, significantly improves the semantic plausibility, trajectory matching accuracy and motion naturalness of the generated poses, and is suitable for 3D animation, game development, virtual digital humans and other scenarios.

Description

Technical Field

[0001] This application relates to the field of motion synthesis technology, and in particular to a motion synthesis method jointly driven by text and trajectory.

Background Technology

[0002] Motion synthesis is one of the core technologies in the intersection of artificial intelligence and computer graphics. It aims to generate a three-dimensional human motion sequence that conforms to the laws of human biomechanics and is temporally coherent, based on preset driving signals and through AI algorithms. Its core challenge is how to make the generated motion semantically consistent with the input driving signals while also having physical rationality and visual naturalness, so as to provide basic technical support for virtual content creation and intelligent human-computer interaction.

[0003] Motion synthesis technology has been widely used in many fields: in 3D game development, it can quickly generate diverse actions for different characters, greatly reducing the cost of manual production and improving the efficiency of game content updates and iterations; in the field of animation production, it can automatically complete the motion content of intermediate frames based on key frames or drive signals, reducing the difficulty of complex action design and shortening the production cycle; in 3D movie special effects production, it can support the generation of virtual character actions and interaction simulation, replace some high-cost motion capture links, complete high-difficulty action scenes that are difficult to shoot in reality, and enrich visual expression forms.

[0004] Based on the type of driving signal, existing motion synthesis technologies can be categorized into text-driven, music-driven, audio-driven, and scene-driven types. Text-driven methods generate corresponding actions by parsing natural language descriptions, clearly defining action types and behavioral characteristics, but struggling to accurately describe the spatial details of motion trajectories. Trajectory-driven methods can precisely control the movement path and spatial position of digital humans, but cannot clearly define action types and behavioral characteristics. Existing motion synthesis methods are generally single-modal driven; a few methods support switching between multiple driving modes, but during the inference and generation process, they can only receive one driving signal, failing to meet the requirements of multi-dimensional constraints.

[0005] In practical applications such as 3D animation production, virtual characters often need to simultaneously meet the dual constraints of action semantics and motion trajectory. For example, they may need to perform a specific action along a designated trajectory in a crowded indoor scene. Text-driven methods alone struggle to accurately control motion trajectories, leading to path deviations; conversely, trajectory-driven methods cannot clearly define the action type, resulting in semantic discrepancies and failing to meet the demands of practical applications. Therefore, there is an urgent need for a motion synthesis method that can simultaneously integrate text and trajectory-driven signals, balancing action semantic constraints with precise motion path control.

Summary of the Invention

[0006] The purpose of this application is to provide a motion synthesis method jointly driven by text and trajectory, which can realize dual-modal driven motion synthesis that takes into account both action semantic constraints and precise control of motion path.

[0007] To achieve the above objectives, this application provides the following solution: A motion synthesis method jointly driven by text and trajectory includes the following steps: Step S1: Obtain training samples containing text descriptions, 3D motion data of human body movements, and path trajectories of key human joints.

[0008] Step S2: Perform standardized preprocessing on the training samples using the input preprocessing module, and extract the text semantic features corresponding to the preprocessed text description and the trajectory spatiotemporal features corresponding to the waypoint trajectory.

[0009] Step S3: Use the text-trajectory feature encoding fusion module to perform dimensional alignment and cross-modal encoding fusion on the text semantic features and trajectory spatiotemporal features to obtain text-trajectory fused features.

[0010] Step S4: Using text-trajectory fusion features as query features, cross-modal retrieval is performed in the pre-built text-trajectory-motion multimodal retrieval library through the retrieval enhancement matching module combined with bidirectional momentum text-trajectory-motion modeling technology to obtain reference motion features.

[0011] Step S5: The text-trajectory spatiotemporal attention module is used to perform multi-source feature fusion on the text-trajectory fusion features and the reference motion features. The dual Transformer generation module is combined to complete the global modeling and fine-grained optimization of the motion features and generate a motion quantization feature sequence. The dual Transformer generation module includes a mask Transformer and a residual Transformer.

[0012] Step S6: Based on the motion quantization feature sequence, the motion features are encoded and decoded and optimized using a 2D residual quantization variational autoencoder module. A classifier-free guidance mechanism is introduced to strengthen the driving signal constraints, and the motion generation model is trained to obtain the trained motion generation model. The motion generation model includes an input preprocessing module, a text-trajectory feature encoding fusion module, a retrieval enhancement matching module, a text-trajectory spatiotemporal attention module, a dual Transformer generation module, a classifier-free guidance constraint module, and a 2D residual quantization variational autoencoder module.

[0013] Step S7: Obtain the target motion description text and the trajectory of the key joint path points of the target human body, input them into the trained motion generation model, and output a human pose sequence that conforms to the semantics of the text and matches the trajectory constraints.

[0014] Optionally, in step S1, the key joints of the human body include six joints: pelvis, head, left hand, right hand, left foot, and right foot. The trajectory spatiotemporal features are extracted by a pre-trained trajectory encoder through time window compression and joint position projection. The text semantic features are extracted by a pre-trained CLIP text encoder and then projected by a projection layer to the same dimension as the trajectory spatiotemporal features. The text semantic features are prepended to the trajectory spatiotemporal features, and after position encoding, Transformer encoding, and core feature extraction, the text-trajectory fusion features are obtained.

[0015] Optionally, in step S4, the text-trajectory-motion multimodal retrieval library stores text-trajectory fusion features and corresponding motion features, and establishes an index based on feature similarity; the bidirectional momentum text-trajectory-motion modeling technology realizes cross-modal contrastive learning of text-trajectory-motion through dual momentum encoders and negative sample queues. The dual momentum encoders include a text-trajectory momentum encoder and a motion momentum encoder, and the negative sample queues include a text-trajectory fusion feature queue and a motion feature queue.

[0016] Optionally, in step S4, the text-trajectory fusion feature is used as the input to the query matrix Q, the concatenation result of the input motion feature, the text-trajectory fusion feature and the retrieved cross-modal feature is used as the input to the key matrix K, and the concatenation result of the input motion feature, the text-trajectory fusion feature and the retrieved motion feature is used as the input to the value matrix V, thereby completing the feature matching for cross-modal retrieval.

[0017] Optionally, in step S5, the text-trajectory spatiotemporal attention module receives the input motion features, text-trajectory fusion features, and retrieved reference motion features, calculates the query matrix, key matrix, and value matrix respectively, and completes the deep fusion of multi-source features through self-attention calculation.

[0018] Optionally, in step S5, the mask Transformer uses a dynamic masking strategy to mask the input motion features and model the global structure of the motion sequence; the residual Transformer refines the motion detail features layer by layer according to the quantization levels of the 2D residual quantization variational autoencoder.

[0019] Optionally, the 2D residual quantization variational autoencoder module includes a 2D convolutional encoder, a multi-layer residual quantization module, and a 2D convolutional decoder. The 2D convolutional encoder encodes the input motion sequence to obtain 2D latent features, the residual quantization module performs residual quantization processing on the 2D latent features, and the 2D convolutional decoder decodes the quantized features to reconstruct the human motion sequence.
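As an illustration only, the residual quantization step that sits between the encoder and the decoder can be sketched in Python as below. The sketch assumes nearest-neighbour lookup by squared Euclidean distance and quantizes one latent vector at a time (each cell of the 2D latent features would be handled the same way). The names `nearest` and `residual_quantize` and the toy codebooks are hypothetical and do not come from this application.

```python
def nearest(codebook, vector):
    """Index of the codebook entry closest to vector (squared Euclidean distance)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(codebook[i], vector)),
    )


def residual_quantize(vector, codebooks):
    """Quantize vector layer by layer; each layer encodes what the previous layers missed."""
    residual = list(vector)
    quantized = [0.0] * len(vector)
    indices = []
    for codebook in codebooks:
        idx = nearest(codebook, residual)
        code = codebook[idx]
        indices.append(idx)
        quantized = [q + c for q, c in zip(quantized, code)]
        residual = [r - c for r, c in zip(residual, code)]
    return indices, quantized


# Toy two-layer codebooks for 2-dimensional latent vectors (hypothetical values).
codebooks = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],   # layer 0: coarse codes
    [[0.5, 0.5], [0.25, -0.5], [0.0, 0.0]],  # layer 1: refinement codes
]
indices, quantized = residual_quantize([1.3, -0.4], codebooks)
```

The per-layer indices are the kind of discrete tokens a generator could predict: the first layer carries the coarse structure and later layers carry progressively finer corrections.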

[0020] Optionally, the classifier-free guidance mechanism amplifies the constraint weights of the text and trajectory driving signals by weighting the feature distributions predicted with and without the driving-signal constraints, thereby balancing the matching degree and diversity of generated motions.

[0021] Optionally, step S7 specifically includes the following steps: Step S71: The input preprocessing module is used to perform standardization preprocessing on the acquired target motion description text and the trajectory of the key joint path points of the target human body, respectively, to obtain standardized text data and standardized trajectory data that are adapted to the input format of the motion generation model.

[0022] Step S72: Use the text-trajectory feature encoding fusion module to extract the target text semantic features corresponding to the standardized text data and the target trajectory spatiotemporal features corresponding to the standardized trajectory data. Then, perform cross-modal dimension alignment and deep fusion on the target text semantic features and the target trajectory spatiotemporal features to obtain the target text-trajectory fusion features.

[0023] Step S73: Using the target text-trajectory fusion feature as the query feature, input it into the pre-built text-trajectory-motion multimodal retrieval library, and perform cross-modal retrieval through the retrieval enhancement matching module to obtain target reference motion features with high semantic consistency and high trajectory fit.

[0024] Step S74: Use the text-trajectory spatiotemporal attention module to perform multi-source feature fusion on the target text-trajectory fusion features and the target reference motion features. Use the trained dual Transformer generation module to complete the global modeling and fine-grained optimization of motion features and generate the target motion quantization feature sequence.

[0025] Step S75: The classifier-free guidance constraint module strengthens the driving-signal constraints on the target motion quantization feature sequence, the 2D residual quantization variational autoencoder module decodes and reconstructs it, and the human pose sequence that conforms to the text semantics and matches the trajectory constraints is output.

[0026] Optionally, in step S73, cross-modal retrieval is performed using bidirectional momentum text-trajectory-motion modeling technology. The bidirectional similarity and component-level similarity between the query features and the features in the retrieval library are calculated. Motion features that simultaneously meet the bidirectional feature similarity threshold and the component-level similarity variance threshold are selected from the pre-built text-trajectory-motion multimodal retrieval library. After sorting by comprehensive similarity in descending order, the features of the Top-K samples are selected as target reference motion features.

[0027] According to the specific embodiments provided in this application, the following technical effects are disclosed. This application provides a motion synthesis method jointly driven by text and trajectory, which includes the following steps:

Step S1: Training samples containing text descriptions, 3D motion data of human body movements, and trajectories of key joint path points are acquired, providing a complete data foundation for training the dual-modal jointly driven model.

Step S2: The training samples undergo standardized preprocessing through an input preprocessing module to eliminate data scale differences and invalid redundant information, and the text semantic features corresponding to the text descriptions and the trajectory spatiotemporal features corresponding to the path point trajectories are extracted, giving an accurate core representation of the dual-modal driving signal.

Step S3: The two types of features undergo dimensional alignment and cross-modal encoding fusion through a text-trajectory feature encoding fusion module to obtain text-trajectory fusion features, binding text semantics closely to trajectory spatiotemporal information and overcoming the fragmentation of cross-modal features.

Step S4: Using the text-trajectory fusion features as query features, cross-modal retrieval is performed in a pre-built text-trajectory-motion multimodal retrieval library through a retrieval enhancement matching module combined with bidirectional momentum text-trajectory-motion modeling technology. Reference motion features with high semantic consistency and high trajectory fit are matched, providing reliable priors for motion generation, improving the efficiency of motion generation, and helping ensure the rationality and naturalness of the generated motion.

Step S5: The text-trajectory fusion features and the reference motion features undergo multi-source deep fusion through a text-trajectory spatiotemporal attention module. The fused features are then processed by a dual Transformer generation module (including a mask Transformer and a residual Transformer), which performs global structural modeling and fine-grained detail optimization of the motion features and generates a motion quantization feature sequence that balances global semantic consistency and local motion naturalness.

Step S6: Based on the motion quantization feature sequence, a 2D residual quantization variational autoencoder performs high-quality encoding and decoding optimization of the motion features. A classifier-free guidance mechanism is introduced to strengthen the constraints of the text and trajectory driving signals, completing the training of the motion generation model. This model integrates all of the aforementioned functional modules, enabling end-to-end dual-modal motion synthesis.

Step S7: The target motion description text and the trajectory of the target key joints are acquired and input into the trained motion generation model, which outputs a human pose sequence that conforms to the text semantics and accurately matches the trajectory constraints. This solves the core problem that text-only single-modal driving cannot satisfy both action semantic constraints and precise motion path control.

Attached Figure Description

[0028] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the embodiments will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0029] Figure 1 is a flowchart of a motion synthesis method jointly driven by text and trajectory, provided in an embodiment of this application.

[0030] Figure 2 is a schematic diagram of the six key joints in a motion synthesis method jointly driven by text and trajectory, provided in an embodiment of this application.

[0031] Figure 3 is a schematic diagram of the process of fusing text and trajectory features in a motion synthesis method jointly driven by text and trajectory, provided in an embodiment of this application.

[0032] Figure 4 is a schematic diagram of the process of text-trajectory-motion contrastive learning retrieval enhancement matching in a motion synthesis method jointly driven by text and trajectory, provided in an embodiment of this application.

[0033] Figure 5 is a schematic diagram of the structure of the text-trajectory-motion contrastive learning retrieval enhancement matching module in a motion synthesis method jointly driven by text and trajectory, provided in an embodiment of this application.

[0034] Figure 6 is a schematic diagram of the process of multi-source feature fusion to generate motion using a dual Transformer architecture in a motion synthesis method jointly driven by text and trajectory, provided in an embodiment of this application.

[0035] Figure 7 is a schematic diagram of the experimental results of a motion synthesis method jointly driven by text and trajectory, provided in an embodiment of this application, when synthesizing motion for a specified text and trajectory.

Detailed Implementation

[0036] The technical solutions of the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, and not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0037] To make the above-mentioned objectives, features and advantages of this application more apparent and understandable, the application will be further described in detail below with reference to the accompanying drawings and specific embodiments.

[0038] First, the core terms in the embodiments of this application are defined: SMPL: Skinned Multi-Person Linear model, a parametric 3D human body model.

[0039] CLIP: Contrastive Language-Image Pre-training, a pre-trained model whose text encoder is used in this application to extract semantic features of text.

[0040] RAG: Retrieval-Augmented Generation, a technique that uses an external knowledge base to provide prior references for the generation process.

[0041] TT-BMM: Bidirectional Momentum Text-Trajectory-Motion Modeling, a technique used to achieve cross-modal retrieval and matching.

[0042] TT-STA: Text-Trajectory Spatial-Temporal Attention, a text-trajectory spatiotemporal attention mechanism used to achieve deep fusion of multi-source features.

[0043] 2D RVQ-VAE: 2D Residual Vector Quantized Variational Autoencoder, used to achieve high-quality encoding and decoding of motion features.

[0044] CFG: Classifier-Free Guidance, a mechanism that strengthens the constraint effect of the input driving signals without relying on a classifier.

[0045] This application provides a motion synthesis method jointly driven by text and trajectory. In one exemplary embodiment, as shown in Figure 1, it includes the following steps: Step S1: Acquire training samples containing text descriptions, 3D motion data of human body movements, and path point trajectories of key human joints. Specifically, in step S1, six key human joints are selected: pelvis, head, left hand, right hand, left foot, and right foot; their skeleton tree structure and corresponding positions in the digital human model are shown in Figure 2. The trajectory spatiotemporal features are extracted by a pre-trained trajectory encoder through time window compression and joint position projection; the text semantic features are extracted by a pre-trained CLIP text encoder and then projected by a projection layer to the same dimension as the trajectory spatiotemporal features; the text semantic features are prepended to the trajectory spatiotemporal features, and after position encoding, Transformer encoding and core feature extraction, the text-trajectory fusion features are obtained.

[0046] Step S2: Use the input preprocessing module to perform standardized preprocessing on the training samples, and extract the text semantic features corresponding to the preprocessed text descriptions and the trajectory spatiotemporal features corresponding to the waypoint trajectories. During the preprocessing in step S2, the global mean and standard deviation of the trajectory data are calculated, and the motion sequences and trajectory sequences are trimmed and normalized. The normalization formula is as follows:

$$\hat{x} = \frac{x - \mu}{\sigma + 10^{-6}}$$

[0047] Here, $x$ is the value before normalization, $\mu$ and $\sigma$ are the global mean and standard deviation, and $10^{-6}$ is a smoothing term that prevents the denominator from being zero.
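The normalization in step S2 can be sketched in Python as follows. This is a minimal illustration, assuming per-dimension statistics computed over every frame of the training set and the 1e-6 term added to the standard deviation in the denominator; the function names `global_stats` and `normalize` and the toy data are hypothetical.

```python
import math

EPS = 1e-6  # smoothing term that keeps the denominator away from zero


def global_stats(sequences):
    """Per-dimension mean and standard deviation over every frame of every sequence."""
    frames = [frame for seq in sequences for frame in seq]
    n, dim = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    return mean, std


def normalize(seq, mean, std):
    """Apply (x - mean) / (std + EPS) to each frame of one sequence."""
    return [
        [(x - m) / (s + EPS) for x, m, s in zip(frame, mean, std)]
        for frame in seq
    ]


# Two toy trajectory sequences; the second dimension is constant on purpose,
# so its standard deviation is zero and only EPS keeps the division defined.
data = [[[0.0, 1.0], [2.0, 1.0]], [[4.0, 1.0]]]
mean, std = global_stats(data)
normalized = normalize(data[0], mean, std)
```

The constant second dimension shows why the smoothing term matters: without it, a dimension with zero variance would cause a division by zero.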

[0048] Step S3: Use the text-trajectory feature encoding fusion module to perform dimensional alignment and cross-modal encoding fusion on the text semantic features and trajectory spatiotemporal features to obtain text-trajectory fused features.
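To make the token layout of the fusion input concrete, the following Python sketch projects a text feature and compressed trajectory tokens to a shared width and places the text token first. It is a sketch under stated assumptions: time window compression is approximated by averaging non-overlapping windows, the projections are plain linear maps with hypothetical weights, and the position encoding and Transformer encoder that follow are omitted. The names `compress_windows`, `project` and `build_fusion_input` are illustrative only.

```python
def compress_windows(trajectory, window):
    """Average consecutive frames in non-overlapping windows (assumed form of compression)."""
    tokens = []
    for start in range(0, len(trajectory) - window + 1, window):
        chunk = trajectory[start:start + window]
        dim = len(chunk[0])
        tokens.append([sum(frame[d] for frame in chunk) / window for d in range(dim)])
    return tokens


def project(vector, weights):
    """Linear projection; weights holds one row per output dimension."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weights]


def build_fusion_input(text_feature, trajectory, window, w_text, w_traj):
    """Project both modalities to a shared width and prepend the text token."""
    text_token = project(text_feature, w_text)
    traj_tokens = [project(t, w_traj) for t in compress_windows(trajectory, window)]
    return [text_token] + traj_tokens


# Toy inputs: a 3-d stand-in for a text embedding and a 4-frame, 2-d trajectory.
text_feature = [1.0, 0.0, 2.0]
trajectory = [[0.0, 0.0], [2.0, 2.0], [4.0, 4.0], [6.0, 6.0]]
w_text = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]  # 3 -> 2 dimensions
w_traj = [[1.0, 0.0], [0.0, 1.0]]            # identity, already 2-d
tokens = build_fusion_input(text_feature, trajectory, 2, w_text, w_traj)
```

In this toy run the sequence holds one text token followed by two trajectory tokens, all of width 2, which is the shape a sequence encoder would then consume.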

[0049] Current retrieval enhancement techniques in motion synthesis include: 1) Single-modal retrieval: building a retrieval library based solely on text or a single trajectory driving signal and matching reference features; 2) Unstructured retrieval: directly building a retrieval library based on raw data without structuring, resulting in low retrieval efficiency; 3) Traditional similarity matching: using simple cosine similarity calculation for feature matching without considering bidirectional association across multiple modalities. The bidirectional momentum text-trajectory-motion modeling (TT-BMM) cross-modal retrieval strategy proposed in this embodiment, which integrates the RAG structured database, can effectively improve retrieval accuracy and provide reliable prior guidance for the generation stage based on both semantic and trajectory fit.

[0050] Step S4: Using text-trajectory fusion features as query features, cross-modal retrieval is performed in the pre-built text-trajectory-motion multimodal retrieval library through the retrieval enhancement matching module combined with bidirectional momentum text-trajectory-motion modeling technology to obtain reference motion features.

[0051] Specifically, the pre-built text-trajectory-motion multimodal retrieval library stores text-trajectory fusion features and corresponding motion features, and establishes an index based on feature similarity; the bidirectional momentum text-trajectory-motion modeling technology realizes cross-modal contrastive learning of text-trajectory-motion through dual momentum encoders and negative sample queues. The dual momentum encoders include a text-trajectory momentum encoder and a motion momentum encoder, and the negative sample queues include a text-trajectory fusion feature queue and a motion feature queue.
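The mechanics of a momentum encoder and a negative sample queue can be sketched in Python as below. The application does not give update rules or a loss, so this sketch assumes the widely used pattern of an exponential moving average for the momentum encoder parameters, a fixed-size first-in-first-out queue for negatives, and an InfoNCE-style contrastive loss. The momentum value, temperature, queue size and all names are hypothetical.

```python
import math
from collections import deque


def momentum_update(online_params, momentum_params, m=0.99):
    """Exponential moving average: momentum <- m * momentum + (1 - m) * online."""
    return [m * k + (1.0 - m) * q for q, k in zip(online_params, momentum_params)]


class FeatureQueue:
    """Fixed-size FIFO of past features that serve as negative samples."""

    def __init__(self, max_size):
        self.items = deque(maxlen=max_size)

    def enqueue(self, features):
        self.items.extend(features)  # oldest entries are dropped automatically

    def negatives(self):
        return list(self.items)


def info_nce(query, positive, negatives, temperature=0.07):
    """Contrastive loss for one query: -log softmax probability of the positive."""
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))

    logits = [dot(query, positive) / temperature]
    logits += [dot(query, n) / temperature for n in negatives]
    peak = max(logits)
    log_sum = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum - logits[0]


# Toy usage: a motion-feature queue supplies negatives for a text-trajectory query.
motion_queue = FeatureQueue(max_size=2)
motion_queue.enqueue([[0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]])
loss = info_nce([1.0, 0.0], [1.0, 0.0], motion_queue.negatives())
```

Under these assumptions, a bidirectional objective would evaluate the loss twice: once with the text-trajectory fusion feature as the query against the motion feature queue, and once with the motion feature as the query against the text-trajectory fusion feature queue.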

[0052] In an exemplary embodiment, text-trajectory fusion features are used as input to the query matrix Q, the concatenation result of the input motion features, text-trajectory fusion features and retrieved cross-modal features is used as input to the key matrix K, and the concatenation result of the input motion features, text-trajectory fusion features and retrieved motion features is used as input to the value matrix V, thereby completing the feature matching for cross-modal retrieval.
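The assembly of Q, K and V described in this embodiment can be sketched in Python with plain scaled dot-product attention. The sketch assumes that "concatenation" means stacking the feature groups along the token axis and that all features already share one width; the learned projection matrices that would normally precede attention are left out, and the toy values and names are hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-width row vectors."""
    scale = math.sqrt(len(queries[0]))
    outputs, weights = [], []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in keys]
        w = softmax(scores)
        weights.append(w)
        outputs.append(
            [sum(wi * v[d] for wi, v in zip(w, values)) for d in range(len(values[0]))]
        )
    return outputs, weights


# Toy 2-d features for each source (hypothetical values).
motion = [[0.1, 0.2], [0.3, 0.1]]     # input motion features
fusion = [[1.0, 0.0]]                 # text-trajectory fusion features
retrieved_cross_modal = [[0.9, 0.1]]  # retrieved cross-modal features
retrieved_motion = [[0.2, 0.8]]       # retrieved motion features

q = fusion
k = motion + fusion + retrieved_cross_modal  # stacked along the token axis
v = motion + fusion + retrieved_motion
fused, attn = attention(q, k, v)
```

Because K and V are built from the same three groups in the same order, each attention weight pairs a key token with the value token at the same position, so the retrieved cross-modal feature decides how much of the retrieved motion feature flows into the output.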

[0053] Existing motion synthesis methods fuse multi-source information through the following approaches: 1) using a single Transformer architecture for feature fusion, which has the drawback of struggling to balance global structure and local details; 2) fusion based on a single feature branch using attention calculation, which has the drawback of not fully utilizing multi-source information; and 3) directly generating motion based on fused features without setting targeted constraint enhancement mechanisms, resulting in low consistency between the generated result and the input control signal. This embodiment proposes to combine a TT-STA three-branch attention mechanism, a dual Transformer architecture, and a CFG constraint enhancement mechanism to achieve efficient fusion of multi-source control information, generating a desired pose sequence that meets the control signal requirements and exhibits natural movements.

[0054] Step S5: The text-trajectory spatiotemporal attention module is used to perform multi-source feature fusion on the text-trajectory fusion features and the reference motion features. The dual Transformer generation module is combined to complete the global modeling and fine-grained optimization of the motion features and generate a motion quantization feature sequence. The dual Transformer generation module includes a mask Transformer and a residual Transformer.

[0055] Specifically, the text-trajectory spatiotemporal attention module receives input motion features, text-trajectory fusion features, and retrieved reference motion features, calculates the query matrix, key matrix, and value matrix respectively, and completes the deep fusion of multi-source features through self-attention calculation.

[0056] The Mask Transformer uses a dynamic masking strategy to mask the input motion features and model the global structure of the motion sequence; the Residual Transformer optimizes the motion detail features layer by layer according to the quantization levels of the 2D residual quantization variational autoencoder.
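One possible reading of a "dynamic masking strategy" is that the fraction of masked tokens changes from one training step to the next. The Python sketch below assumes a cosine schedule over a uniform random draw, a choice borrowed from masked generative modeling in general rather than stated in this application; `MASK_ID` and `dynamic_mask` are hypothetical names.

```python
import math
import random

MASK_ID = -1  # placeholder id standing in for a learned [MASK] token


def dynamic_mask(tokens, rng):
    """Mask a randomly sized subset of tokens; the ratio follows a cosine schedule."""
    ratio = math.cos(0.5 * math.pi * rng.random())  # value in (0, 1]
    n_mask = max(1, round(ratio * len(tokens)))     # always mask at least one token
    positions = set(rng.sample(range(len(tokens)), n_mask))
    masked = [MASK_ID if i in positions else t for i, t in enumerate(tokens)]
    return masked, sorted(positions)


rng = random.Random(0)  # seeded only to make the example repeatable
tokens = list(range(10))  # stand-in for base-layer motion token ids
masked, positions = dynamic_mask(tokens, rng)
```

A model trained this way would learn to predict the original ids at `positions` from the unmasked context, which is how the global structure of the sequence would be modeled; the residual layers would then be predicted on top of the completed base layer.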

[0057] Step S6: Based on the motion quantization feature sequence, the motion features are encoded and decoded and optimized using a 2D residual quantization variational autoencoder module. A classifier-free guidance mechanism is introduced to strengthen the driving signal constraints, and the motion generation model is trained to obtain the trained motion generation model. The motion generation model includes an input preprocessing module, a text-trajectory feature encoding fusion module, a retrieval enhancement matching module, a text-trajectory spatiotemporal attention module, a dual Transformer generation module, a classifier-free guidance constraint module, and a 2D residual quantization variational autoencoder module.

[0058] The 2D residual quantization variational autoencoder module includes a 2D convolutional encoder, a multi-layer residual quantization module, and a 2D convolutional decoder. The 2D convolutional encoder encodes the input motion sequence to obtain 2D latent features, the residual quantization module performs residual quantization processing on the 2D latent features, and the 2D convolutional decoder decodes the quantized features to reconstruct the human motion sequence. As for the classifier-free guidance mechanism, it amplifies the constraint weights of the text and trajectory driving signals by weighting the feature distributions predicted with and without the driving-signal constraints, balancing the matching degree and diversity of the generated motion.
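A common way to write classifier-free guidance is to extrapolate from the unconstrained prediction toward the constrained one. The Python sketch below assumes that form and operates on per-token logits; the function name `guided_logits` and the guidance scale values are hypothetical, and the application does not state its exact weighting.

```python
def guided_logits(conditional, unconditional, scale):
    """Blend predictions: unconditional + scale * (conditional - unconditional)."""
    return [u + scale * (c - u) for c, u in zip(conditional, unconditional)]


# Toy logits over two candidate motion tokens (hypothetical values).
cond = [2.0, 0.0]    # predicted with the text and trajectory signals
uncond = [1.0, 1.0]  # predicted with the driving signals dropped

plain = guided_logits(cond, uncond, 1.0)      # scale 1 reproduces the conditional prediction
amplified = guided_logits(cond, uncond, 3.0)  # scale > 1 amplifies the driving-signal effect
```

With this form, a larger scale pushes generation toward closer agreement with the text and trajectory at the cost of diversity, while a scale near 1 keeps more variety, which matches the trade-off between matching degree and diversity described above.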

[0059] Step S7: Obtain the target motion description text and the trajectory of the key joint path points of the target human body, input them into the trained motion generation model, and output a human pose sequence that conforms to the text semantics and matches the trajectory constraints. In this embodiment, step S7 specifically includes the following steps: Step S71: The input preprocessing module is used to perform standardization preprocessing on the acquired target motion description text and the trajectory of the key joint path points of the target human body, respectively, to obtain standardized text data and standardized trajectory data that are adapted to the input format of the motion generation model.

[0060] Step S72: Use the text-trajectory feature encoding fusion module to extract the target text semantic features corresponding to the standardized text data and the target trajectory spatiotemporal features corresponding to the standardized trajectory data. Then, perform cross-modal dimension alignment and deep fusion on the target text semantic features and the target trajectory spatiotemporal features to obtain the target text-trajectory fusion features.

[0061] Step S73: Using the target text-trajectory fusion feature as the query feature, input it into the pre-built text-trajectory-motion multimodal retrieval library, and perform cross-modal retrieval through the retrieval enhancement matching module to obtain target reference motion features with high semantic consistency and high trajectory fit.

[0062] Specifically, in step S73, cross-modal retrieval is performed using bidirectional momentum text-trajectory-motion modeling technology. The bidirectional similarity and component-level similarity between the query features and the features in the retrieval library are calculated. Motion features that simultaneously meet the bidirectional feature similarity threshold and the component-level similarity variance threshold are selected from the pre-built text-trajectory-motion multimodal retrieval library. After sorting the features in descending order of comprehensive similarity, the features of the Top-K samples are selected as target reference motion features.

[0063] Step S74: Use the text-trajectory spatiotemporal attention module to perform multi-source feature fusion on the target text-trajectory fusion features and the target reference motion features. Use the trained dual Transformer generation module to complete the global modeling and fine-grained optimization of motion features and generate the target motion quantization feature sequence.

[0064] Step S75: The target motion quantization feature sequence is enhanced by driving signal constraints and reconstructed by the 2D residual quantization variational autoencoder module and the classifier-free guided constraint module, and the human pose sequence that conforms to the text semantics and matches the trajectory constraints is output.

[0065] The motion synthesis considered in this embodiment does not involve interaction with objects or fine-grained finger movements. In principle, all other general motion types can be generated by the motion synthesis method proposed in this embodiment. The core of this method consists of three parts: "encoding text and trajectory, extracting text-trajectory fusion features" (corresponding to steps S2-S3), "constructing a RAG retrieval library and matching reference motion features" (corresponding to step S4), and "generating the desired digital human pose sequence based on a TT-STA dual Transformer architecture" (corresponding to steps S5-S6). The actual performance depends mainly on the specific implementation of these three parts and on the specific content and complexity of the motion and trajectory instructions. Supplementary explanations are provided below for each part.

Regarding the first part, "encoding text and trajectory, extracting text-trajectory fusion features": as shown in Figure 3, this part is the pre-processing stage of the motion synthesis jointly driven by text and trajectory in this application. Its core is to construct a closed-loop logic of "data standardization → single-modal encoding → text-trajectory fusion" based on the input preprocessing module and the text-trajectory feature encoding fusion module (comprising a pre-trained CLIP text encoding submodule, a trajectory encoding submodule, and a text and trajectory feature fusion submodule). The steps are broken down in detail below.

First, the original dataset is loaded. In this embodiment, the motion files and text files of the HumanML3D dataset are read. The text files provide textual descriptions of the motion, while the motion files provide 3D motion data of human body movements. Samples whose motion sequence length is less than a minimum threshold or greater than a maximum threshold are filtered out. The 3D coordinate sequences of six key human joints (as shown in Figure 2) are extracted from the filtered valid motion data to obtain the trajectory data of the key joints' path points, and then the global mean and standard deviation of the trajectory data are calculated.

[0066] The pre-trained CLIP text encoding submodule (ViT-B/32 version) is loaded, and the input text description is tokenized and truncated. The processed text tensor is then input into the pre-trained CLIP text encoding submodule, which outputs 512-dimensional original text semantic features through forward propagation. These text features are then projected through a linear layer to 256 dimensions, yielding a text feature sequence suited to subsequent feature fusion.

[0067] The trajectory encoding submodule divides the trajectory sequence into blocks according to time windows. The trajectory values of the 4 time steps within each window are averaged, and the original number of time steps T is compressed into... The compressed trajectory features are projected through a linear layer to obtain 256-dimensional trajectory hidden features, and temporal position encoding is added to output a trajectory feature sequence with position information.
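The trajectory encoding described above can be illustrated with a minimal NumPy sketch. It is not the implementation of this embodiment: the non-overlapping windows, the dropping of a trailing partial window, the sine/cosine form of the temporal position encoding, and the random matrix W standing in for the learned linear layer are all assumptions made for illustration.

```python
import numpy as np

def window_average(traj, window=4):
    """Average non-overlapping windows of `window` frames: (T, D) -> (T // window, D)."""
    T, D = traj.shape
    usable = (T // window) * window            # assumption: drop a trailing partial window
    return traj[:usable].reshape(-1, window, D).mean(axis=1)

def sinusoidal_encoding(length, dim):
    """Sine/cosine temporal position encoding of shape (length, dim); dim must be even."""
    pos = np.arange(length)[:, None]
    freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
    pe = np.zeros((length, dim))
    pe[:, 0::2] = np.sin(pos * freq)
    pe[:, 1::2] = np.cos(pos * freq)
    return pe

rng = np.random.default_rng(0)
T, n_joints, hidden = 64, 6, 256
traj = rng.normal(size=(T, n_joints * 3))                  # 6 key joints x (x, y, z) per frame
W = rng.normal(scale=0.02, size=(n_joints * 3, hidden))    # stand-in for the learned linear layer
b = np.zeros(hidden)

compressed = window_average(traj)                          # window averaging over 4 time steps
traj_features = compressed @ W + b + sinusoidal_encoding(len(compressed), hidden)
print(traj_features.shape)
```

In a trained model, W and b would be learned parameters rather than random values.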

[0068] After its dimension has been unified, the text feature sequence is concatenated in front of the trajectory feature sequence to form a text-trajectory fusion feature sequence, achieving a preliminary integration of text semantics and trajectory spatiotemporal information. After positional encoding is added, the fusion feature sequence is fed into a Transformer encoder based on a self-attention mechanism. Multi-layer self-attention modeling captures the global correlation between text features and trajectory features, achieving deep interaction and feature aggregation of the text-trajectory information. Core aggregated features are extracted from the encoded fusion sequence and fed into a projection module consisting of a linear layer, a LayerNorm layer, a GELU layer, and a Dropout layer. These layers sequentially perform feature dimension mapping, normalization, non-linear transformation, and regularization, ultimately outputting a 512-dimensional text-trajectory fusion feature with unified dimensions and strong semantic correlation.

[0069] Regarding the second part, "constructing a RAG retrieval library and matching reference motion features": as shown in Figure 4, retrieval-enhanced matching is the core execution link in this embodiment. A retriever combined with the RAG structured database is constructed to form a text-trajectory-motion cross-modal retrieval library. Cross-modal retrieval is performed through the TT-BMM technology, and the retrieved features are input into the TT-STA module. The steps are broken down in detail as follows:

1. Component loading and environment initialization

First, the 2D RVQ-VAE model is loaded. It consists of three parts: a 2D convolutional encoder, an (L+1)-layer residual quantization module (L is the number of residual quantization layers), and a 2D convolutional decoder. The 2D convolutional encoder comprises four 2D convolutional layers (with GELU activation and LayerNorm normalization), whose parameter configurations are shown in Table 1. Its input is the original motion sequence (T is the number of motion frames and J is the number of joints; the number of latent feature channels takes a reference value of 64 in this embodiment), and its output is the 2D latent feature.

[0070] Table 1. Parameter Configuration of Convolutional Encoder Model for Two-Dimensional Residual Vector Quantization Variational Autoencoder

[0071] The (L+1)-layer residual quantization module performs residual quantization on the 2D latent features and outputs the accumulated quantized features. The 2D convolutional decoder, designed symmetrically to the encoder, decodes and reconstructs the quantized 2D latent features into a motion sequence. The formula for calculating the loss function is as follows:

[0072] Here, the reconstruction term is an L1-norm loss. The quantization term is an L2-norm loss between the residual input of each layer and the quantization result of that layer; a stop-gradient operation is used to avoid the gradient discontinuity caused by discrete quantization. The L2-norm loss coefficient takes a reference value of 0.1 in this embodiment.
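As a hedged illustration of the residual quantization and loss described above, the following NumPy sketch quantizes latent vectors layer by layer and evaluates a loss with an L1 reconstruction term and a weighted L2 quantization term. The function names, the nearest-neighbour codebook lookup, the use of means in both terms, and the flattening of the 2D latent feature into rows are assumptions; the stop-gradient has no effect in plain NumPy and is only indicated in a comment.

```python
import numpy as np

def residual_quantize(z, codebooks):
    """Quantize z (N, C) with a stack of codebooks, each of shape (K, C).

    Returns the accumulated quantized features, the per-layer code indices,
    and the per-layer (residual input, quantization result) pairs.
    """
    residual = z.copy()
    accumulated = np.zeros_like(z)
    indices, pairs = [], []
    for codebook in codebooks:
        # nearest codebook entry for each residual vector
        dists = ((residual[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        idx = dists.argmin(axis=1)
        quantized = codebook[idx]
        pairs.append((residual, quantized))
        indices.append(idx)
        accumulated = accumulated + quantized
        residual = residual - quantized        # the next layer quantizes what is left
    return accumulated, indices, pairs

def rvq_loss(x, x_rec, pairs, beta=0.1):
    """L1 reconstruction term plus beta times the summed L2 quantization terms."""
    recon = np.abs(x - x_rec).mean()
    # in an autodiff framework the quantization result would be wrapped in stop-gradient
    quant = sum(((r - q) ** 2).mean() for r, q in pairs)
    return recon + beta * quant

rng = np.random.default_rng(0)
z = rng.normal(size=(10, 64))                              # 10 latent vectors, 64 channels
codebooks = [rng.normal(size=(32, 64)) for _ in range(3)]  # base layer + 2 residual layers
z_q, codes, pairs = residual_quantize(z, codebooks)
print(z_q.shape, len(codes), rvq_loss(z, z_q, pairs))
```

Here the loss is evaluated directly on the latent vectors for brevity; in the described model the reconstruction term would compare the decoded motion sequence with the input motion sequence.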

[0073] Second, the retrieval threshold (bidirectional similarity ≥ 0.9) and the number of Top-K matches (K takes a reference value of 3 in this embodiment) are configured to balance batch retrieval efficiency and matching accuracy.

[0074] Next, the TT-BMM components are initialized. They mainly include:

1) Dual momentum encoders: a text-trajectory momentum encoder and a motion momentum encoder. The encoder of the 2D RVQ-VAE model is the basic feature encoding network. The encoders are responsible for encoding the text-trajectory input and the motion input into the corresponding feature vectors, respectively. In the dual-momentum mechanism, each encoder has a corresponding momentum copy encoder, forming a complete dual-momentum encoder architecture.

2) Momentum copy encoders: a text-trajectory momentum copy encoder and a motion momentum copy encoder, corresponding to the two encoders above. Their core function is to maintain the temporal consistency of negative sample features. In contrastive learning, negative sample features need to keep a stable distribution during training. Through slow and smooth parameter updates, the momentum copies ensure the temporal continuity of negative sample features generated at different times, thereby improving the learning stability of the contrastive loss.

3) Dual negative sample queues: a text-trajectory fusion feature queue and a motion feature queue. The queue capacity is fixed at 65536. This storage structure is designed to solve the problem of an insufficient number of negative samples in contrastive learning. It continuously stores historical negative sample features generated by the momentum copy encoders and removes the oldest features when the queue is full, thereby maintaining a large-scale, dynamically updated negative sample pool.
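The negative sample queue can be sketched as a fixed-capacity ring buffer. This is an illustrative sketch only; the class name FeatureQueue and its methods are hypothetical and are not taken from this application.

```python
import numpy as np

class FeatureQueue:
    """Fixed-capacity FIFO buffer of negative-sample features (hypothetical helper)."""

    def __init__(self, capacity, dim):
        self.buffer = np.zeros((capacity, dim), dtype=np.float32)
        self.capacity = capacity
        self.ptr = 0       # next write position
        self.size = 0      # number of valid rows

    def enqueue(self, feats):
        # write row by row so that wrap-around overwrites the oldest entries first
        for f in feats:
            self.buffer[self.ptr] = f
            self.ptr = (self.ptr + 1) % self.capacity
            self.size = min(self.size + 1, self.capacity)

    def negatives(self):
        """Return the currently stored features, shape (size, dim)."""
        return self.buffer[: self.size]

tt_queue = FeatureQueue(capacity=65536, dim=512)       # text-trajectory fusion feature queue
motion_queue = FeatureQueue(capacity=65536, dim=512)   # motion feature queue
tt_queue.enqueue(np.ones((8, 512), dtype=np.float32))
print(tt_queue.negatives().shape)
```

In a training loop, features produced by the momentum copy encoders for each batch would be enqueued after the contrastive loss of that batch has been computed.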

[0075] Finally, the RAG core feature file and the corresponding metadata file are loaded. These files contain all the historical feature data required for retrieval and provide an external knowledge source for the subsequent retrieval and generation processes. A mapping index containing two types of features is constructed: 1) text-trajectory fusion features, which are the encoding results of the text-trajectory momentum encoder; 2) motion features: the whole-body motion data is divided into local motion sequences of 6 body parts (as shown in part (a) of Figure 5), the sequence of each part is encoded by the motion encoder to obtain local features, and finally the features of all parts are concatenated to obtain the component-level motion features. Metadata is used to enhance the interpretability and efficiency of retrieval. Each sample ID is associated with metadata such as the source of the original data, the preprocessing parameters, and the feature generation version.

[0076] 2. Cross-modal contrastive retrieval

After the components are loaded and the environment is initialized as described above, the loaded TT-BMM receives the text-trajectory fusion features output by the previous stage and checks their dimension and normalization. It then loads the structured database of the joint RAG. Based on the momentum update mechanism, the momentum encoder parameters are dynamically updated to ensure the temporal stability of negative sample features and to match the TT-BMM execution logic. The update formula is as follows:

[0077] .

[0078] Here, m is the momentum coefficient, which takes a reference value of 0.99 in this embodiment; the remaining parameter symbols denote the parameters of the online text-trajectory fusion encoder and of the online motion encoder, respectively.
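A momentum update of this kind is commonly written as an exponential moving average of the online parameters. The sketch below assumes that form (momentum parameters ← m × momentum parameters + (1 − m) × online parameters); the dictionary layout and the parameter names are hypothetical.

```python
import numpy as np

def momentum_update(momentum_params, online_params, m=0.99):
    """Exponential moving average: theta_mom <- m * theta_mom + (1 - m) * theta_online."""
    return {name: m * momentum_params[name] + (1.0 - m) * online_params[name]
            for name in momentum_params}

# toy parameters for one encoder and its momentum copy
online = {"proj.weight": np.ones((2, 2))}
momentum = {"proj.weight": np.zeros((2, 2))}
momentum = momentum_update(momentum, online)   # applied once per training step
print(momentum["proj.weight"])
```

With m = 0.99 the momentum copy moves only 1% of the way toward the online encoder per step, which is what keeps the queued negative features consistent over time.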

[0079] With the text-trajectory fusion feature as the query, its cosine similarity to the RAG motion features is calculated. Component-level similarity weighting is also introduced to enhance the fine-grained matching capability under the dual-modal constraints. The calculation formula is as follows:

[0080] Here, the global-local feature balance coefficient takes a reference value of 0.6 in this embodiment; the query is the normalized text-trajectory fusion feature; the formula also involves the global motion feature matrix of the RAG library and the local feature matrix of the i-th body part; d is the feature dimension, which takes a reference value of 512 in this embodiment.
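One plausible reading of this weighting is a convex combination of the global cosine similarity and the mean of the per-part cosine similarities, balanced by the coefficient 0.6. The NumPy sketch below assumes exactly that form, which may differ from the formula of this embodiment; all function and variable names are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def combined_similarity(query, global_feats, part_feats, lam=0.6):
    """query: (d,), global_feats: (N, d), part_feats: (P, N, d).

    Returns the combined similarity (N,) and the per-part similarities (P, N).
    """
    q = l2_normalize(query)
    global_sim = l2_normalize(global_feats) @ q      # cosine similarity to global motion features
    part_sim = l2_normalize(part_feats) @ q          # cosine similarity to each body part
    combined = lam * global_sim + (1.0 - lam) * part_sim.mean(axis=0)   # assumed weighting
    return combined, part_sim

rng = np.random.default_rng(0)
d, n_library, n_parts = 512, 100, 6
query = rng.normal(size=d)
combined, part_sim = combined_similarity(
    query, rng.normal(size=(n_library, d)), rng.normal(size=(n_parts, n_library, d)))
print(combined.shape, part_sim.shape)
```

The per-part similarities are returned separately because their variance across body parts is used later as a filtering criterion.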

[0081] The reverse similarity, from the RAG motion features to the text-trajectory fusion feature used as the query, is calculated using the following formula:

[0082] Here, the temperature coefficient takes a reference value of 0.07 in this embodiment and is used to scale the similarity distribution to improve distinguishability.

[0083] Candidate samples whose positive-pair similarity is significantly higher than their negative-pair similarity are selected (this is the optimization objective during the training phase and the sample quality criterion during the inference phase). The loss function is calculated as follows:

[0084] .

[0085] .

[0086] Here, the symbols denote, in order: the similarity between the motion feature of the i-th sample and its matched text-trajectory fusion feature (positive sample pair); the similarity between the motion feature of the i-th sample and the j-th feature in the text-trajectory negative sample queue (negative sample pair); the similarity between the text-trajectory fusion feature of the i-th sample and its matched motion feature (positive sample pair); the similarity between the text-trajectory fusion feature of the i-th sample and the j-th feature in the motion negative sample queue (negative sample pair); the text-trajectory queue; and the motion queue.
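The positive and negative pairs defined above match the structure of a queue-based InfoNCE contrastive loss. The following NumPy sketch assumes that form, with the temperature 0.07 and an equal-weight average of the two directions; the averaging and all names are assumptions rather than the formula of this embodiment.

```python
import numpy as np

def info_nce(anchor, positive, queue, tau=0.07):
    """anchor, positive: (B, d) L2-normalized; queue: (K, d) negatives. Returns the mean loss."""
    pos = (anchor * positive).sum(axis=1, keepdims=True) / tau    # (B, 1) positive-pair similarity
    neg = anchor @ queue.T / tau                                   # (B, K) negative-pair similarities
    logits = np.concatenate([pos, neg], axis=1)
    logits -= logits.max(axis=1, keepdims=True)                    # numerical stability
    log_prob = logits[:, 0] - np.log(np.exp(logits).sum(axis=1))
    return -log_prob.mean()

def bidirectional_loss(tt, motion, tt_queue, motion_queue, tau=0.07):
    loss_tt2m = info_nce(tt, motion, motion_queue, tau)   # text-trajectory -> motion
    loss_m2tt = info_nce(motion, tt, tt_queue, tau)       # motion -> text-trajectory
    return 0.5 * (loss_tt2m + loss_m2tt)                  # assumption: equal-weight average

def unit(x):
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
tt = unit(rng.normal(size=(4, 512)))
motion = unit(tt + 0.1 * rng.normal(size=(4, 512)))       # matched pairs are close by construction
loss = bidirectional_loss(tt, motion,
                          unit(rng.normal(size=(256, 512))),
                          unit(rng.normal(size=(256, 512))))
print(loss)
```

The loss approaches zero when each positive pair is far more similar than every queued negative, which is the selection criterion stated above.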

[0087] Candidate samples are retained only if they simultaneously satisfy the bidirectional similarity threshold and the component-level similarity variance threshold, which ensures coordinated movement of all body parts. The retained candidates are sorted in descending order of mean similarity, and the top-3 samples are selected as the optimal reference features to balance matching accuracy and motion diversity.
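The filtering and Top-K selection can be sketched as follows. The similarity threshold 0.9 and K = 3 are the reference values of this embodiment; the variance threshold value var_thresh is hypothetical, and the mean of the two directional similarities is assumed as the sorting score.

```python
import numpy as np

def select_references(sim_fwd, sim_bwd, part_sim, k=3, sim_thresh=0.9, var_thresh=0.05):
    """sim_fwd, sim_bwd: (N,) directional similarities; part_sim: (P, N) per-part similarities.

    Returns the library indices of up to k reference samples, best first.
    """
    keep = ((sim_fwd >= sim_thresh) & (sim_bwd >= sim_thresh)
            & (part_sim.var(axis=0) <= var_thresh))     # var_thresh is a hypothetical value
    candidates = np.flatnonzero(keep)
    score = 0.5 * (sim_fwd + sim_bwd)[candidates]       # mean of the two directions
    order = np.argsort(-score, kind="stable")           # descending order of mean similarity
    return candidates[order[:k]]

sim_fwd = np.array([0.95, 0.99, 0.50, 0.92, 0.97])
sim_bwd = np.array([0.93, 0.98, 0.99, 0.91, 0.96])
part_sim = np.array([[0.9, 0.9, 0.9, 0.9, 0.0],
                     [0.9, 0.9, 0.9, 0.9, 1.0]])        # the last sample has inconsistent parts
print(select_references(sim_fwd, sim_bwd, part_sim))
```

If fewer than K candidates pass both thresholds, this sketch simply returns the ones that do.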

[0088] 3. Multi-feature fusion and deep correlation of motion features based on the TT-STA module

Part (b) of Figure 5 shows the structure of the TT-STA module. The TT-STA module is called to deeply fuse the text-trajectory fusion feature tt, the input motion feature z, and the Top-3 reference sample features obtained by cross-modal retrieval into a unified fusion feature, calculated using the following formulas:

[0089] .

[0090] .

[0091] .

[0092] Here, the motion feature z, obtained by encoding the input raw motion data, is used as the input to the query matrix Q; the concatenation of the motion feature z, the text-trajectory fusion feature tt, and the retrieved cross-modal features is used as the input to the key matrix K; the concatenation of the motion feature z, the text-trajectory fusion feature tt, and the retrieved motion features is used as the input to the value matrix V; d is the feature dimension, which takes a reference value of 512 in this embodiment; P denotes the 2D positional encoding, which is generated by applying sine functions along both the time axis (T, the number of frames) and the spatial axis (J, the number of joints), resulting in a final dimension of...
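The roles of Q, K, and V described above can be illustrated with a single-head scaled dot-product attention in NumPy. This is a sketch under several assumptions: the positional encoding P is added to the motion tokens only, half of the channels encode the time axis and half the joint axis, sine and cosine terms are interleaved, and the projection matrices Wq, Wk, and Wv are random stand-ins for learned weights.

```python
import numpy as np

def positional_encoding_2d(T, J, d):
    """Sinusoidal encoding over time and joints, returned as (T * J, d); d must be divisible by 4."""
    def enc(n, dim):
        pos = np.arange(n)[:, None]
        freq = np.exp(-np.log(10000.0) * np.arange(0, dim, 2) / dim)
        pe = np.zeros((n, dim))
        pe[:, 0::2] = np.sin(pos * freq)
        pe[:, 1::2] = np.cos(pos * freq)
        return pe
    half = d // 2
    time_pe = np.repeat(enc(T, half)[:, None], J, axis=1)     # (T, J, d/2)
    joint_pe = np.repeat(enc(J, half)[None], T, axis=0)       # (T, J, d/2)
    return np.concatenate([time_pe, joint_pe], axis=-1).reshape(T * J, d)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def tt_sta(z, tt, ref_tt, ref_motion, Wq, Wk, Wv, pos):
    """z: (N, d) motion tokens; tt: (1, d); ref_tt, ref_motion: (R, d) retrieved features."""
    d = z.shape[-1]
    zp = z + pos
    Q = zp @ Wq                                        # queries from the motion features
    K = np.concatenate([zp, tt, ref_tt]) @ Wk          # keys: motion + tt + retrieved cross-modal features
    V = np.concatenate([zp, tt, ref_motion]) @ Wv      # values: motion + tt + retrieved motion features
    return softmax(Q @ K.T / np.sqrt(d)) @ V           # (N, d) fused features

rng = np.random.default_rng(0)
T, J, d, R = 4, 3, 8, 3                                # small sizes for illustration (the text uses d = 512)
Wq, Wk, Wv = (rng.normal(scale=0.1, size=(d, d)) for _ in range(3))
out = tt_sta(rng.normal(size=(T * J, d)), rng.normal(size=(1, d)),
             rng.normal(size=(R, d)), rng.normal(size=(R, d)),
             Wq, Wk, Wv, positional_encoding_2d(T, J, d))
print(out.shape)
```

Because K and V are built from concatenations of the same length, each motion token attends jointly to the other motion tokens, the text-trajectory condition, and the R retrieved references.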

[0093] As shown in part (c) of Figure 5, the features fused by TT-STA are concatenated into a unified reference feature, which is input into the dual Transformer architecture after L2 normalization.

[0094] Table 2 compares the retrieval performance of the TT-BMM module with that of other benchmark retrieval methods. The best values in the table are shown in bold. The metrics are as follows: the model parameter count is the total number of trainable parameters of the model, in M (millions), and reflects the complexity and computational cost of the model; recall at rank 1 (R1) is the probability that the correctly matched sample appears in the Top-1 retrieval results, where ↑ indicates that a higher value means better retrieval accuracy; R2, R3, R5, and R10 are defined analogously to R1; median rank (MedR) is the median of the ranks at which the correct match appears over all test samples, where ↓ indicates that a lower value means the correct match is ranked nearer the top and the overall performance is better.
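The recall and median-rank metrics defined above can be computed from a query-by-gallery similarity matrix. The sketch below assumes that the correct match of query i is gallery item i; the function name is hypothetical, and the toy matrix is illustrative data, not a result of this application.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 2, 3, 5, 10)):
    """sim[i, j]: similarity of query i to gallery item j; the true match of query i is item i."""
    order = np.argsort(-sim, axis=1)                                        # best match first
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1) + 1    # 1-based rank of the true match
    metrics = {f"R{k}": 100.0 * np.mean(ranks <= k) for k in ks}            # recall in percent
    metrics["MedR"] = float(np.median(ranks))
    return metrics

toy_sim = np.array([[0.9, 0.1, 0.0],
                    [0.8, 0.5, 0.1],
                    [0.1, 0.9, 0.2]])
toy_metrics = retrieval_metrics(toy_sim)
print(toy_metrics)
```

In this toy example the true matches are ranked 1, 2, and 2, so one of three queries counts toward R1 and the median rank is 2.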

[0095] Table 2 Performance evaluation results of different methods for cross-modal motion retrieval tasks

[0096] Experimental results show that the TT-BMM module significantly outperforms the benchmark methods in both types of retrieval tasks. In the condition-to-motion retrieval task, TT-BMM achieves an R1 of 45.28%, an improvement of 31.52 percentage points over the best-performing benchmark model, BMM (13.76%). The MedR drops to 2.00, a decrease of 14.00 compared with BMM (16.00), which means that the correctly matched samples in the TT-BMM retrieval results are mostly ranked within the Top-2. In the motion-to-condition retrieval task, all recall metrics of TT-BMM reach their best values simultaneously, with an R10 as high as 89.28% and a MedR of 2.00. This indicates that the text-trajectory constraints strengthen the bidirectional association between motion features and condition features. Although the number of parameters of TT-BMM (252.62M) is slightly higher than that of BMM (238M), the ratio of its performance gain to the increase in parameters is significantly higher than that of the other models, demonstrating that TT-BMM also has a technical advantage in balancing parameter efficiency and performance.

[0097] Regarding the third part, "generating the desired digital human pose sequence based on a TT-STA dual Transformer architecture": as shown in Figure 6, the heterogeneous multimodal feature fusion generation stage is the final execution link in the process of jointly driving human motion synthesis with text and trajectory. The detailed steps are broken down as follows: the HumanML3D dataset is preprocessed and divided into training, validation, and test sets in an 8:1:1 ratio to provide a data foundation for subsequent model training and evaluation. The reference features output by the retrieval-enhanced matching stage are input into the mask Transformer. The key submodule TT-STA deeply fuses the reference features, the text-trajectory embeddings, and the motion features output by the 2D RVQ-VAE, resulting in a unified feature representation containing both semantic and spatiotemporal information.

[0098] Subsequently, 2D dynamic random masking (a masking strategy that proceeds first along the time dimension and then along the spatial dimension) is applied to the base-layer motion tokens after fusion to generate a corrupted sequence. The model is trained to recover the complete original motion sequence from the unmasked features, thereby enhancing robustness under the dual-modal constraints of text and trajectory. The corresponding masking loss is constructed using the negative log-likelihood:

[0099] Here, the probability term denotes the probability that the model predicts the original complete token sequence given the corrupted masked sequence, the text-trajectory embedding tt, the reference text-trajectory fusion features, and the reference motion features.
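A minimal sketch of a time-then-space masking step and the masked negative log-likelihood is given below. The masking ratios, the MASK_ID value, and the rule of masking a fixed number of joints inside each selected frame are assumptions made for illustration; the dynamic masking schedule of this embodiment is not reproduced here.

```python
import numpy as np

MASK_ID = -1   # hypothetical id of the mask token

def mask_2d(tokens, time_ratio, joint_ratio, rng):
    """tokens: (T, J) integer codes. Frames are picked first, then joints inside each picked frame."""
    T, J = tokens.shape
    mask = np.zeros((T, J), dtype=bool)
    n_frames = max(1, int(round(time_ratio * T)))
    n_joints = max(1, int(round(joint_ratio * J)))
    for t in rng.choice(T, size=n_frames, replace=False):               # step 1: time dimension
        mask[t, rng.choice(J, size=n_joints, replace=False)] = True     # step 2: spatial dimension
    return np.where(mask, MASK_ID, tokens), mask

def masked_nll(log_probs, targets, mask):
    """log_probs: (T, J, V). Mean negative log-likelihood over the masked positions only."""
    picked = np.take_along_axis(log_probs, targets[..., None], axis=-1)[..., 0]
    return -picked[mask].mean()

rng = np.random.default_rng(0)
tokens = np.arange(24).reshape(6, 4)                    # toy base-layer tokens, T = 6, J = 4
corrupted, mask = mask_2d(tokens, time_ratio=0.5, joint_ratio=0.5, rng=rng)
uniform = np.full((6, 4, 24), -np.log(24.0))            # a model that predicts uniformly over 24 codes
print(int(mask.sum()), masked_nll(uniform, tokens, mask))
```

A uniform predictor gives a loss of log V over the masked positions, which serves as a simple baseline when checking an implementation.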

[0100] The mask Transformer outputs the base-layer quantization tokens of the 2D RVQ-VAE, which are a coarse-grained structural representation of the motion. Starting from these tokens, the residual Transformer performs layer-by-layer autoregressive prediction of the upper-layer residual quantization tokens of the 2D RVQ-VAE: it takes the accumulated motion tokens of the preceding layers, the current layer index, and the text-trajectory condition tt as input, and applies residual correction and refinement to the 2D RVQ-VAE encoded features so as to gradually approximate the real motion sequence. The corresponding residual loss is constructed using the negative log-likelihood:

[0101] Here, the probability term is the conditional probability distribution with which the model predicts the residual tokens of the current layer, given the accumulated motion tokens of the preceding layers, the current layer index, and the text-trajectory condition tt.

[0102] After the residual tokens of all layers have been predicted, the token prediction distribution output by the residual Transformer is first constrained and guided using the CFG mechanism. Then, the final complete quantized token sequence is fed into the 2D RVQ-VAE decoder to obtain the final motion sequence. The CFG mechanism defines the guidance condition con as the constraint set formed by the text-trajectory fusion feature tt, the retrieved text-trajectory features, and the retrieved motion features:

[0103] CFG applies weighted guidance to the token prediction logits output by the residual Transformer (the logits are the unnormalized scores of each token in the 2D RVQ-VAE codebook, output by the residual Transformer). The calculation formula is:

[0104] Here, the two logits terms are the token prediction logits output by the residual Transformer under the constraint con and under the unconstrained condition, respectively; s is the guidance coefficient, which takes a reference value of 4 in this embodiment.
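One common formulation of classifier-free guidance on logits is (1 + s) × conditional logits − s × unconditional logits, which equals the conditional logits plus s times their difference from the unconditional logits. The sketch below assumes this form, which may differ in detail from the formula of this embodiment; the function name is hypothetical.

```python
import numpy as np

def cfg_logits(cond_logits, uncond_logits, s=4.0):
    """Guided logits: (1 + s) * cond - s * uncond, i.e. cond + s * (cond - uncond)."""
    return (1.0 + s) * cond_logits - s * uncond_logits

cond = np.array([2.0, 0.0, -1.0])     # logits predicted under the constraint con
uncond = np.array([1.0, 0.0, 0.0])    # logits predicted without constraints
guided = cfg_logits(cond, uncond)
print(guided)
```

Tokens that the constraint makes more likely are pushed up further, and tokens it makes less likely are pushed down, before the final token is selected.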

[0105] By amplifying the difference between the logits of constrained generation and those of unconstrained generation, CFG suppresses feature outputs that deviate from the text-trajectory constraints, so that the final decoded motion sequence closely fits the text semantics and is consistent with the preset key joint trajectory path, thereby generating a high-quality, highly controllable human pose sequence.

[0106] Table 3 presents the quantitative evaluation results of the motion synthesis method proposed in this application (named T2ReMoMask) and other motion synthesis methods on the HumanML3D dataset. The metrics are as follows: retrieval precision (R-Precision Top1 / Top2 / Top3) measures the semantic matching degree between the generated motion and the text instruction, where TopN is the probability that the generated motion appears in the TopN real-motion retrieval list corresponding to the text, and ↑ indicates that a higher value means stronger semantic alignment; FID measures the distribution similarity between the generated motion and the real motion, where ↓ indicates that a smaller value is closer to real motion (FID ≈ 0.002 for Real Motions) and means higher generation quality; motion matching distance (MM Dist) measures the geometric feature difference (such as joint trajectory and temporal dynamics) between the generated motion and the real motion, where ↓ indicates that a smaller value means higher fidelity of the motion morphology; motion diversity (Diversity) measures the degree of feature difference among generated motion samples, where → indicates that, provided the real distribution is matched, a value closer to that of real motion (Real Motions Diversity ≈ 9.503) is better; MultiModality measures the ability to generate different motion samples under the same text instruction, where ↑ indicates that a higher value means richer motion expression for the same semantics.

[0107] Table 3 Evaluation results of different motion synthesis methods based on the HumanML3D dataset

[0108] As shown in Table 3, T2ReMoMask achieves a Top3 retrieval precision of 0.817, the highest among all methods and slightly better than ReMoMask (0.813) and MoGenTS (0.812). In contrast to earlier benchmark methods (such as Text2Gesture, with a Top3 precision of 0.345), the retrieval precision of the proposed method is close to that of real motion (0.797), demonstrating the performance gain brought by cross-modal modeling. The FID of T2ReMoMask is 0.039, close to that of MoGenTS (0.033) and MoMask (0.045) and far lower than that of earlier benchmark methods (such as Text2Gesture, with an FID of 7.664), which shows that the distribution of the generated motion is highly similar to that of real motion. The motion matching distance is 2.871, which is basically on par with real motion (2.974) and with the best method (MoGenTS, 2.867), indicating that the geometric fidelity of the motion morphology reaches an excellent level. The motion diversity of T2ReMoMask is 9.7965, close to the 9.503 of real motion, indicating that the generated samples are sufficiently diverse and that no mode collapse occurs. The motion multimodality is 2.825, one of the highest values among all methods (comparable to the 2.823 of ReMoMask and the 2.816 of ReMoGPT), indicating that T2ReMoMask has a strong ability to generate differentiated motions under the same text instruction.

[0109] The experimental results above show that the motion synthesis method proposed in this application is among the best in terms of retrieval accuracy, FID, motion matching distance, motion diversity, and motion multimodality. It achieves a good balance between semantic alignment, generation quality, and motion diversity, and can serve as a reliable technical support for multimodal signal-driven motion generation tasks.

[0110] Figure 7 visually illustrates the proposed text- and trajectory-driven motion synthesis method. From top to bottom, the content is as follows: the text description "a person walks forward while raising one hand" and the trajectory of the pelvic joint path points are used as the joint driving signal input; the motion synthesis method proposed in the above embodiments of this application generates a human pose sequence that meets the preset constraints; and the results of motion synthesis are presented in top and front views. As can be seen from the embodiments, the method proposed in the above embodiments of this application deeply aligns and fuses the semantic goals and geometric constraints of motion generation at the feature level. By fully leveraging the synergistic effect of the text semantics and the joint path point trajectory as driving signals, the method improves the quality of the generated poses and broadens the applicability of motion synthesis.

[0111] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0112] This document uses specific examples to illustrate the principles and implementation methods of this application. The descriptions of the above embodiments are only for the purpose of helping to understand the methods and core ideas of this application. Furthermore, those skilled in the art will recognize that, based on the ideas of this application, there will be changes in the specific implementation methods and application scope. Therefore, the content of this specification should not be construed as a limitation of this application.

Claims

1. A motion synthesis method jointly driven by text and trajectory, characterized in that it comprises:
Step S1: Obtain training samples containing text descriptions, 3D motion data of human body movements, and path trajectories of key human joints;
Step S2: Perform standardized preprocessing on the training samples using the input preprocessing module, and extract the text semantic features corresponding to the preprocessed text description and the trajectory spatiotemporal features corresponding to the waypoint trajectory, respectively;
Step S3: Use the text-trajectory feature encoding fusion module to perform dimensional alignment and cross-modal encoding fusion on the text semantic features and the trajectory spatiotemporal features to obtain text-trajectory fusion features;
Step S4: Using the text-trajectory fusion features as query features, perform cross-modal retrieval in the pre-built text-trajectory-motion multimodal retrieval library through the retrieval enhancement matching module combined with bidirectional momentum text-trajectory-motion modeling technology to obtain reference motion features;
Step S5: Use the text-trajectory spatiotemporal attention module to perform multi-source feature fusion on the text-trajectory fusion features and the reference motion features, and combine the dual Transformer generation module to complete the global modeling and fine-grained optimization of motion features to generate a motion quantization feature sequence, wherein the dual Transformer generation module includes a mask Transformer and a residual Transformer;
Step S6: Based on the motion quantization feature sequence, encode and decode the motion features using a 2D residual quantization variational autoencoder module, and introduce a classifier-free guidance mechanism to strengthen the driving signal constraints, thereby completing the training of the motion generation model and obtaining the trained motion generation model, wherein the motion generation model includes an input preprocessing module, a text-trajectory feature encoding fusion module, a retrieval enhancement matching module, a text-trajectory spatiotemporal attention module, a dual Transformer generation module, a classifier-free guidance constraint module, and a 2D residual quantization variational autoencoder module;
Step S7: Obtain the target motion description text and the trajectory of the key joint path points of the target human body, input them into the trained motion generation model, and output a human pose sequence that conforms to the semantics of the text and matches the trajectory constraints.

2. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that, In step S1, the key joints of the human body include six joints: pelvis, head, left hand, right hand, left foot, and right foot. The trajectory spatiotemporal features are extracted by a pre-trained trajectory encoder through time window compression and joint position projection. The text semantic features are extracted by a pre-trained CLIP text encoder and then unified with the dimensions of the trajectory spatiotemporal features through a projection layer. The text semantic features are pre-concatenated to the trajectory spatiotemporal features, and after position encoding, Transformer encoding, and core feature extraction, the text-trajectory fusion features are obtained.

3. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that, In step S4, the text-trajectory-motion multimodal retrieval library stores text-trajectory fusion features and corresponding motion features, and establishes an index based on feature similarity; The bidirectional momentum text-trajectory-motion modeling technology achieves cross-modal comparative learning of text-trajectory-motion through dual momentum encoders and a negative sample queue. The dual momentum encoders include a text-trajectory momentum encoder and a motion momentum encoder, and the negative sample queue includes a text-trajectory fusion feature queue and a motion feature queue.

4. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that, In step S4, the text-trajectory fusion feature is used as the input to the query matrix Q, the concatenation result of the input motion feature, text-trajectory fusion feature and retrieved cross-modal feature is used as the input to the key matrix K, and the concatenation result of the input motion feature, text-trajectory fusion feature and retrieved motion feature is used as the input to the value matrix V, thus completing the feature matching for cross-modal retrieval.

5. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that, In step S5, the text-trajectory spatiotemporal attention module receives the input motion features, text-trajectory fusion features, and retrieved reference motion features, calculates the query matrix, key matrix, and value matrix respectively, and completes the deep fusion of multi-source features through self-attention calculation.

6. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that, In step S5, the masking Transformer uses a dynamic masking strategy to mask the input motion features and model the global structure of the motion sequence; the residual Transformer optimizes the motion detail features layer by layer according to the quantization level of the 2D residual quantization variational autoencoder.

7. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that the 2D residual quantization variational autoencoder module includes a 2D convolutional encoder, a multi-layer residual quantization module, and a 2D convolutional decoder; the 2D convolutional encoder encodes the input motion sequence to obtain 2D latent features; the residual quantization module performs residual quantization on the 2D latent features; and the 2D convolutional decoder decodes the quantized features to reconstruct the human motion sequence.
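
The multi-layer residual quantization step of claim 7 can be sketched for a single latent vector as follows. The convolutional encoder and decoder are omitted, and the codebooks are toy values chosen for illustration rather than taken from the patent.

```python
def nearest_code(vec, codebook):
    """Index of the codebook entry closest to vec in squared Euclidean distance."""
    best_i, best_d = 0, float("inf")
    for i, code in enumerate(codebook):
        d = sum((a - b) ** 2 for a, b in zip(vec, code))
        if d < best_d:
            best_i, best_d = i, d
    return best_i


def residual_quantize(vec, codebooks):
    """Quantize vec layer by layer; each layer encodes what the previous layers missed."""
    residual = list(vec)
    recon = [0.0] * len(vec)
    indices = []
    for codebook in codebooks:
        i = nearest_code(residual, codebook)
        indices.append(i)
        recon = [r + c for r, c in zip(recon, codebook[i])]
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices, recon


# Toy 1-D example: a coarse codebook followed by a fine one.
codebooks = [
    [[0.0], [1.0], [2.0]],
    [[-0.25], [0.0], [0.25]],
]
indices, recon = residual_quantize([1.2], codebooks)
```

The first index captures the coarse value and the second corrects the remaining error, which is why a residual Transformer can refine detail one quantization layer at a time.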

8. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that the classifier-free guidance mechanism generates feature prediction distributions under constrained and unconstrained conditions and weights them to amplify the constraint strength of the text and trajectory driving signals, thereby balancing the matching degree and the diversity of the generated motions.
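
Classifier-free guidance is commonly implemented as a weighted extrapolation from the unconditional prediction toward the conditional one. The sketch below uses that standard form. The patent does not state its exact weighting, so the formula and the name `cfg_logits` are assumptions, and the text and trajectory signals are treated here as one joint condition.

```python
import math


def cfg_logits(cond, uncond, scale):
    """Guided logits: uncond + scale * (cond - uncond).

    scale = 0 ignores the driving signals, scale = 1 reproduces the
    conditional prediction, and scale > 1 amplifies the constraint.
    """
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def softmax(logits):
    m = max(logits)
    es = [math.exp(x - m) for x in logits]
    s = sum(es)
    return [e / s for e in es]


cond = [2.0, 0.0]    # logits predicted with text + trajectory conditioning
uncond = [1.0, 0.0]  # logits predicted with the conditioning dropped
guided = cfg_logits(cond, uncond, scale=2.0)
probs = softmax(guided)
```

A larger scale pushes probability toward tokens favored by the driving signals, which improves matching at the cost of diversity.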

9. The motion synthesis method jointly driven by text and trajectory according to claim 1, characterized in that step S7 specifically includes:
Step S71: An input preprocessing module performs standardized preprocessing on the acquired target motion description text and on the trajectory of the key joint path points of the target human body, respectively, to obtain standardized text data and standardized trajectory data adapted to the input format of the motion generation model.
Step S72: The text-trajectory feature encoding fusion module extracts the target text semantic features corresponding to the standardized text data and the target trajectory spatiotemporal features corresponding to the standardized trajectory data, and performs cross-modal dimension alignment and deep fusion on them to obtain the target text-trajectory fusion features.
Step S73: The target text-trajectory fusion features are used as query features and input into the pre-built text-trajectory-motion multimodal retrieval library, and cross-modal retrieval is performed through the retrieval enhancement matching module to obtain target reference motion features with high semantic consistency and high trajectory fit.
Step S74: The text-trajectory spatiotemporal attention module performs multi-source feature fusion on the target text-trajectory fusion features and the target reference motion features, and the trained dual Transformer generation module completes the global modeling and fine-grained optimization of the motion features to generate the target motion quantization feature sequence.
Step S75: The target motion quantization feature sequence undergoes driving signal constraint enhancement and decoding reconstruction through the 2D residual quantization variational autoencoder module and the classifier-free guided constraint module, and a human pose sequence that conforms to the text semantics and matches the trajectory constraints is output.

10. The motion synthesis method jointly driven by text and trajectory according to claim 9, characterized in that, in step S73, cross-modal retrieval is performed using the bidirectional momentum text-trajectory-motion modeling technology: the bidirectional similarity and the component-level similarity between the query features and the features in the retrieval library are calculated; motion features that simultaneously satisfy the bidirectional feature similarity threshold and the component-level similarity variance threshold are selected from the pre-built text-trajectory-motion multimodal retrieval library; and, after the selected features are sorted in descending order of comprehensive similarity, the features of the Top-K samples are taken as the target reference motion features.
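
The filter-then-rank procedure of claim 10 can be sketched as below. The claim does not define how the scores are computed or combined, so this example assumes each candidate already carries a forward similarity, a backward similarity, and a list of component-level similarities, and it takes the comprehensive similarity to be the mean of the two directions. The field names and threshold values are hypothetical.

```python
from statistics import pvariance


def retrieve_top_k(candidates, sim_threshold, var_threshold, k):
    """Keep candidates passing both thresholds, then return the Top-K by comprehensive similarity."""
    kept = []
    for c in candidates:
        both_directions_ok = min(c["fwd"], c["bwd"]) >= sim_threshold
        components_consistent = pvariance(c["components"]) <= var_threshold
        if both_directions_ok and components_consistent:
            comprehensive = 0.5 * (c["fwd"] + c["bwd"])  # assumed combination rule
            kept.append((comprehensive, c["id"]))
    kept.sort(key=lambda item: item[0], reverse=True)
    return [cid for _, cid in kept[:k]]


candidates = [
    {"id": "a", "fwd": 0.90, "bwd": 0.80, "components": [0.85, 0.90]},
    {"id": "b", "fwd": 0.95, "bwd": 0.40, "components": [0.90, 0.90]},  # fails one direction
    {"id": "c", "fwd": 0.70, "bwd": 0.70, "components": [0.20, 0.95]},  # components disagree
    {"id": "d", "fwd": 0.60, "bwd": 0.70, "components": [0.60, 0.70]},
]
top = retrieve_top_k(candidates, sim_threshold=0.5, var_threshold=0.05, k=2)
```

The variance check rejects candidates that match one component (for example, the text) well but another (for example, the trajectory) poorly, even when their overall similarity is high.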