Humanoid robot motion simulation data generation method and device, equipment and medium

By constructing a temporal diffusion generation model, and using the diffusion generation model to learn the distribution characteristics of human motion data and a temporal network model to extract motion semantic features, the problem of high difficulty and cost in acquiring motion simulation data for humanoid robots is solved, and low-cost and efficient motion data generation is achieved.

CN122244586A · Pending · Publication Date: 2026-06-19 · CHINA MOBILE HANGZHOU INFORMATION TECH CO LTD +1

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications (China)
Current Assignee / Owner
CHINA MOBILE HANGZHOU INFORMATION TECH CO LTD
Filing Date
2026-03-05
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

In existing technologies, it is difficult and costly to acquire motion simulation data for humanoid robots, requiring expensive motion capture equipment and complex post-processing procedures.

Method used

By constructing a temporal diffusion generation model, the distribution characteristics of human motion data are learned using the diffusion generation model, and combined with a temporal network model to extract motion semantic features and inter-frame temporal features, motion sequences that conform to the target motion semantic labels are generated.

Benefits of technology

It enables the rapid and low-cost generation of continuous and stable human motion data, reduces the technical barriers to human motion data acquisition, and improves the realism and flexibility of humanoid robot motion simulation.

✦ Generated by Eureka AI based on patent content.

Abstract

This application relates to the field of humanoid robot technology and provides a method, apparatus, device, and medium for generating humanoid robot motion simulation data. The method includes: determining a target motion semantic label from a set of motion semantic labels; and inputting the target motion semantic label into a temporal diffusion generation model to obtain a target motion sequence corresponding to the target motion semantic label. The temporal diffusion generation model is used to model the human motion generation process as a Markov noise addition and denoising process, and is constructed and trained from a diffusion generation model and a temporal network model. The diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps; the temporal network model is used to extract motion semantic features from the motion data and temporal features between adjacent motion frames. This application can quickly generate continuous and stable human motion data, reducing the cost of acquiring human motion data.

Description

Technical Field

[0001] This application relates to the field of humanoid robot technology, and in particular to a method, apparatus, device, and medium for generating motion simulation data for humanoid robots.

Background Technology

[0002] Humanoid robots have garnered widespread attention due to their humanoid appearance and promising application prospects. With the development of large language model technology, interest in humanoid robots has reached new heights. Utilizing the powerful natural language processing and intent understanding capabilities of large language models, humanoid robots have truly ushered in the era of embodied humanoid intelligent agents.

[0003] However, motion simulation for humanoid robots remains a pressing problem. Traditional humanoid robot movement patterns typically follow manually pre-defined motion strategies, such as the angle at which robotic arms or legs are raised. These strategies cannot meet the demands of increasingly complex robot movement scenarios. To make humanoid robot motion more refined and human-like, mainstream methods usually require specialized motion capture equipment (such as sensors) to acquire realistic human motion data. Motion designers then use simulation platforms to create animations from the captured data, for example by using keyframe interpolation techniques to generate different motion sequences. Through continuous iteration and adjustment of the visual effects, a smooth and stable robot motion sequence is ultimately generated. This approach not only requires expensive motion capture equipment but also requires motion designers to iteratively adjust the motion effects, resulting in high costs in time, manpower, and other resources.

Summary of the Invention

[0004] This application provides a method, apparatus, equipment, and medium for generating motion simulation data for humanoid robots, in order to solve the technical problems of high difficulty and high cost in obtaining motion simulation data for humanoid robots in the prior art.

[0005] This application provides a method for generating motion simulation data for a humanoid robot, comprising: determining a target motion semantic label from a set of motion semantic labels; the set of motion semantic labels includes multiple motion semantic labels; inputting the target motion semantic label into a temporal diffusion generation model to obtain a target motion sequence output by the temporal diffusion generation model corresponding to the target motion semantic label; the temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process; the temporal diffusion generation model is constructed and trained by a diffusion generation model and a temporal network model; wherein, the diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps; the temporal network model is used to extract motion semantic features and temporal features between adjacent motion frames from the motion data.

[0006] According to the humanoid robot motion simulation data generation method provided in this application, before determining the target motion semantic label from the motion semantic label set, the method further includes: acquiring a human motion dataset, and setting different motion semantic labels for different types of human motion sequences based on the motion semantic features of the human motion sequences in the human motion dataset, so as to obtain multiple motion semantic labels; and determining a motion semantic label set based on the multiple motion semantic labels.

[0007] According to the method for generating motion simulation data of a humanoid robot provided in this application, the temporal diffusion generation model includes a human motion encoding subnetwork and a human motion decoding subnetwork; the human motion encoding subnetwork is used to extract the motion semantic features corresponding to the forward-noised human motion sequence; the human motion decoding subnetwork is used to receive the motion semantic features extracted by the human motion encoding subnetwork, and combine them with the inter-frame temporal features of the human motion sequence to determine and output the human motion sequence.

[0008] According to the method for generating motion simulation data of a humanoid robot provided in this application, the method further includes: dimensionally aligning the motion semantic labels and the forward-noised time steps, and using the aligned result as the start marker of the motion sequence to constrain the human motion encoding subnetwork to extract motion semantic features corresponding to the motion semantic labels.

[0009] According to the method for generating humanoid robot motion simulation data provided in this application, the human motion decoding subnetwork is used to fuse the motion semantic features extracted by the human motion encoding subnetwork with the inter-frame temporal features, so that the generated human motion sequence has both motion semantic information and visual continuity.

[0010] According to the humanoid robot motion simulation data generation method provided in this application, before inputting the target motion semantic label into the temporal diffusion generation model, the method further includes: training the initial temporal diffusion generation model through a loss function to obtain the temporal diffusion generation model; the loss function is used to constrain the optimization direction of the model parameters; wherein, the loss function includes a reconstruction loss term and a geometric loss term, the reconstruction loss term is used to make the learned sample distribution continuously approximate the original sample distribution; the geometric loss term is used to ensure the inter-frame smoothness of the generated target motion sequence.

[0011] This application also provides a humanoid robot motion simulation data generation device, comprising: a target motion semantic label module, used to determine target motion semantic labels from a set of motion semantic labels; the set of motion semantic labels includes multiple motion semantic labels; a target motion sequence module, used to input the target motion semantic labels into a temporal diffusion generation model to obtain a target motion sequence output by the temporal diffusion generation model corresponding to the target motion semantic labels; the temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process; the temporal diffusion generation model is constructed and trained by the diffusion generation model and the temporal network model; wherein, the diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps; the temporal network model is used to extract motion semantic features and temporal features between adjacent motion frames from the motion data.

[0012] This application also provides an electronic device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the program to implement the humanoid robot motion simulation data generation method described above.

[0013] This application also provides a non-transitory computer-readable storage medium storing a computer program thereon, which, when executed by a processor, implements the humanoid robot motion simulation data generation method as described above.

[0014] This application also provides a computer program product, including a computer program that, when executed by a processor, implements the humanoid robot motion simulation data generation method described above.

[0015] This application provides a method, apparatus, device, and medium for generating humanoid robot motion simulation data. The method includes: determining a target motion semantic label from a set of motion semantic labels; the set of motion semantic labels contains multiple motion semantic labels; inputting the target motion semantic label into a temporal diffusion generation model to obtain a target motion sequence output by the temporal diffusion generation model corresponding to the target motion semantic label; the temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process; the temporal diffusion generation model is constructed and trained by a diffusion generation model and a temporal network model; wherein, the diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps; the temporal network model is used to extract motion semantic features and temporal features between adjacent motion frames from the motion data. Through the above method, this application can utilize publicly available real human motion datasets to learn the distribution characteristics of human motion data and establish a mapping relationship between motion semantic features and human motion data. By specifying a human motion semantic label, continuous and stable human motion data can be quickly generated, reducing the cost of acquiring human motion data.

Attached Figure Description

[0016] To more clearly illustrate the technical solutions in this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0017] Figure 1 is a flowchart illustrating the method for generating motion simulation data of a humanoid robot provided in an embodiment of this application.

[0018] Figure 2 is a schematic diagram of the overall framework of the humanoid robot motion simulation data generation method provided in the embodiments of this application.

[0019] Figure 3 is a schematic diagram of the network architecture of the temporal diffusion generation model provided in the embodiments of this application.

[0020] Figure 4 is a schematic diagram of the structure of the humanoid robot motion simulation data generation device provided in the embodiments of this application.

[0021] Figure 5 is a schematic diagram of the physical structure of the electronic device provided in the embodiments of this application.

Detailed Implementation

[0022] To make the objectives, technical solutions, and advantages of this application clearer, the technical solutions of this application will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this application, not all embodiments. Based on the embodiments of this application, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this application.

[0023] In related technologies, it is usually necessary to use specialized equipment to obtain real human motion data, and certain post-processing procedures are required to obtain the final human motion data, such as joint movement posture information.

[0024] Based on this, this application provides a method for generating humanoid robot motion simulation data. It can utilize publicly available real human motion datasets to learn the distribution characteristics of human motion data and establish a mapping relationship between motion semantic features (such as walking, running, etc.) and human motion data. By specifying a human motion semantic label, continuous and stable human motion data can be quickly generated, reducing the cost of acquiring human motion data.

[0025] Referring to Figure 1, Figure 1 is a flowchart illustrating the method for generating motion simulation data of a humanoid robot according to an embodiment of this application. In this embodiment, the method for generating motion simulation data of a humanoid robot may include steps S110 to S120, each of which is detailed below:

S110: Determine the target motion semantic label from the motion semantic label set; the motion semantic label set contains multiple motion semantic labels.

[0026] S120: Input the target motion semantic label into the temporal diffusion generation model to obtain the target motion sequence output by the temporal diffusion generation model corresponding to the target motion semantic label; the temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process; the temporal diffusion generation model is constructed and trained by the diffusion generation model and the temporal network model.

[0027] In this embodiment, the humanoid robot motion simulation data generation method can utilize a temporal diffusion generation model to generate the corresponding target motion sequence from a target motion semantic label, providing data support for subsequent humanoid robot motion simulation. Specifically, the temporal diffusion generation model combines the advantages of diffusion generation models and temporal network models, and models the human motion generation process as a Markov noise addition and denoising process.

[0028] Specifically, in step S110, this embodiment can select a target motion semantic label from a pre-set set of motion semantic labels containing multiple motion semantic labels. A motion semantic label can be understood as an abstract description of a motion type or feature, such as "running," "jumping," or "grabbing."

[0029] Motion semantic labels concisely express specific motion meanings and provide clear semantic guidance for the subsequent generation of specific motion sequences; that is, they tell the temporal diffusion generation model what type of motion data the user expects to generate.

[0030] In step S120, the target motion semantic label determined in step S110 is input into the temporal diffusion generation model. After processing, the model can output the target motion sequence corresponding to the target motion semantic label.

[0031] The temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process. Specifically, in the noise addition stage, the temporal diffusion generation model can gradually add noise to the original motion data, making it gradually blurred; in the denoising stage, the temporal diffusion generation model can gradually remove noise from the noise-added data according to the input target motion semantic label, and finally recover and output the target motion sequence corresponding to the target motion semantic label.
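The noise addition stage described above can be sketched as follows. This is a minimal, dependency-free illustration that assumes a DDPM-style Gaussian forward process with a linear variance schedule; the application does not specify the noise schedule, so the function names (`linear_betas`, `alpha_bars`, `forward_noise`) and the schedule values are hypothetical.

```python
import math
import random

def linear_betas(num_steps, beta_start=1e-4, beta_end=0.02):
    """Per-step noise variances (assumed linear schedule, num_steps >= 2)."""
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]

def alpha_bars(betas):
    """Cumulative products of (1 - beta_t): how much signal survives up to step t."""
    out, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        out.append(prod)
    return out

def forward_noise(x0, t, abar, rng):
    """Draw a noised sequence x_t from the clean sequence x0 in one shot.

    x0 is a list of frames, each frame a list of floats (e.g. joint values).
    Returns the noised frames and the noise that was mixed in.
    """
    signal = math.sqrt(abar[t])
    noise_scale = math.sqrt(1.0 - abar[t])
    eps = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in x0]
    xt = [[signal * v + noise_scale * e for v, e in zip(frame, ef)]
          for frame, ef in zip(x0, eps)]
    return xt, eps
```

As `t` grows, `abar[t]` shrinks toward zero, so the original motion is progressively drowned out by noise, which matches the "gradually blurred" behavior described for the noise addition stage.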

[0032] Furthermore, the temporal diffusion generation model in this embodiment is constructed and trained from a diffusion generation model and a temporal network model. The diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps. The temporal network model is used to extract motion semantic features and temporal features between adjacent motion frames from the motion data.

[0033] Diffusion generative models are used to learn the distribution characteristics of human motion data. Specifically, human motion data exhibits certain patterns and distributions at different time steps. By learning from a large amount of human motion data, diffusion generative models can grasp these distribution characteristics and thus generate motion data that conforms to this distribution at different time steps.

[0034] Temporal network models are used to extract motion semantic features from motion data and temporal features between adjacent motion frames. Specifically, motion semantic features help the model understand the essential characteristics of different motions, while temporal features between adjacent motion frames ensure that the generated motion sequence is coherent and reasonable, making the generated motion appear natural and smooth.

[0035] The above embodiment provides a method for generating motion simulation data for a humanoid robot, which can achieve the following beneficial effects:

1. Diverse generated motion sequences: Because the diffusion generative model can learn the distribution characteristics of human motion data, it can generate different motion data within a certain range, resulting in diverse motion sequences. This is crucial for motion simulation of humanoid robots, as robots may need to perform various motion tasks in practical applications, and diverse motion sequences can better simulate real-world scenarios.

[0036] 2. The motion sequence possesses coherence and rationality: The introduction of a temporal network model ensures a reasonable temporal relationship between adjacent motion frames in the generated motion sequence, making the generated motion appear natural and smooth, conforming to the laws of human movement. This helps improve the realism and reliability of humanoid robot motion simulation, enabling robots to complete various tasks more naturally.

[0037] 3. Semantic-driven generation: By using target motion semantic labels as input, specific types of motion sequences can be generated according to user needs. Users only need to specify the desired motion type, and the model can automatically generate the corresponding motion data. This semantic-driven approach makes motion data generation more flexible and efficient, allowing users to customize motion simulation data according to different application scenarios.

[0038] Based on the above embodiments, before determining the target motion semantic label from the motion semantic label set, the method may further include: acquiring a human motion dataset and, based on the motion semantic features of the human motion sequences in the dataset, setting different motion semantic labels for different types of human motion sequences to obtain multiple motion semantic labels; and determining a motion semantic label set based on the multiple motion semantic labels.

[0039] In this embodiment, before determining the target motion semantic label from the motion semantic label set, it is also necessary to construct a motion semantic label set. This process is the foundation of the entire motion simulation data generation method. By analyzing and processing the human motion dataset, different types of motion sequences are assigned labels with clear semantics, thereby forming a label set that can be selected subsequently.

[0040] Specifically, a large amount of human motion data is first collected. This data can be obtained in various ways, such as actual human motion data recorded by motion capture devices and motion information extracted through video analysis.

[0041] It should be noted that the human motion dataset should cover as many different types of human motion as possible. Rich and realistic human motion data can ensure the comprehensiveness and accuracy of subsequent labeling, so as to accurately identify and distinguish different types of motion.

[0042] This embodiment can also analyze human motion sequences in a human motion dataset and extract their motion semantic features. Motion semantic features can include the type of motion (such as walking, running, climbing, etc.), the direction of motion, and the speed of motion. Based on these features, different motion semantic labels are assigned to different types of human motion sequences.

[0043] For example, a movement sequence in which the legs move forward at a certain speed can be labeled as "walking"; a movement sequence in which the legs jump rapidly can be labeled as "jumping".

[0044] Finally, all the pre-set motion semantic labels are integrated to form a motion semantic label set. This set includes semantic labels for various types of motion, providing a selectable label library. When generating motion simulation data, users can choose the appropriate target motion semantic label from the set according to their specific needs.
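The label set construction and the selection in step S110 can be sketched as follows. This is only an illustration under assumptions: the application does not define a data format, so the `(sequence_id, label)` input pairs and the helper names `build_label_set` and `select_target_label` are hypothetical.

```python
from collections import defaultdict

def build_label_set(annotated_sequences):
    """Group annotated sequences by label and assign each label an index.

    annotated_sequences: iterable of (sequence_id, label) pairs.
    Labels are normalized (trimmed, lower-cased) so that "Walking" and
    "walking " count as the same motion type.
    """
    by_label = defaultdict(list)
    for seq_id, label in annotated_sequences:
        by_label[label.strip().lower()].append(seq_id)
    labels = sorted(by_label)
    label_to_index = {label: i for i, label in enumerate(labels)}
    return label_to_index, dict(by_label)

def select_target_label(label_to_index, requested):
    """Step S110: pick the target label, rejecting labels outside the set."""
    key = requested.strip().lower()
    if key not in label_to_index:
        raise KeyError(f"unknown motion semantic label: {requested!r}")
    return key, label_to_index[key]
```

Returning an integer index alongside the label is a design convenience: a model typically consumes the label as an index into an embedding table rather than as a string.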

[0045] Accurate motion semantic labels provide more precise semantic information for subsequent motion sequence generation. The model can generate motion sequences that better meet user needs based on these explicit labels. Each label represents a specific type of motion, and users can clearly understand the correspondence between the input labels and the generated motion sequences.

[0046] For example, if a user needs to generate a "running" motion sequence, the temporal diffusion generation model can better understand the user's needs and generate a running motion sequence as the target motion sequence because it has an accurate "running" label as the target motion semantic label.

[0047] In this embodiment, by assigning specific semantic labels to different types of motion sequences, complex motion data is abstracted and classified, so that concise labels can be used to represent specific motion types in the future.

[0048] Referring to Figure 2, Figure 2 is a schematic diagram of the overall framework of the humanoid robot motion simulation data generation method provided in the embodiments of this application.

[0049] First, this embodiment is based on publicly available real human motion datasets. According to the semantic features of human motion, such as walking and running, motion semantic labels are set for different types of human motion sequences, as shown in the data preparation stage in Figure 2(a).

[0050] Then, in this embodiment, the diffusion generation model is combined with the temporal network model to obtain a temporal diffusion generation model. This model can predict multiple motion sequences through forward diffusion. Specifically, the diffusion generation model is used to learn the distribution characteristics of human motion data, while the temporal network model is used to extract the motion semantic features of the human motion sequences and the temporal features between adjacent motion frames, as shown in the model construction stage in Figure 2(b).

[0051] Finally, when it is necessary to generate a specific type of human motion data (i.e., a target motion sequence), it is only necessary to specify the motion semantic type (i.e., the target motion semantic label), such as walking. In this embodiment, the motion distribution features and semantic features extracted by the temporal diffusion generation model can be used to encode the semantic labels and constraints into a condition vector through distribution sampling, which serves as the guiding signal for the subsequent model. Through reverse denoising by the temporal diffusion generation model, a continuous and stable human motion sequence can be generated as the target motion sequence, as shown in the human motion sequence generation stage in Figure 2(c).
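The label-conditioned reverse denoising described above can be sketched as follows. This is a toy illustration, not the application's model: it assumes a DDPM-style sampler with a linear schedule, and it replaces the trained network with a hypothetical `oracle_noise` function that already knows one ideal sequence per label (`TOY_TARGETS`). All names and values are assumptions made for the example.

```python
import math
import random

# Hypothetical stand-in for what a trained model has learned: each label
# maps to one "ideal" sequence of frames (two joint values per frame).
TOY_TARGETS = {
    "walking": [[0.0, 0.1], [0.2, 0.3], [0.4, 0.5]],
    "running": [[0.0, 0.5], [0.6, 1.0], [1.2, 1.5]],
}

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear betas and their cumulative (1 - beta) products (num_steps >= 2)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    abar, prod = [], 1.0
    for b in betas:
        prod *= 1.0 - b
        abar.append(prod)
    return betas, abar

def oracle_noise(x_t, t, label, abar):
    """Placeholder for the trained network: exact noise estimate for the toy target."""
    mu = TOY_TARGETS[label]
    s, n = math.sqrt(abar[t]), math.sqrt(1.0 - abar[t])
    return [[(x - s * m) / n for x, m in zip(fx, fm)] for fx, fm in zip(x_t, mu)]

def sample(label, num_steps=1000, seed=0):
    """Start from pure noise and denoise step by step, guided by the label."""
    rng = random.Random(seed)
    betas, abar = make_schedule(num_steps)
    x = [[rng.gauss(0.0, 1.0) for _ in frame] for frame in TOY_TARGETS[label]]
    for t in reversed(range(num_steps)):
        eps = oracle_noise(x, t, label, abar)
        coef = betas[t] / math.sqrt(1.0 - abar[t])
        inv = 1.0 / math.sqrt(1.0 - betas[t])
        sigma = math.sqrt(betas[t]) if t > 0 else 0.0  # no fresh noise on the last step
        x = [[inv * (v - coef * e) + sigma * rng.gauss(0.0, 1.0)
              for v, e in zip(fx, fe)] for fx, fe in zip(x, eps)]
    return x
```

Because the placeholder noise estimate is exact, `sample("walking")` recovers the toy walking sequence up to floating-point error. A real trained network only approximates the noise, so its samples would vary around the learned motion distribution instead.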

[0052] Based on any of the above embodiments, the temporal diffusion generation model includes a human motion encoding subnetwork and a human motion decoding subnetwork; the human motion encoding subnetwork is used to extract the motion semantic features corresponding to the forward-noised human motion sequence; the human motion decoding subnetwork is used to receive the motion semantic features extracted by the human motion encoding subnetwork, and combine them with the inter-frame temporal features of the human motion sequence to determine and output the human motion sequence.

[0053] In this embodiment, the temporal diffusion generation model is further refined into a human motion encoding subnetwork and a human motion decoding subnetwork. These two subnetworks work together to process the forward-noised human motion sequence, combining motion semantic features and inter-frame temporal features to finally output a human motion sequence that meets the requirements.

[0054] Specifically, the human motion encoding subnetwork receives a forward-noised human motion sequence as input. In this embodiment, forward noising refers to gradually adding noise to the original human motion sequence during Markov noise addition, making it increasingly blurred. The main task of the human motion encoding subnetwork is to extract motion semantic features from this noisy human motion sequence. Motion semantic features can be used to describe the essential characteristics of the motion, such as the type of motion (e.g., walking, running, grasping), the direction of motion, and the amplitude of motion.

[0055] The human motion decoding subnetwork receives the motion semantic features extracted by the human motion encoding subnetwork and combines them with the inter-frame temporal features of the human motion sequence. The inter-frame temporal features reflect the temporal relationship and change patterns between adjacent motion frames, ensuring the coherence and rationality of the generated motion sequence. Based on this information, the human motion decoding subnetwork gradually removes noise from the noisy motion sequence during Markov denoising, ultimately determining and outputting the human motion sequence. The semantic information extracted by the encoding subnetwork and the inter-frame temporal features are integrated, and a natural and smooth human motion sequence corresponding to the target motion semantic label is recovered through denoising operations.

[0056] The human motion encoding subnetwork focuses on extracting motion semantic features, accurately capturing the essential information of the input noisy motion sequence. Building upon this, the human motion decoding subnetwork combines inter-frame temporal features for denoising and generation, making the generated motion sequence more consistent with the requirements of the target motion semantic label and improving the accuracy of the generated results. The introduction of inter-frame temporal features ensures that adjacent motion frames in the generated motion sequence have reasonable temporal relationships and trends, avoiding abrupt and discontinuous motion.

[0057] In this embodiment, the generated motion appears more natural and fluid, conforming to the actual state of human movement. For example, when generating a "walking" motion sequence, it ensures that the transitions between each step are natural and the rhythm of the steps is reasonable.

[0058] In summary, by processing motion semantic features and inter-frame temporal features separately, the temporal diffusion generation model can better adapt to various complex motion scenarios. For motions with complex combinations or variations of actions, the model can more accurately capture their semantic information and temporal relationships, generating more realistic motion sequences.

[0059] Furthermore, the human motion decoding subnetwork is used to fuse the motion semantic features extracted by the human motion encoding subnetwork with the inter-frame temporal features, so that the generated human motion sequence has both motion semantic information and visual continuity.

[0060] As mentioned above, inter-frame temporal features reflect the temporal relationships and patterns of change between adjacent motion frames. In human movement, each action is composed of a series of consecutive frames. Inter-frame temporal features ensure that the transitions between these frames are natural and smooth, avoiding abrupt changes or discontinuities in the movements. For example, during running, the lifting and lowering of the legs, the forward leaning and backward swinging of the body, and other actions have specific temporal relationships between adjacent frames.

[0061] In this embodiment, the human motion decoding subnetwork fuses these two features. Optionally, it can use fully connected layers, recurrent neural networks (RNNs), or long short-term memory networks (LSTMs). These network structures can organically combine motion semantic features and inter-frame temporal features based on their characteristics. During the fusion process, the human motion decoding subnetwork can learn how to adjust the inter-frame temporal relationship according to the motion semantic features, and at the same time optimize the expression of motion semantics based on the inter-frame temporal features.
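One way to picture the fully connected fusion option is the sketch below. It rests on assumptions not stated in the application: the inter-frame temporal feature is taken to be the difference between consecutive frames, fusion is a concatenation followed by a single linear layer, and the names `frame_deltas`, `linear`, and `fuse` are hypothetical.

```python
def frame_deltas(frames):
    """Assumed inter-frame feature: each frame minus the previous one.

    The first frame has no predecessor, so its delta is all zeros.
    """
    deltas = [[0.0] * len(frames[0])]
    for prev, cur in zip(frames, frames[1:]):
        deltas.append([c - p for p, c in zip(prev, cur)])
    return deltas

def linear(x, weight, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]

def fuse(semantic, frames, weight, bias):
    """Concatenate the semantic vector with each frame's delta, then project."""
    return [linear(semantic + delta, weight, bias) for delta in frame_deltas(frames)]
```

In a trained model the weights would be learned; here they are plain inputs so the data flow stays visible. Each weight row must have as many entries as the semantic vector plus one frame.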

[0062] In this embodiment, by incorporating inter-frame temporal features, the generated motion sequence appears visually continuous. The motion looks natural and smooth, without sudden pauses or jumps. The motion sequence, possessing both semantic motion information and visual continuity, makes the motion simulation of the humanoid robot more realistic.

[0063] In some embodiments, the method for generating humanoid robot motion simulation data may further include: dimensionally aligning the motion semantic label and the forward-noising time step, and using the aligned result as the start marker of the motion sequence, so as to constrain the human motion coding subnetwork to extract the motion semantic features corresponding to the motion semantic label.

[0064] The purpose of this embodiment is to constrain the feature extraction process of the human motion coding subnetwork so that it can extract motion semantic features corresponding to the input motion semantic labels more accurately, thereby improving the matching degree between the subsequently generated human motion sequence and the target motion semantics.

[0065] Using the dimension-aligned result as the start marker of the motion sequence can guide the human motion encoding subnetwork to pay more attention to information related to the input motion semantic label during the extraction of motion semantic features, thereby constraining it to extract motion semantic features corresponding to the motion semantic label.

[0066] By dimensionally aligning the motion semantic label and the forward-noising time step and using the result as the start marker, the human motion encoding subnetwork can focus more clearly on feature information related to the target motion semantic label. This reduces interference from irrelevant information, so that the extracted motion semantic features more accurately reflect the essential characteristics of the target motion.

[0067] Based on any of the above embodiments, before the target motion semantic label is input into the temporal diffusion generation model, the method may further include: training an initial temporal diffusion generation model using a loss function to obtain the temporal diffusion generation model. The loss function is used to constrain the optimization direction of the model parameters and includes a reconstruction loss term and a geometric loss term. The reconstruction loss term is used to make the learned sample distribution continuously approximate the original sample distribution. The geometric loss term is used to ensure the inter-frame smoothness of the generated target motion sequence.

[0068] In humanoid robot motion simulation data generation methods, training is necessary to ensure that the temporal diffusion generation model can accurately generate human motion sequences that meet the requirements. The core of training lies in using an appropriate loss function to constrain the optimization direction of the model parameters, enabling the model to learn good feature representation and generation capabilities.

[0069] In this embodiment, the loss function used when training the initial temporal diffusion generation model includes a reconstruction loss term and a geometric loss term. These two terms constrain the model from different perspectives: The main function of the reconstruction loss term is to ensure that the sample distribution learned by the model continuously approximates the original sample distribution. In humanoid robot motion simulation scenarios, the original sample distribution is the distribution presented by real human motion data. During the learning process, the model attempts to generate output based on the input, and the reconstruction loss term measures the difference between the generated samples and the original samples. Through training with the reconstruction loss term, the model can better capture the features and patterns of the original human motion data, making the generated motion sequence more similar to real human motion in overall characteristics.

[0070] The geometric loss term ensures the smoothness of the generated target motion sequence between frames. In human motion, the changes between adjacent frames are usually continuous and smooth, without sudden jumps or abrupt changes. The geometric loss term constrains the geometric relationship between adjacent frames in the generated motion sequence, measuring the degree of difference between adjacent frames. Through training with the geometric loss term, the motion sequence generated by the model transitions naturally between adjacent frames, avoiding inconsistencies.

[0071] It should be noted that the initial temporal diffusion generation model is trained using a loss function that includes reconstruction loss and geometric loss terms. During training, the model continuously adjusts its parameters. As training progresses, the model's parameters optimize in a direction that satisfies the requirements of the reconstruction loss and geometric loss terms, ultimately resulting in a temporal diffusion generation model capable of generating high-quality human motion sequences.

[0072] Through the above embodiments, the method provided in this application has efficient data generation capabilities. After the temporal diffusion generation model is trained, the required human motion joint data can be directly generated without repeatedly acquiring real human motion data. In addition, this application embodiment can generate motion data of different motion styles through sampling, which is simple to operate and does not require building a complex motion capture environment. It only requires specifying the target motion type to automatically generate the corresponding motion sequence.

[0073] Through the above embodiments, the method provided by the embodiments of this application has the following effects: 1. Reducing the technical barriers and costs of acquiring human motion data: This application's embodiments utilize a temporal diffusion generation model to successfully establish a mapping relationship between high-dimensional motion semantic parameters and human motion data. By simply specifying the high-dimensional parameters, realistic and detailed human motion simulation data can be quickly generated. This process eliminates the need for expensive motion capture equipment, greatly simplifying the human motion data acquisition process and reducing the technical barriers and costs associated with generating human motion data.

[0074] 2. Provides multi-scenario motion simulation data for humanoid robots: It can quickly generate a large amount of motion simulation data of different motion types. This motion simulation data can be used for training, simulating, and testing the motion capabilities of humanoid robots in different scenarios, helping to improve their motion performance and adaptability in various real-world situations.

[0075] 3. Can be integrated into different simulation software: The algorithm of this application embodiment can be set as a plug-in, which can be quickly integrated into different simulation software.

[0076] To further illustrate the technical solution of the proposed method for generating motion simulation data for humanoid robots, the following detailed explanation is provided in conjunction with specific embodiments.

[0077] Step 1: Prepare human motion data.

[0078] Step 1-1: Parametric representation of human motion data.

[0079] To make it easier for the network model to extract semantic features of human motion, while keeping the generated human motion sequences realistic and detailed, this embodiment uses a parametric human body model and represents each frame of human motion data as $(\beta, \theta)$, where $\beta$ denotes the body shape parameters of the human body model and $\theta$ denotes its pose parameters, which are a parameterized representation of joint angles defined on the human motion joint tree.

[0080] Based on $(\beta, \theta)$, complete 3D human body model data can be obtained using a skinning function. Since the semantic features of human motion are usually characterized directly by the pose parameters, this embodiment fixes the body shape parameters and considers only changes in the pose parameters. A continuous segment of human motion data can therefore be represented as $X = \{\beta, \theta^{1}, \theta^{2}, \ldots, \theta^{L}\}$.

[0081] Here, $X$ is the human motion sequence, $\beta$ is the fixed body shape parameter vector, $\theta^{i}$ is the human motion pose parameter vector of the $i$-th frame, and $L$ is the sequence length.
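The representation in step 1-1 can be illustrated with a small data structure. The following is a minimal sketch only: the class name `MotionSequence`, the helper `make_rest_sequence`, and the sizes `NUM_SHAPE_PARAMS` and `NUM_JOINTS` are hypothetical, and the axis-angle pose encoding is an assumption, since the source does not state which parametric human body model or dimensions are used.

```python
from dataclasses import dataclass
from typing import List

# Assumed sizes for illustration only; the source does not state them.
NUM_SHAPE_PARAMS = 10        # length of the body shape vector (hypothetical)
NUM_JOINTS = 24              # joints in the joint tree (hypothetical)
POSE_DIM = NUM_JOINTS * 3    # one axis-angle rotation per joint (assumed encoding)


@dataclass
class MotionSequence:
    """A motion clip: one fixed shape vector plus one pose vector per frame."""
    shape: List[float]          # fixed body shape parameters
    poses: List[List[float]]    # poses[i] holds the pose parameters of frame i

    def __post_init__(self) -> None:
        # Reject vectors whose length does not match the assumed model sizes.
        if len(self.shape) != NUM_SHAPE_PARAMS:
            raise ValueError("unexpected shape vector length")
        for pose in self.poses:
            if len(pose) != POSE_DIM:
                raise ValueError("unexpected pose vector length")

    @property
    def length(self) -> int:
        """Sequence length L (number of frames)."""
        return len(self.poses)


def make_rest_sequence(num_frames: int) -> MotionSequence:
    """Build a clip in which every joint rotation is zero in every frame."""
    return MotionSequence(
        shape=[0.0] * NUM_SHAPE_PARAMS,
        poses=[[0.0] * POSE_DIM for _ in range(num_frames)],
    )
```

Keeping the shape vector outside the per-frame list mirrors the choice to fix the body shape and let only the pose vary from frame to frame.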

[0082] Step 1-2: Preprocess the human motion data.

[0083] In this embodiment, a publicly available real motion scan dataset is selected as the candidate dataset. By using the motion semantic features contained in the human motion data, motion semantic labels are assigned to different types of human motion sequences, while invalid human motion data is removed.

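[0084] The motion semantic labels are one-hot encoded in this embodiment. Combined with step 1-1, the dataset used in this embodiment can be represented as a set of pairs $(X, c)$, where $X$ is a human motion sequence and $c$ is its one-hot motion semantic label.

[0085] Because the body shape parameters are fixed, $X$ can be further simplified to $X = \{\theta^{1}, \theta^{2}, \ldots, \theta^{L}\}$.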
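The preprocessing in step 1-2 (assigning one-hot motion semantic labels and removing invalid data) might look like the sketch below. The label names in `LABEL_SET`, the function names, and in particular the rule used to decide that a clip is invalid are assumptions for illustration; the source does not list the motion types or define invalid data.

```python
from typing import List, Optional, Tuple

# Hypothetical label set; the source does not list the motion types.
LABEL_SET = ["walk", "run", "jump"]

Clip = Tuple[Optional[str], List[List[float]]]      # (label name, pose frames)
Sample = Tuple[List[List[float]], List[int]]        # (pose frames, one-hot label)


def one_hot(label: str, label_set: List[str]) -> List[int]:
    """Return the one-hot vector for `label` within `label_set`."""
    index = label_set.index(label)  # raises ValueError for unknown labels
    return [1 if i == index else 0 for i in range(len(label_set))]


def build_dataset(clips: List[Clip], label_set: List[str]) -> List[Sample]:
    """Pair each valid clip with its one-hot label and drop invalid clips.

    A clip is treated as invalid here if it has no label, an unknown label,
    or no frames. This validity rule is an assumption for illustration.
    """
    dataset: List[Sample] = []
    for label, poses in clips:
        if label is None or label not in label_set or not poses:
            continue
        dataset.append((poses, one_hot(label, label_set)))
    return dataset
```

For example, `build_dataset([("walk", [[0.0]]), (None, [[0.0]])], LABEL_SET)` keeps only the first clip.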

[0086] Step 2: Build and train the temporal diffusion generation model.

[0087] The temporal diffusion generation model consists of a human motion encoding subnetwork and a human motion decoding subnetwork. Figure 3 is a schematic diagram of the network architecture of the temporal diffusion generation model provided in the embodiments of this application.

[0088] Step 2-1: The temporal diffusion generation model models the process of human motion generation as a Markov noise addition / denoising process with $T$ time steps, in which the human motion sequences at adjacent time steps are related by $X_t = \sqrt{\alpha_t}\, X_{t-1} + \sqrt{1-\alpha_t}\, \epsilon_t$.

[0089] Here, $X_t$ is the human motion sequence after noise has been added $t$ times in succession, $X_{t-1}$ is the human motion sequence at the previous time step, $\alpha_t$ is the parameter controlling the noise level, and $\epsilon_t$ is the added Gaussian noise. From this, the correspondence between the original motion sequence $X_0$ and $X_t$ can be derived as $X_t = \sqrt{\bar{\alpha}_t}\, X_0 + \sqrt{1-\bar{\alpha}_t}\, \epsilon$.

[0090] Here, $\bar{\alpha}_t = \prod_{s=1}^{t} \alpha_s$, and $\epsilon$ is Gaussian noise of the same kind as $\epsilon_t$.
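The Markov noise addition of step 2-1 can be sketched as follows, assuming the standard diffusion form in which each step scales the previous sequence by the square root of a noise-level parameter `alpha_t` and adds Gaussian noise scaled by the square root of `1 - alpha_t`. The linear schedule in `linear_alphas`, its default endpoints, and the flattening of a motion sequence into a plain list of floats are assumptions; the source does not specify a schedule.

```python
import math
import random
from typing import List


def linear_alphas(num_steps: int, beta_start: float = 1e-4,
                  beta_end: float = 0.02) -> List[float]:
    """Return alpha_t = 1 - beta_t for a linear beta schedule (assumed)."""
    if num_steps == 1:
        return [1.0 - beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [1.0 - (beta_start + i * step) for i in range(num_steps)]


def cumulative_alphas(alphas: List[float]) -> List[float]:
    """Running product of the alphas; entry t-1 is the product up to step t."""
    out, prod = [], 1.0
    for a in alphas:
        prod *= a
        out.append(prod)
    return out


def noise_step(x_prev: List[float], alpha_t: float,
               rng: random.Random) -> List[float]:
    """One Markov step from the sequence at step t-1 to the sequence at step t."""
    keep, add = math.sqrt(alpha_t), math.sqrt(1.0 - alpha_t)
    return [keep * v + add * rng.gauss(0.0, 1.0) for v in x_prev]


def noise_direct(x0: List[float], alpha_bar_t: float,
                 rng: random.Random) -> List[float]:
    """Jump from the original sequence straight to step t in closed form."""
    keep, add = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
    return [keep * v + add * rng.gauss(0.0, 1.0) for v in x0]
```

The closed-form jump is what makes training practical: a training sample at an arbitrary time step can be drawn in one call rather than by iterating `noise_step` from the start.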

[0091] Step 2-2: Construction of the human motion coding sub-network. The function of the human motion coding sub-network is to extract the motion semantic features corresponding to the human motion sequence with added noise.

[0092] So that the motion semantic features extracted by the human motion encoding subnetwork characterize the human motion at a specific time step and of the corresponding motion type, the encoding subnetwork takes as input the motion semantic label $c$, the time step $t$, and the human motion sequence $X_t$ corresponding to time step $t$, where $c$ and $t$ are the constraints for generating the human motion sequence. As shown in the left part of Figure 3, the encoding subnetwork is, for example, a Transformer Encoder module.

[0093] Because the data dimensions of $c$ and $t$ differ significantly, the human motion encoding subnetwork first maps them to a common dimension using a linear mapping (e.g., a linear transformation) and a multilayer perceptron (MLP), and then adds the results dimension by dimension to obtain the constraint vector $z$.

[0094] Similarly, a linear mapping is used to bring $X_t$ to the same data dimension as $z$. The constraint vector $z$ is then concatenated in front of the mapped motion sequence, so that $z$ serves as the start marker of the sequence, i.e., the constraint when generating a human motion sequence.

[0095] Furthermore, to maintain the temporal relationship between adjacent sequence frames, this embodiment employs relative position encoding, and sequence position information is added to the concatenated sequence.

[0096] Finally, this embodiment uses a temporal network model, including but not limited to a Transformer, to extract the feature sequence corresponding to the position-encoded sequence.

[0097] Optionally, because the first element of the feature sequence is used only to identify the beginning of the sequence, this embodiment can discard it and use the remaining feature sequence for sequence generation.
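The way the encoding subnetwork builds its input in step 2-2 can be sketched as below: the label and the time step are brought to a common dimension and added to form a constraint vector, which is placed in front of the mapped frames as the start marker and dropped again after feature extraction. This is a simplified sketch under stated assumptions: the sinusoidal `timestep_embedding` stands in for the MLP mentioned in the text, the weights are passed in by the caller rather than learned, and position encoding and the temporal network itself are omitted. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def linear(x: Vector, weight: List[Vector], bias: Vector) -> Vector:
    """y = W x + b, with the weight stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def timestep_embedding(t: int, dim: int) -> Vector:
    """Sinusoidal embedding of the time step (assumed; dim must be even)."""
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(t * f) for f in freqs] + [math.cos(t * f) for f in freqs]


def build_encoder_input(label_one_hot: Vector, t: int,
                        noisy_frames: List[Vector],
                        w_label: List[Vector], b_label: Vector,
                        w_frame: List[Vector], b_frame: Vector) -> List[Vector]:
    """Return [constraint vector] + mapped frames, all of the same dimension."""
    d_model = len(b_label)
    label_vec = linear(label_one_hot, w_label, b_label)
    time_vec = timestep_embedding(t, d_model)
    z = [a + b for a, b in zip(label_vec, time_vec)]      # constraint vector
    frames = [linear(f, w_frame, b_frame) for f in noisy_frames]
    return [z] + frames                                   # z is the start marker


def drop_start_marker(features: List[Vector]) -> List[Vector]:
    """Discard the first output, which only marks the start of the sequence."""
    return features[1:]
```

Prepending the constraint as a token lets every frame attend to the label and time step through the temporal network without changing the per-frame input dimension.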

[0098] Step 2-3: Construction of the human motion decoding subnetwork.

[0099] The function of the human motion decoding subnetwork is to reconstruct the original human motion sequence from the motion semantic features extracted by the encoding subnetwork. Its inputs are the motion semantic feature sequence and the motion timing information. As shown in the right part of Figure 3, the decoding subnetwork is, for example, a Transformer Decoder module.

[0100] Similar to the human motion encoding subnetwork, this embodiment first uses a linear mapping (e.g., a linear transformation) to map the motion timing information to the data dimension of the motion semantic feature sequence and adds position information to it. The two inputs are then added dimension by dimension, and the fusion result is used as the query vector of the multi-head attention module in the Transformer network, with the motion semantic feature sequence serving as the key vectors and value vectors. Through correlation calculation, a human motion sequence consisting of pose parameters is output frame by frame; this sequence is the human motion simulation data.
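The fusion in the decoding subnetwork can be illustrated with plain scaled dot-product attention. The sketch below is one possible reading of step 2-3, not the patented network: it assumes the query is the element-wise sum of the feature sequence and the (already mapped) timing information, uses the feature sequence as both keys and values, and omits the learned projections, multiple heads, and output layer of a real Transformer decoder. The function names are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(queries: Matrix, keys: Matrix) -> Matrix:
    """Row i holds the weights that query i assigns to every key."""
    scale = math.sqrt(len(keys[0]))
    return [
        softmax([sum(q * k for q, k in zip(q_row, k_row)) / scale
                 for k_row in keys])
        for q_row in queries
    ]


def attend(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """Weighted sum of the values for each query."""
    weights = attention_weights(queries, keys)
    dim = len(values[0])
    return [[sum(w * v[d] for w, v in zip(row, values)) for d in range(dim)]
            for row in weights]


def fuse_and_decode(features: Matrix, timing: Matrix) -> Matrix:
    """Fuse features with timing to form queries, then attend over features."""
    queries = [[f + p for f, p in zip(f_row, p_row)]
               for f_row, p_row in zip(features, timing)]
    return attend(queries, features, features)
```

Each output frame is a mixture of the semantic features, with the mixture weights steered by the frame's timing information, which is how semantic content and inter-frame order end up in the same output.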

[0101] In addition, as shown in Figure 3, this embodiment can also use a parametric human body model to convert the pose parameter sequence output by the network into a more intuitive three-dimensional human body model sequence.

[0102] Step 2-4: Training the temporal diffusion generation model. To ensure that the temporal diffusion generation model effectively removes the noise introduced during sampling when generating human motion sequences, while maintaining sufficiently realistic and detailed visual effects, this embodiment introduces a set of loss terms to constrain the optimization direction of the model parameters. The loss function in this embodiment consists of two parts: a reconstruction loss term $\mathcal{L}_{rec}$ and a geometric loss term $\mathcal{L}_{geo}$.

[0103] $\mathcal{L}_{rec}$ can be represented as $\mathcal{L}_{rec} = \mathbb{E}_{X_0 \sim q(X_0)} \left\| X_0 - D(E(X_t, t, c)) \right\|_2^2$, where $X_t$, $t$, and $c$ are the noised motion sequence, the time step, and the motion semantic label supplied to the encoding subnetwork.

[0104] Here, $q(X_0)$ is the sample distribution of the original human motion sequences, $X_0$ is an original human motion sequence, $D$ denotes the human motion decoding subnetwork, and $E$ denotes the human motion encoding subnetwork.

[0105] Optionally, the reconstruction loss term can use L2 loss, so that the learned sample distribution continuously approximates the original sample distribution.

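[0106] The geometric loss term $\mathcal{L}_{geo}$ consists of sub-terms including, but not limited to, the following three: the joint position loss term $\mathcal{L}_{pos}$, the foot contact loss term $\mathcal{L}_{foot}$, and the joint velocity loss term $\mathcal{L}_{vel}$.

[0107] 1) Joint position loss term $\mathcal{L}_{pos}$: represents the deviation between the spatial positions of the joints in the motion sequence generated by the model and those in the original sample, defined as $\mathcal{L}_{pos} = \frac{1}{L} \sum_{i=1}^{L} \left\| FK(\theta^{i}) - FK(\hat{\theta}^{i}) \right\|_2^2$.

[0108] Here, $FK(\cdot)$ is the joint transformation function defined on the human motion joint tree, which computes the three-dimensional spatial coordinates of the joints from the rotational representation of the human joints; $\theta^{i}$ is the pose of the $i$-th frame of the original sample, $L$ is the sequence length, and $\hat{\theta}^{i}$ is the $i$-th frame of the model output, i.e., of the sequence produced by the decoding subnetwork from the output of the encoding subnetwork.

[0109] 2) Foot contact loss term $\mathcal{L}_{foot}$: represents the positional deviation of the foot between adjacent motion frames while the foot is in contact with the ground, and prevents foot sliding in the generated human motion sequence. It is defined as $\mathcal{L}_{foot} = \frac{1}{L-1} \sum_{i=1}^{L-1} \left\| \left( FK(\hat{\theta}^{i+1}) - FK(\hat{\theta}^{i}) \right) \cdot f_i \right\|_2^2$.

[0110] Here, $f_i \in \{0, 1\}^{J}$ is a mask indicating whether the human foot is in contact with the ground in frame $i$; its value depends only on the positions of the foot joints. $J$ is the number of joints in the human body.

[0111] 3) Joint velocity loss term $\mathcal{L}_{vel}$: represents the deviation between the joint velocities in the human motion data generated by the model and those in the original sample. This loss term ensures inter-frame smoothness of the generated motion sequence and is defined as $\mathcal{L}_{vel} = \frac{1}{L-1} \sum_{i=1}^{L-1} \left\| (\theta^{i+1} - \theta^{i}) - (\hat{\theta}^{i+1} - \hat{\theta}^{i}) \right\|_2^2$.

[0112] In summary, the loss function in this embodiment can be defined as $\mathcal{L} = \mathcal{L}_{rec} + \lambda_{pos} \mathcal{L}_{pos} + \lambda_{foot} \mathcal{L}_{foot} + \lambda_{vel} \mathcal{L}_{vel}$.

[0113] Here, $\lambda_{pos}$, $\lambda_{foot}$, and $\lambda_{vel}$ are the weight parameters of $\mathcal{L}_{pos}$, $\mathcal{L}_{foot}$, and $\mathcal{L}_{vel}$, respectively.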
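The loss terms of step 2-4 can be sketched numerically as follows. This is an illustrative sketch under several assumptions: `forward_kinematics` is a toy planar chain standing in for the joint transformation function on the real 3D joint tree, each loss is a mean of squared L2 distances over frames (or adjacent frame pairs), sequences have at least two frames, and the default weights of 1.0 are arbitrary. All function names are hypothetical.

```python
import math
from typing import List, Tuple

Pose = List[float]            # one rotation angle per joint (planar toy model)
Point = Tuple[float, float]


def forward_kinematics(pose: Pose) -> List[Point]:
    """Toy stand-in for the joint transformation function.

    Treats the joints as a planar chain of unit-length bones and returns the
    position of each joint. A real joint tree is 3D and branched.
    """
    x = y = angle = 0.0
    points = []
    for joint_angle in pose:
        angle += joint_angle
        x += math.cos(angle)
        y += math.sin(angle)
        points.append((x, y))
    return points


def sq_dist(a: Point, b: Point) -> float:
    """Squared Euclidean distance between two joint positions."""
    return (a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2


def reconstruction_loss(target: List[Pose], pred: List[Pose]) -> float:
    """Mean squared L2 distance between original and generated poses."""
    total = sum((t - p) ** 2
                for tf, pf in zip(target, pred) for t, p in zip(tf, pf))
    return total / len(target)


def position_loss(target: List[Pose], pred: List[Pose]) -> float:
    """Mean squared distance between original and generated joint positions."""
    total = 0.0
    for tf, pf in zip(target, pred):
        total += sum(sq_dist(a, b) for a, b in
                     zip(forward_kinematics(tf), forward_kinematics(pf)))
    return total / len(target)


def foot_contact_loss(pred: List[Pose], contact: List[List[int]]) -> float:
    """Penalize movement of joints flagged as touching the ground.

    contact[i][j] is 1 if joint j is in ground contact in frame i, else 0.
    """
    total = 0.0
    for i in range(len(pred) - 1):
        cur = forward_kinematics(pred[i])
        nxt = forward_kinematics(pred[i + 1])
        total += sum(m * sq_dist(a, b) for m, a, b in zip(contact[i], cur, nxt))
    return total / (len(pred) - 1)


def velocity_loss(target: List[Pose], pred: List[Pose]) -> float:
    """Mean squared difference between original and generated frame deltas."""
    total = 0.0
    for i in range(len(pred) - 1):
        total += sum(((t1 - t0) - (p1 - p0)) ** 2 for t0, t1, p0, p1 in
                     zip(target[i], target[i + 1], pred[i], pred[i + 1]))
    return total / (len(pred) - 1)


def total_loss(target: List[Pose], pred: List[Pose], contact: List[List[int]],
               w_pos: float = 1.0, w_foot: float = 1.0,
               w_vel: float = 1.0) -> float:
    """Reconstruction loss plus the weighted geometric sub-terms."""
    return (reconstruction_loss(target, pred)
            + w_pos * position_loss(target, pred)
            + w_foot * foot_contact_loss(pred, contact)
            + w_vel * velocity_loss(target, pred))
```

Note that the foot contact term depends only on the generated sequence and the contact mask, so it can be non-zero even when the generated sequence matches the original exactly, if the original itself slides.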

[0114] Step 3: Generating human motion sequences.

[0115] After the temporal diffusion generation model in step 2 has been trained to convergence, the weight parameters are loaded into the model to put it into inference mode. When it is necessary to generate a specific type of human motion sequence (i.e., target motion sequence), only the corresponding motion semantic label needs to be specified. The temporal diffusion generation model will then use this label as a constraint, and through random sampling in the learned sample space and inverse denoising process, it can quickly generate human motion sequences with different visual effects, providing materials for the design of humanoid robot motion actions.
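The generation procedure of step 3 can be sketched as a sampling loop. The loop below is an assumption about how the reverse process might be run when the model predicts the original motion data directly: at each time step the model's prediction is noised back to the previous time step, and the final prediction is returned. The names `sample_motion` and `dummy_denoiser` are hypothetical, the sequence is flattened into a list of floats, and `dummy_denoiser` is only a placeholder for a trained model.

```python
import math
import random
from typing import Callable, List

Sequence = List[float]                               # flattened pose sequence
Denoiser = Callable[[Sequence, int, List[int]], Sequence]


def sample_motion(denoiser: Denoiser, label: List[int], size: int,
                  alpha_bars: List[float], rng: random.Random) -> Sequence:
    """Generate one sequence conditioned on a one-hot label.

    alpha_bars[t-1] is the cumulative noise-retention factor at time step t.
    """
    x_t = [rng.gauss(0.0, 1.0) for _ in range(size)]    # start from pure noise
    x0_hat = x_t
    for t in range(len(alpha_bars), 0, -1):
        x0_hat = denoiser(x_t, t, label)                 # predict clean motion
        if t > 1:
            # Noise the prediction back to time step t-1 for the next round.
            keep = alpha_bars[t - 2]
            x_t = [math.sqrt(keep) * v
                   + math.sqrt(1.0 - keep) * rng.gauss(0.0, 1.0)
                   for v in x0_hat]
    return x0_hat


def dummy_denoiser(x_t: Sequence, t: int, label: List[int]) -> Sequence:
    """Placeholder for a trained model: a constant pose chosen by the label."""
    value = float(label.index(1))
    return [value] * len(x_t)
```

Because the starting noise and the noise injected between steps are random, repeated calls with the same label yield different sequences with a real model, which corresponds to generating motion of the same type with different visual effects.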

[0116] The present application proposes a method for generating human motion simulation data directly driven by high-dimensional parameters. Only motion semantic labels need to be specified; the method can generate a sequence of human motion simulation data with motion semantic information and visual continuity. The key points are as follows: 1. The generation process of human motion sequences is modeled as a Markov noise addition / denoising process. Based on this, a diffusion generation model is combined with a temporal network model. The diffusion generation model is used to add noise to generate motion data at different time steps. Then, the temporal network model is used to extract the motion semantic features contained in the motion data, while ensuring the temporal continuity between adjacent motion frames. In addition, the embodiments of this application abandon the traditional approach of predicting noise using diffusion models and directly predict and generate the original motion data. This allows the temporal diffusion generation model to directly perform inverse denoising on the sampling results to generate human motion simulation data.

[0117] 2. To ensure that the generated motion sequence conforms to the motion semantic information and that the noise introduced by sampling is effectively removed, the human motion coding subnetwork in this application uses the motion semantic label and the forward-noising time step t as constraints. It aligns these constraints with the dimension of the motion data using methods such as a multilayer perceptron and linear mapping, and uses the aligned result as the start marker of the motion sequence. This constrains the motion semantic features extracted by the coding subnetwork branch to belong to the corresponding motion type. Furthermore, since the constraints serve only as the start marker of the sequence, the subsequence after the start marker in the output of the human motion coding subnetwork branch is the final motion semantic feature sequence.

[0118] 3. In order to ensure that the human motion sequence generated by the decoding sub-network has both motion semantic information and visual continuity, the human motion decoding sub-network branch of this application uses the inter-frame order of the human motion sequence as temporal information, and fuses the human motion semantic feature sequence extracted by the human motion coding sub-network branch with the temporal information sequence. This processing method enables the decoding sub-network to retain both semantic features and temporal features when generating motion sequences, thereby improving the visual effect of the generated motion sequences.

[0119] This application also provides a humanoid robot motion simulation data generation device, which is described below. The description of the device below and the description of the humanoid robot motion simulation data generation method above may be read with reference to each other.

[0120] Figure 4 is a schematic diagram of the structure of the humanoid robot motion simulation data generation device provided in this embodiment. As shown in Figure 4, the humanoid robot motion simulation data generation device may include a target motion semantic label module 410 and a target motion sequence module 420.

[0121] The target motion semantic label module 410 is used to determine the target motion semantic label from the motion semantic label set; the motion semantic label set is configured with multiple motion semantic labels; The target motion sequence module 420 is used to input the target motion semantic label into the temporal diffusion generation model to obtain the target motion sequence output by the temporal diffusion generation model corresponding to the target motion semantic label; the temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process; the temporal diffusion generation model is constructed and trained by the diffusion generation model and the temporal network model.

[0122] Among them, the diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps; the temporal network model is used to extract motion semantic features and temporal features between adjacent motion frames from the motion data.
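The division of the device into a target motion semantic label module 410 and a target motion sequence module 420 can be pictured as two small components wired together. The sketch below is purely illustrative: the class names, method names, and `stub_model` are hypothetical, and the stub merely stands in for a trained temporal diffusion generation model.

```python
from typing import Callable, List

MotionSequence = List[List[float]]                 # one pose vector per frame
Generator = Callable[[List[int]], MotionSequence]


class TargetLabelModule:
    """Counterpart of module 410: picks a target label from the label set."""

    def __init__(self, label_set: List[str]) -> None:
        if not label_set:
            raise ValueError("label set must not be empty")
        self.label_set = list(label_set)

    def determine(self, name: str) -> List[int]:
        """Return the one-hot label for `name`; ValueError if not in the set."""
        index = self.label_set.index(name)
        return [1 if i == index else 0 for i in range(len(self.label_set))]


class TargetSequenceModule:
    """Counterpart of module 420: feeds the label to the generation model."""

    def __init__(self, model: Generator) -> None:
        self.model = model

    def generate(self, label: List[int]) -> MotionSequence:
        return self.model(label)


class MotionDataDevice:
    """Wires the two modules together."""

    def __init__(self, label_set: List[str], model: Generator) -> None:
        self.label_module = TargetLabelModule(label_set)
        self.sequence_module = TargetSequenceModule(model)

    def run(self, name: str) -> MotionSequence:
        label = self.label_module.determine(name)
        return self.sequence_module.generate(label)


def stub_model(label: List[int]) -> MotionSequence:
    """Placeholder for a trained model: two frames filled with the label index."""
    value = float(label.index(1))
    return [[value, value], [value, value]]
```

For instance, `MotionDataDevice(["walk", "run"], stub_model).run("run")` passes the one-hot label for "run" to the stub and returns its two-frame output.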

[0123] Based on any of the above embodiments, the humanoid robot motion simulation data generation device further includes a motion semantic label set determination module, which is specifically used to: acquire a human motion dataset, and set different motion semantic labels for different types of human motion sequences according to the motion semantic features of the human motion sequences in the human motion dataset, so as to obtain multiple motion semantic labels; and determine a motion semantic label set based on the multiple motion semantic labels.

[0124] Based on any of the above embodiments, the temporal diffusion generation model includes a human motion coding subnetwork and a human motion decoding subnetwork; the human motion coding subnetwork is used to extract the motion semantic features corresponding to the forward-noise-added human motion sequence; the human motion decoding subnetwork is used to receive the motion semantic features extracted by the human motion coding subnetwork, and combine them with the inter-frame temporal features of the human motion sequence to determine and output the human motion sequence.

[0125] Based on any of the above embodiments, the target motion sequence module 420 is further configured to: dimensionally align the motion semantic label and the forward-noising time step, and use the aligned result as the start marker of the motion sequence, so as to constrain the human motion coding subnetwork to extract the motion semantic features corresponding to the motion semantic label.

[0126] Based on any of the above embodiments, the human motion decoding subnetwork is used to fuse the motion semantic features extracted by the human motion coding subnetwork with the inter-frame temporal features, so that the generated human motion sequence has both motion semantic information and visual continuity.

[0127] Based on any of the above embodiments, the humanoid robot motion simulation data generation device further includes a model training module, which is specifically used to: train the initial temporal diffusion generation model through a loss function to obtain a temporal diffusion generation model; the loss function is used to constrain the optimization direction of the model parameters; wherein, the loss function includes a reconstruction loss term and a geometric loss term, the reconstruction loss term is used to make the learned sample distribution continuously approximate the original sample distribution; the geometric loss term is used to ensure the inter-frame smoothness of the generated target motion sequence.

[0128] In another aspect, this application also provides an electronic device. Figure 5 is a schematic diagram of the physical structure of the electronic device provided in the embodiments of this application. As shown in Figure 5, the electronic device may include a memory 520, a processor 510, and a computer program stored in the memory 520 and executable on the processor 510. When the processor 510 executes the program, it can implement a method for generating motion simulation data for a humanoid robot. This method may include: determining a target motion semantic label from a motion semantic label set, where the motion semantic label set contains multiple motion semantic labels; and inputting the target motion semantic label into a temporal diffusion generation model to obtain the target motion sequence, output by the temporal diffusion generation model, corresponding to the target motion semantic label. The temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process, and is constructed and trained from a diffusion generation model and a temporal network model. The diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps; the temporal network model is used to extract, from the motion data, motion semantic features and temporal features between adjacent motion frames.

[0129] Optionally, the electronic device may further include a communication bus 530 and a communication interface 540, wherein the processor 510, the communication interface 540, and the memory 520 communicate with each other through the communication bus 530. The processor 510 can call the computer program in the memory 520 to execute the humanoid robot motion simulation data generation method provided by the above methods.

[0130] Furthermore, the logical instructions in the aforementioned memory 520 can be implemented as software functional units and, when sold or used as independent products, can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this application. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0131] On the other hand, this application also provides a computer program product, which includes a computer program that can be stored on a non-transitory computer-readable storage medium. When the computer program is executed by a processor, the computer can execute the humanoid robot motion simulation data generation method provided by the above methods. The steps and principles of the method have been described in detail in the above methods and will not be repeated here.

[0132] In another aspect, this application also provides a non-transitory computer-readable storage medium storing a computer program thereon. When the computer program is executed by a processor, it implements the humanoid robot motion simulation data generation method provided by the above methods. The steps and principles of the method have been described in detail in the above methods and will not be repeated here.

[0133] Non-transitory computer-readable storage media can be any available medium or data storage device that can be accessed by a processor, including but not limited to magnetic storage (e.g., floppy disks, hard disks, magnetic tapes, magneto-optical disks (MOs), etc.), optical storage (e.g., CDs, DVDs, BDs, HVDs, etc.), and semiconductor storage (e.g., ROMs, EPROMs, EEPROMs, non-volatile memory (NAND flash), solid-state drives (SSDs)).

[0134] The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate. The components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. Those skilled in the art can understand and implement this without any creative effort.

[0135] Through the above description of the embodiments, those skilled in the art can clearly understand that each embodiment can be implemented by means of software plus necessary general-purpose hardware platforms, and of course, it can also be implemented by hardware. Based on this understanding, the above technical solutions, in essence or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as ROM / RAM, magnetic disk, optical disk, etc., and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments or some parts of the embodiments.

[0136] Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of this application, and are not intended to limit them. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that modifications can still be made to the technical solutions described in the foregoing embodiments, or equivalent substitutions can be made to some of the technical features. Such modifications or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this application.

Claims

1. A method for generating motion simulation data for a humanoid robot, characterized in that it comprises: determining a target motion semantic label from a motion semantic label set, wherein the motion semantic label set contains multiple motion semantic labels; and inputting the target motion semantic label into a temporal diffusion generation model to obtain a target motion sequence, output by the temporal diffusion generation model, corresponding to the target motion semantic label; wherein the temporal diffusion generation model is used to model the process of human motion generation as a Markov noise addition and denoising process; the temporal diffusion generation model is constructed and trained from a diffusion generation model and a temporal network model; the diffusion generation model is used to learn the distribution characteristics of human motion data to generate motion data at different time steps; and the temporal network model is used to extract, from the motion data, motion semantic features and temporal features between adjacent motion frames.

2. The method for generating motion simulation data for a humanoid robot according to claim 1, characterized in that, before determining the target motion semantic label from the motion semantic label set, the method further comprises: obtaining a human motion dataset and, based on the motion semantic features of the human motion sequences in the human motion dataset, setting different motion semantic labels for different types of human motion sequences to obtain multiple motion semantic labels; and determining the motion semantic label set based on the multiple motion semantic labels.

3. The method for generating motion simulation data for a humanoid robot according to claim 1, characterized in that the temporal diffusion generation model includes a human motion coding subnetwork and a human motion decoding subnetwork; the human motion coding subnetwork is used to extract motion semantic features corresponding to a forward-noised human motion sequence; and the human motion decoding subnetwork is used to receive the motion semantic features extracted by the human motion coding subnetwork and combine them with the inter-frame temporal features of the human motion sequence to determine and output the human motion sequence.

4. The method for generating motion simulation data for a humanoid robot according to claim 3, characterized in that the method further comprises: dimensionally aligning the motion semantic label and the forward-noising time step, and using the aligned result as the start marker of the motion sequence, so as to constrain the human motion coding subnetwork to extract the motion semantic features corresponding to the motion semantic label.

5. The method for generating motion simulation data for a humanoid robot according to claim 3, characterized in that the human motion decoding subnetwork is used to fuse the motion semantic features extracted by the human motion coding subnetwork with the inter-frame temporal features, so that the generated human motion sequence has both motion semantic information and visual continuity.

6. The method for generating motion simulation data for a humanoid robot according to any one of claims 1 to 5, characterized in that, before inputting the target motion semantic label into the temporal diffusion generation model, the method further comprises: training an initial temporal diffusion generation model using a loss function to obtain the temporal diffusion generation model, the loss function being used to constrain the optimization direction of the model parameters; wherein the loss function includes a reconstruction loss term and a geometric loss term; the reconstruction loss term is used to make the learned sample distribution continuously approximate the original sample distribution; and the geometric loss term is used to ensure the inter-frame smoothness of the generated target motion sequence.
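
The two-term loss in claim 6 can be sketched as follows. The claim gives no formulas, so this example assumes a mean squared error for the reconstruction term and a velocity-matching term (mean squared error between frame-to-frame differences of the predicted and reference sequences) for the geometric term; the weight `geo_weight` and the function names are hypothetical.

```python
def mse(a, b):
    """Mean squared error over two equally shaped lists of frames."""
    diffs = [(x - y) ** 2 for fa, fb in zip(a, b) for x, y in zip(fa, fb)]
    return sum(diffs) / len(diffs)


def frame_velocities(sequence):
    """Differences between each pair of adjacent frames."""
    return [
        [cur - prev for prev, cur in zip(prev_frame, cur_frame)]
        for prev_frame, cur_frame in zip(sequence, sequence[1:])
    ]


def motion_loss(predicted, target, geo_weight=1.0):
    """Return (total, reconstruction, geometric) for one sequence pair."""
    if len(predicted) < 2 or len(predicted) != len(target):
        raise ValueError("need two equally long sequences of at least 2 frames")
    reconstruction = mse(predicted, target)
    # Assumed geometric term: match inter-frame velocities to the reference.
    geometric = mse(frame_velocities(predicted), frame_velocities(target))
    return reconstruction + geo_weight * geometric, reconstruction, geometric


target = [[0.0], [1.0], [2.0]]      # smooth reference motion
predicted = [[0.0], [2.0], [2.0]]   # overshoots the middle frame
total, reconstruction, geometric = motion_loss(predicted, target)
```

In this toy case the single wrong frame costs 1/3 in reconstruction but 1.0 in the geometric term, since it distorts both neighboring velocities; that asymmetry is what pushes the model toward smooth transitions.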

7. A humanoid robot motion simulation data generation device, characterized in that the device comprises: a target motion semantic label module, used to determine a target motion semantic label from a motion semantic label set, the motion semantic label set containing multiple motion semantic labels; and a target motion sequence module, used to input the target motion semantic label into a temporal diffusion generation model to obtain a target motion sequence, output by the temporal diffusion generation model, that corresponds to the target motion semantic label; wherein the temporal diffusion generation model is used to model the human motion generation process as a Markov noise addition and denoising process; the temporal diffusion generation model is constructed from a diffusion generation model and a temporal network model and is obtained by training; the diffusion generation model is used to learn the distribution characteristics of human motion data so as to generate motion data at different time steps; and the temporal network model is used to extract, from the motion data, motion semantic features and temporal features between adjacent motion frames.

8. An electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that, when the processor executes the computer program, the humanoid robot motion simulation data generation method according to any one of claims 1 to 6 is implemented.

9. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, the humanoid robot motion simulation data generation method according to any one of claims 1 to 6 is implemented.

10. A computer program product comprising a computer program, characterized in that, when the computer program is executed by a processor, the humanoid robot motion simulation data generation method according to any one of claims 1 to 6 is implemented.