A physical denoising method and device for generative action
By constructing a physical control architecture of a frozen base and residual adapter and a multi-task adversarial differential discrimination system, physical noise in generative actions is identified and removed, solving the problem of insufficient dynamic constraints in existing models and achieving high-quality physical realism and visually realistic action generation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- 无锡先进内燃动力技术创新中心
- Filing Date
- 2026-03-20
- Publication Date
- 2026-06-19
AI Technical Summary
Existing generative motion models lack explicit dynamic constraints when modeling in kinematic space, resulting in serious physical noise in the generated motion, such as sliding, floating, and clipping, which affects visual realism and physical plausibility, making it difficult to widely apply in virtual reality, game animation, and embodied intelligence scenarios.
A physical control architecture based on residual strategy is constructed. By combining a frozen physical base model and a learnable residual adapter, a multi-task adversarial differential discrimination system and a Bayesian uncertainty weighting mechanism are used to identify and remove physical noise, preserve action intent, and achieve a smooth projection from kinematic space to dynamic space.
It enhances the physical realism and motion stability of generative actions, improves the training robustness of the model when dealing with non-uniform noise data and the ability to preserve high-dynamic action intent, resolves the contradiction between denoising and action intent preservation in existing methods, and achieves high-quality stylized action generation.
Figure CN122244246A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the fields of computer animation and embodied intelligence, and in particular to a physical denoising method and apparatus for generative motion.
Background Technology
[0002] With the development of generative artificial intelligence, especially generative models represented by diffusion models, breakthroughs have been achieved in the field of text-driven 3D human motion generation. Existing motion generation models can generate semantically rich and diverse human motion sequences based on text prompts, demonstrating enormous application potential in fields such as game development, virtual reality, and embodied intelligence.
[0003] However, existing mainstream generative methods typically model and sample within kinematic space and lack explicit dynamic constraints, resulting in severe "physical noise" in the generated motion. Specifically, the generated motion sequences often exhibit foot sliding, limb floating, model clipping, and trajectories that violate the laws of gravity. On one hand, because the human visual system is highly sensitive to the physical plausibility of motion details, these artifacts severely impair visual realism; on the other hand, this physical infeasibility also makes it difficult to use the generated data as effective expert demonstrations for policy training in physical simulation environments. These shortcomings collectively restrict the widespread application of this technology in real-world scenarios such as virtual reality, game animation, and embodied intelligence.
[0004] To address the aforementioned issue of physical noise, existing correction methods mainly fall into two categories, but both have significant limitations. The first category is post-processing methods based on optimization or inverse dynamics. These methods have high computational costs and are prone to losing high-frequency details and style of motion due to over-smoothing. The second category is imitation learning methods based on physical simulation, which train the agent to imitate reference actions through reinforcement learning. However, this type of method often faces a contradiction between denoising and intent preservation when dealing with generated data containing physical noise. On the one hand, if the imitation of the reference action is too strong, the agent will forcibly fit noise such as sliding steps, leading to jittery motion or even simulation collapse. On the other hand, if the imitation constraints are relaxed, the agent may easily ignore the core intent of high-dynamic movements such as jumps and sharp turns, causing the action to degenerate into standing still or simple translation, failing to preserve the semantic essence of the original action while correcting physical errors.
[0005] These issues present a dual challenge to existing technologies in generating complex and dynamically changing human movements: ensuring both visual structural plausibility and physical dynamic consistency. From a visual application perspective, the geometric contact between the human body and its environment in space (such as foot-to-ground contact and hand-to-object interaction) must meet strict topological constraints. Existing clipping and levitation phenomena directly lead to visual distortion and unreliability. From a physical simulation perspective, time-series generation conforming to kinematic principles still needs to satisfy Newton's laws of motion. Existing models struggle to internalize physical laws such as friction and momentum conservation, resulting in inherent mechanical flaws in the generated movements. This dual deficiency in both spatiotemporal dimensions not only limits the expressiveness of generated content in visually intensive scenarios such as film special effects and virtual digital humans but also cuts off its migration path to physically intensive scenarios such as robot control. Therefore, a method for motion generation and denoising that can balance visual realism and physical compliance is urgently needed.
Summary of the Invention
[0006] This invention provides a physical denoising method and apparatus for generative actions. This invention can identify and remove physical noise in generative actions while preserving the action intent, resulting in high-quality stylized actions. It solves the problems of existing generative actions lacking physical realism and existing physical methods being unable to handle noisy data. See the description below for details:
[0007] In a first aspect, a physical denoising method for generative actions is provided, the method comprising:
[0008] Constructing a physical control architecture based on a residual strategy, consisting of a frozen physical base model and a learnable residual adapter; calculating the total control action and inputting it into the physics simulation engine to execute the simulation and obtain a new physical simulation state;
[0009] Constructing a multi-task adversarial differential discrimination system to calculate the feature difference between the physical simulation state and the reference trajectory;
[0010] Introducing an intent-aware physical enhancement mechanism to construct a dynamic enhancement coefficient, and using this coefficient to weight and amplify the root node feature stream to obtain the enhanced feature stream;
[0011] Introducing a Bayesian uncertainty weighting mechanism to construct an adaptive denoising function, constructing a root node discriminator and a pose discriminator respectively, and inputting the enhanced feature stream into the discriminators;
[0012] Based on the generated weighted adversarial reward signal, the residual adapter parameters are updated using the policy gradient algorithm until an action sequence is generated that removes physical artifacts and retains the original style.
[0013] The physical base model serves as a manifold constraint. It is pre-trained on a large-scale motion dataset containing physical noise and outputs basic control signals that maintain the character's fundamental balance, initially projecting the noisy input onto the physical safety boundary;
[0014] The residual adapter, acting as a denoising operator, is a neural network whose initial output is close to zero; it learns small corrections in the action space based on the current state and the reference trajectory data;
[0015] The final output action is formed by superimposing the base signal and the residual signal.
[0016] Calculating the feature difference between the physical simulation state and the reference trajectory comprises:
[0017] Obtaining the simulation state at the current moment and the reference trajectory data, and calculating the feature difference vector; decoupling the feature difference vector into a low-dimensional root node feature stream and a high-dimensional pose feature stream.
[0018] Learnable homoscedastic uncertainty parameters are introduced into the discriminators to represent the inherent noise variances of the root node task and the pose task respectively, and a multi-task joint loss function based on maximum likelihood estimation is constructed.
[0019] In the current denoising/adaptation phase, the network parameters of the physical base model are frozen, and the model outputs the basic control signal.
[0020] The total control action fuses the basic signal and the residual signal to give the final control action received by the physics engine:
[0021] $$a_t = \mathrm{sg}\left(a_t^{\mathrm{base}}\right) + \lambda \, \Delta a_t$$
[0022] where $\lambda$ is the fusion coefficient, $\mathrm{sg}(\cdot)$ is a gradient truncation operator used to prevent gradients from being propagated back to the physical base model, $a_t^{\mathrm{base}}$ is the basic control signal, and $\Delta a_t$ is the residual action correction.
[0023] In a second aspect, a physical denoising apparatus for generative actions is provided, the apparatus comprising a processor and a memory, the memory storing program instructions, and the processor calling the program instructions stored in the memory to cause the apparatus to perform the method described in any implementation of the first aspect.
[0024] In a third aspect, a computer-readable storage medium is provided, storing a computer program that includes program instructions which, when executed by a processor, cause the processor to perform the method described in any implementation of the first aspect.
[0025] The beneficial effects of the technical solution provided by this invention are:
[0026] 1. This invention enhances the physical realism of generative motions while balancing motion stability and noise reduction flexibility. It constructs an innovative "frozen base + residual adapter" physical control architecture. Unlike traditional methods that forcibly fit noisy data, this invention utilizes a frozen base model pre-trained on large-scale data to provide basic balance maintenance capabilities, using it as a physical manifold constraint. At the same time, it focuses on learning tiny corrections in the motion space through a zero-initialized residual adapter. This design not only effectively removes physical noises such as sliding, floating, and clipping common in generative motions, but also avoids motion stiffness or simulation collapse caused by over-optimization of physical constraints, achieving a smooth projection from kinematic space to dynamic space.
[0027] 2. It improves the training robustness of the model when dealing with non-uniform noise data. By decoupling action features through the multi-task adversarial differential discrimination system and introducing Bayesian homoscedastic uncertainty weighting, the model can dynamically adjust the loss weights according to the physical credibility of the input data, thereby achieving soft thresholding of abnormal physical noise and improving the convergence stability of traditional adversarial imitation learning on low-quality data.
[0028] 3. It enhances the ability to retain the intent of high-dynamic actions. The intent-aware physical enhancement mechanism weights sparse high-energy action signals such as jumping, which helps to suppress the "inertia" degradation tendency in physical simulation (such as degenerating into standing still). Thus, while correcting physical errors, it better preserves the semantic consistency and visual style of the original action.
Attached Figure Description
[0029] Figure 1 is a flowchart illustrating the overall architecture of the physical denoising method for generative actions proposed in this invention.
[0030] Figure 2 is a comparison chart showing a generated jumping motion and the denoising results of different methods.
Detailed Implementation
[0031] To make the objectives, technical solutions, and advantages of the present invention clearer, the embodiments of the present invention will be described in further detail below.
[0032] Example 1
[0033] Motion sequences generated by existing generative motion models commonly contain physical noise such as sliding, floating, and clipping, and existing physical correction methods struggle to denoise while preserving motion intent. To address this, the method follows a Gen-to-Sim closed-loop logic of "generation-simulation-denoising." It constructs a "base + adapter" residual control architecture and integrates a multi-task adversarial differential discrimination system (MTL-ADD) within the physical simulation loop. This system, as the key execution module of the method, receives the physical state and reference trajectory, and guides the residual adapter to remove physical noise through multi-task decoupling and uncertainty-weighted calculation. This embodiment provides a physical denoising method for generative motion, shown in Figure 1. The method includes the following steps:
[0034] Step 101: Obtain the original human motion sequence generated by the generative model, perform inverse kinematics on it to map the skeletal joint positions to the joint rotation angles and root node states required by the physical simulation model, construct the reference trajectory data, and input the reference trajectory data into the physical simulation environment;
[0035] Specifically, this step involves obtaining the original human motion sequence generated by the generative model. The sequence contains the position and rotation information of three-dimensional skeletal joints and, due to the lack of dynamic constraints, contains physical noise such as sliding, floating, or clipping. The sequence then undergoes inverse kinematics solving to map the skeletal joint positions to the joint rotation angles and root node states required by the physical simulation model, yielding the physical reference trajectory data. This data is input into the physical simulation environment as the target tracking signal for the subsequent residual adapter.
[0036] Step 102: Construct a physical control architecture based on a residual strategy. The architecture consists of a frozen physical base model and a learnable residual adapter; it calculates the total control action and inputs it into the physics simulation engine to execute the simulation and obtain a new physical simulation state;
[0037] In this architecture, the physical base model acts as a manifold constraint, and its parameters are frozen during the current training phase. It is pre-trained on a large-scale motion dataset containing physical noise and outputs basic control signals that maintain the character's fundamental balance, initially projecting the noisy input onto the physical safety boundary.
[0038] The residual adapter, acting as a denoising operator, is a neural network whose initial output is close to zero; it learns small corrections in the action space based on the current state and the reference trajectory data;
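[0039] The final output action is formed by superimposing the base signal and the residual signal, and the calculation formula is as follows:
[0040] $$q_t^{\mathrm{target}} = \mathrm{sg}\left(a_t^{\mathrm{base}}\right) + \Delta a_t$$
[0041] Or, written in terms of the networks:
[0042] $$q_t^{\mathrm{target}} = \mathrm{sg}\left(\pi_{\mathrm{base}}(s_t, g_t)\right) + \pi_{\mathrm{res}}(s_t, g_t)$$
[0043] where $q_t^{\mathrm{target}}$ is the target joint angle; $\mathrm{sg}(\cdot)$ is a gradient truncation operator used to prevent gradients from being propagated back to the physical base model; $s_t$ is the current state and $g_t$ is the reference trajectory data; $\pi_{\mathrm{base}}$ is the policy network of the physical base model, whose output is the basic control signal $a_t^{\mathrm{base}}$; and $\pi_{\mathrm{res}}$ is the residual adapter network, whose output is the residual action correction $\Delta a_t$. The target $q_t^{\mathrm{target}}$ is input into the physics engine, which drives the virtual character forward by one step and generates the physical simulation state $s_{t+1}$ for the next moment. This state is used for the difference calculation in step 103 at the next time step.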
[0044] Step 103: Construct a multi-task adversarial differential discrimination system (MTL-ADD) and calculate the feature difference between the physical simulation state and the reference trajectory;
[0045] This step includes: obtaining the simulation state at the current moment and the reference trajectory data from step 101, and calculating the feature difference vector;
[0046] The feature difference vector is decoupled into a low-dimensional root node feature stream (including root node position, linear velocity, and angular velocity) and a high-dimensional pose feature stream (including joint local rotation and angular velocity).
[0047] Step 104: Introduce an intent-aware physical enhancement mechanism to protect sparse intent signals;
[0048] To address the issue that highly dynamic actions such as jumping are easily misjudged as noise by uncertainty mechanisms, the physical energy index of the reference trajectory is calculated in real time, and a dynamic enhancement coefficient is constructed from it.
[0049] Here, the vertical linear velocity and the angular velocity of the root node of the reference trajectory are used to characterize the physical energy intensity of the motion. The energy sensitivity coefficient is a constant greater than 0 (set to 10.0 in this embodiment of the invention), used to adjust how strongly the physical energy index drives the error enhancement. ω is the angular velocity balance coefficient, a constant greater than 0 (set to 1.0 in this embodiment of the invention), used to balance the weight ratio of linear velocity and angular velocity in the energy index. The calculated dynamic enhancement coefficient is used to weight and amplify the root node differential signal input to the discriminator.
[0050] The coefficient is applied to the root node feature stream from step 103 as a weighted amplification, yielding the enhanced feature stream, which forces the residual adapter to retain the high-energy action intent.
[0051] Step 105: Introduce a Bayesian uncertainty weighting mechanism to construct an adaptive denoising function; construct a root node discriminator and a pose discriminator respectively, and input the enhanced feature stream obtained in step 104 into the discriminators;
[0052] Learnable homoscedastic uncertainty parameters are introduced into the discriminators to represent the inherent noise variances of the root node task and the pose task respectively, and a multi-task joint loss function based on maximum likelihood estimation (MLE) is constructed,
[0053] in which the two task losses are the adversarial losses of the root node discriminator and the pose discriminator, respectively.
[0054] During training, the uncertainty parameters are updated automatically according to the physical credibility of the input data. When the reference motion contains physically infeasible sliding noise, so that the root node loss cannot be reduced, the model automatically increases the root node uncertainty parameter, which reduces the weight of the root node loss in the first term. This achieves soft-threshold denoising.
[0055] Step 106: Joint training.
[0056] Based on the weighted adversarial reward signal generated in steps 103-105, the residual adapter parameters in step 102 are updated using a policy gradient algorithm (such as PPO) until an action sequence that removes physical artifacts and retains the original style is generated.
[0057] The operating environment for this embodiment of the invention is the NVIDIA Isaac Gym high-performance physics simulation platform. The simulation physics engine uses PhysX, with a simulation frequency set to 60Hz (i.e., simulating 60 physical state updates per second) and a control frequency set to 30Hz (i.e., performing 30 neural network inferences and action updates per second). The experimental hardware platform is equipped with an NVIDIA 4070Ti GPU.
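The relationship between the two frequencies can be illustrated with a generic control loop: at 60 Hz simulation and 30 Hz control, each policy action is held for two physics steps. The sketch below uses plain Python and toy stand-ins; `run_episode`, `policy`, and `physics_step` are hypothetical names, not Isaac Gym or PhysX API calls, and how the actual platform schedules substeps is not described in this embodiment.

```python
SIM_HZ = 60      # physics state updates per second (from the embodiment)
CONTROL_HZ = 30  # policy inferences per second (from the embodiment)
SUBSTEPS = SIM_HZ // CONTROL_HZ  # physics steps executed per policy action


def run_episode(policy, physics_step, state, seconds):
    """Hold each policy action for SUBSTEPS physics steps."""
    n_inferences = 0
    n_physics_steps = 0
    for _ in range(int(seconds * CONTROL_HZ)):
        action = policy(state)
        n_inferences += 1
        for _ in range(SUBSTEPS):
            state = physics_step(state, action)
            n_physics_steps += 1
    return state, n_inferences, n_physics_steps


# Toy stand-ins: a 1-D point whose velocity is set directly by the action.
final_state, n_inf, n_phys = run_episode(
    policy=lambda s: 1.0,
    physics_step=lambda s, a: s + a / SIM_HZ,
    state=0.0,
    seconds=2,
)
```

Over two simulated seconds this performs 60 policy inferences and 120 physics steps.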
[0058] In summary, the embodiments of the present invention can identify and remove physical noise in generative actions through the above steps 101-106, while retaining the action intent, to obtain high-quality stylized actions, thus solving the problems of existing generative actions lacking physical realism and existing physical methods being unable to handle noisy data.
[0059] Example 2
[0060] The acquisition of generative data and dynamic preprocessing in Example 1 will be further described below with reference to specific calculation formulas:
[0061] The embodiments of the present invention first solve the data mapping problem from "semantic space" to "kinematic space" and then to "dynamic control space".
[0062] 1.1 Obtaining Generative Action Sequences
[0063] The user inputs a text description (for example, "a person continuously jumping in place" or "a person running fast and suddenly making a sharp turn"). This embodiment of the invention uses a pre-trained Motion Diffusion Model (MDM) as the generative model. Based on the Transformer architecture, MDM performs a denoising diffusion process in the latent space and outputs an original human motion sequence of a given length. Each element of the sequence gives the real-valued three-dimensional spatial coordinates of the human body key points (corresponding to skeletal nodes in the SMPL model) in one frame.
[0064] Because MDM lacks explicit physical constraints, the original sequence typically includes the following physical noise:
[0065] Foot sliding: When the foot joints contact the ground, their horizontal velocity is not zero, producing a visible "skating" effect.
[0066] Levitation: In non-jumping actions, both feet leave the ground simultaneously, or the root node height is abnormal, violating the laws of gravity.
[0067] Clipping: Body parts penetrating the ground or other parts of the body.
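The three noise types above can be made concrete with simple per-frame checks on foot positions. The following is an illustrative sketch only: the function name `detect_physical_noise`, both thresholds, and the convention of a vertical z axis with the ground at z = 0 are assumptions, not values from this embodiment, and body self-penetration is not checked.

```python
from math import hypot


def detect_physical_noise(foot_positions, dt, contact_height=0.02,
                          slide_speed=0.05, jumping=False):
    """Flag sliding, floating, and ground clipping frame by frame.

    foot_positions: list of frames; each frame is a list of (x, y, z) tuples,
    one per foot, with z vertical and the ground at z = 0 (assumed convention).
    """
    flags = {"sliding": [], "floating": [], "clipping": []}
    for t in range(1, len(foot_positions)):
        prev, curr = foot_positions[t - 1], foot_positions[t]
        sliding = False
        for (x0, y0, _), (x1, y1, z1) in zip(prev, curr):
            in_contact = 0.0 <= z1 <= contact_height
            horizontal_speed = hypot(x1 - x0, y1 - y0) / dt
            if in_contact and horizontal_speed > slide_speed:
                sliding = True  # foot touches the ground but still moves
        # Floating: every foot is off the ground outside a jump.
        floating = (not jumping) and all(z > contact_height for _, _, z in curr)
        # Clipping: any foot is below the ground plane.
        clipping = any(z < 0.0 for _, _, z in curr)
        flags["sliding"].append(sliding)
        flags["floating"].append(floating)
        flags["clipping"].append(clipping)
    return flags


frames = [
    [(0.00, 0.0, 0.01), (0.0, 0.2, 0.01)],
    [(0.10, 0.0, 0.01), (0.0, 0.2, 0.01)],   # first foot moves 0.1 m while in contact
    [(0.10, 0.0, 0.30), (0.0, 0.2, 0.30)],   # both feet off the ground
    [(0.10, 0.0, -0.05), (0.0, 0.2, 0.01)],  # first foot below the ground
]
flags = detect_physical_noise(frames, dt=1.0 / 30.0)
```

Each flag list has one entry per frame transition, so the toy sequence yields one sliding, one floating, and one clipping detection.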
[0068] 1.2 Inverse Kinematics (IK) Solution and Retargeting
[0069] Since agents in physics simulation engines are typically driven by joint motors, the input needs to be the target joint rotation angle rather than the joint position. Therefore, the kinematic position data must be mapped to joint angle data. This embodiment of the invention employs an optimization-based inverse kinematics (IK) method.
[0070] For each frame in the sequence, the variables to be solved comprise the rotation angles of the 23 joints of the SMPL model (expressed in axis-angle form), the world coordinates of the root node (pelvis), and the rotation quaternion of the root node.
[0071] Construct the following optimization objective function:
[0072]
[0073] Position error term: the Euclidean distance between the joint positions solved by IK and the positions generated by MDM:
[0074]
[0075] Here, the forward kinematics function calculates the position of each joint, and each joint has a weight (2.0 for end effectors such as hands and feet, 1.0 for all other joints).
[0076] Regularization term: constrains the degree to which joint angles deviate from the natural posture of the human body, preventing unnatural twisting such as reversed joints:
[0077]
[0078] Here, the rest-pose angles are the joint rotation angles corresponding to the "rest pose" of the human model, and the solved angles are the rotation angles of all skeletal joints of the human body model in the given frame.
[0079] Temporal smoothing term: constrains the amount of change between adjacent frames to prevent motion jitter (effective only for frames that have a preceding frame):
[0080]
[0081] In this embodiment of the invention, the weights of the three terms are preset. The L-BFGS algorithm is used to iteratively solve the above objective function until convergence or until the maximum number of iterations (e.g., 100) is reached. The optimized sequence is defined as the physical reference trajectory data, which is used as input to step 102 in Example 1 to guide the training of the residual adapter.
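The structure of the three-term objective can be sketched on a toy planar two-link chain instead of the SMPL skeleton. Several things here are assumptions made for illustration: the squared-distance form of each term, the term weights `W_POS`, `W_REG`, and `W_SMOOTH`, the link lengths, and the rest pose. The solver is also swapped: a finite-difference descent with step halving stands in for L-BFGS to keep the example free of dependencies.

```python
from math import cos, sin

L1, L2 = 1.0, 1.0                        # link lengths of the toy chain (assumed)
JOINT_WEIGHTS = (1.0, 2.0)               # inner joint 1.0, end effector 2.0 (as in the text)
W_POS, W_REG, W_SMOOTH = 1.0, 0.01, 0.1  # term weights: assumed values
REST = (0.0, 0.0)                        # rest-pose angles (assumed)


def forward_kinematics(theta):
    """Return the positions of the inner joint and the end effector."""
    x1, y1 = L1 * cos(theta[0]), L1 * sin(theta[0])
    x2 = x1 + L2 * cos(theta[0] + theta[1])
    y2 = y1 + L2 * sin(theta[0] + theta[1])
    return ((x1, y1), (x2, y2))


def objective(theta, targets, prev_theta=None):
    """Weighted sum of position, regularization, and smoothing terms."""
    e_pos = sum(w * ((px - tx) ** 2 + (py - ty) ** 2)
                for w, (px, py), (tx, ty)
                in zip(JOINT_WEIGHTS, forward_kinematics(theta), targets))
    e_reg = sum((a - r) ** 2 for a, r in zip(theta, REST))
    e_smooth = 0.0
    if prev_theta is not None:           # smoothing needs a preceding frame
        e_smooth = sum((a - b) ** 2 for a, b in zip(theta, prev_theta))
    return W_POS * e_pos + W_REG * e_reg + W_SMOOTH * e_smooth


def solve_frame(targets, init, prev_theta=None, max_iters=100, eps=1e-6):
    """Minimize the objective by finite-difference descent with step halving."""
    theta, step = list(init), 0.5
    value = objective(theta, targets, prev_theta)
    for _ in range(max_iters):
        grad = []
        for i in range(len(theta)):
            bumped = list(theta)
            bumped[i] += eps
            grad.append((objective(bumped, targets, prev_theta) - value) / eps)
        candidate = [a - step * g for a, g in zip(theta, grad)]
        new_value = objective(candidate, targets, prev_theta)
        if new_value < value:
            theta, value = candidate, new_value
        else:
            step *= 0.5                  # reject the step and shrink it
    return tuple(theta), value


# Track a two-frame target sequence, warm-starting from the previous solution.
target_frames = [forward_kinematics((0.3, 0.4)), forward_kinematics((0.35, 0.5))]
solution, prev = [], None
for targets in target_frames:
    start = prev if prev is not None else REST
    prev, final_value = solve_frame(targets, start, prev)
    solution.append(prev)
```

Because a step is accepted only when it lowers the objective, the value never increases across iterations; a quasi-Newton solver such as L-BFGS would normally converge in far fewer iterations.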
[0082] Example 3
[0083] The following describes step 102 (constructing a residual physical control architecture and executing actions) in Example 1 in further detail with specific calculation formulas. This embodiment of the invention addresses how to generate physically compliant actions using a base and adapter, and drive the simulation environment to update its state.
[0084] To address the contradiction in traditional physical imitation learning when faced with noisy data—either overfitting to noise leading to jitter or underfitting leading to action loss—this invention proposes a residual strategy network architecture consisting of a "Base + Adapter".
[0085] 3.1 Observation Space Design
[0086] In each physical simulation step, the input state of the agent consists of two parts: proprioceptive information and reference target information.
[0087] The proprioceptive information includes the root node's height, rotation matrix, linear velocity, and angular velocity, together with the local rotation angles and angular velocities of all joints. Its dimension is approximately... The reference target includes reference motion features from the current frame and several future frames (such as 0.1 s, 0.2 s, and 0.5 s ahead), which gives the agent predictive capability. The two parts are concatenated into the total observation vector, whose total dimension is approximately...
[0088] 3.2 Physical Base Model (Base Policy)
[0089] The base model acts as a "physical manifold constraint." Its core task is to ensure that the agent possesses basic movement capabilities (such as standing, walking, and running without falling) in the simulation environment and is robust to non-physical noise. The network structure adopts a multilayer perceptron (MLP), with the following structure:
[0090] Input(180) -> FC(1024) -> ReLU -> FC(1024) -> ReLU -> FC(ActionDim). The output layer uses the Tanh activation function to restrict the action to a bounded range.
[0091] The base model is pre-trained. During pre-training, MDM-generated data containing sliding noise is intentionally used as the training set, combined with the uncertainty mechanism of the multi-task adversarial differential discrimination system (MTL-ADD) described later. This forces the base model to learn to "ignore sliding": because sliding is physically unreproducible, the base model converges to a "stable gait without sliding" strategy to maximize survival rewards. In the current denoising/adaptation phase, the network parameters of the base model are frozen (requires_grad=False), and the model outputs the basic control signal.
[0092] 3.3 Residual Adapter
[0093] The adapter acts as a "denoising projection operator." Its task is to learn small corrections on top of the stable gait provided by the base model, recovering stylistic details of the movement (such as specific arm swing amplitude and torso tilt angle) and responding to highly dynamic intent signals (such as the push-off force during a jump). The network is an MLP with the same structure as the base model (two hidden layers of 1024 units).
[0094] Zero initialization is crucial for successful residual learning. The weights and biases of the last output layer of the adapter are initialized to 0, so at the start of training (step 0) the residual output is zero and the total action equals the base output. The agent therefore has the stability of the base model from the beginning, which avoids the "falling over at the start" problem caused by random initialization and greatly accelerates convergence.
[0095] Based on the current state and the reference trajectory, the adapter outputs the residual action correction.
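The effect of zero-initializing the adapter's output layer can be checked with a toy network. This is a minimal sketch with hypothetical helper names (`make_layer`, `dense`, `adapter`) and toy layer sizes rather than the 180-input, 1024-unit networks of this embodiment; the base action is a hard-coded stand-in for the frozen base model's output.

```python
import random


def make_layer(n_in, n_out, zero=False, rng=None):
    """Create a dense layer as (weights, biases); zero=True zero-initializes it."""
    if zero:
        return [[0.0] * n_in for _ in range(n_out)], [0.0] * n_out
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [rng.uniform(-0.1, 0.1) for _ in range(n_out)]


def dense(layer, x):
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x):
    return [max(0.0, v) for v in x]


rng = random.Random(0)
OBS_DIM, HIDDEN, ACTION_DIM = 6, 8, 3   # toy sizes, not those of the embodiment

# Residual adapter: random hidden layer, zero-initialized output layer.
adapter_hidden = make_layer(OBS_DIM, HIDDEN, rng=rng)
adapter_out = make_layer(HIDDEN, ACTION_DIM, zero=True)


def adapter(obs):
    return dense(adapter_out, relu(dense(adapter_hidden, obs)))


obs = [rng.uniform(-1.0, 1.0) for _ in range(OBS_DIM)]
base_action = [0.2, -0.1, 0.05]         # stand-in for the frozen base output
residual = adapter(obs)
total_action = [b + r for b, r in zip(base_action, residual)]
```

With an all-zero output layer the residual is exactly zero for any observation, so the total action at step 0 is the base action unchanged. Gradients still reach the output layer's weights because the hidden activations are nonzero, so the adapter can move away from zero during training.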
[0096] 3.4 Motion Fusion and Execution
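[0097] The basic signal and the residual signal are fused to obtain the final control action $a_t$ received by the physics engine:
[0098] $$a_t = \mathrm{sg}\left(a_t^{\mathrm{base}}\right) + \lambda \, \Delta a_t$$
[0099] where $\lambda$ is the fusion coefficient (set to 1.0 in this embodiment of the invention), $\mathrm{sg}(\cdot)$ is the gradient truncation operator, $a_t^{\mathrm{base}}$ is the basic control signal, and $\Delta a_t$ is the residual action correction. The action $a_t$ is interpreted as the target position of the joint and input to the underlying PD controller to calculate the joint torque $\tau$:
[0100] $$\tau = k_p \left(a_t - q\right) - k_d \, \dot{q}$$
[0101] where $k_p$ and $k_d$ are the stiffness coefficient and damping coefficient, respectively, and $q$ and $\dot{q}$ are the current joint angle and angular velocity.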
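The fusion and PD steps can be traced numerically with a short sketch. The function names and all numeric values (gains, angles, velocities) are assumed for illustration; the PD law used is the standard stiffness-times-position-error minus damping-times-velocity form. Plain floats carry no gradients, so the gradient truncation is only noted in a comment.

```python
def fuse_actions(base_action, residual, fusion_coeff=1.0):
    """Total action = base signal + fusion_coeff * residual correction.

    In an autodiff framework the base signal would be detached (stop-gradient)
    so that only the adapter receives gradients; plain floats need no such step.
    """
    return [b + fusion_coeff * r for b, r in zip(base_action, residual)]


def pd_torque(target, angle, velocity, stiffness, damping):
    """PD control law: stiffness * (target - angle) - damping * velocity."""
    return [stiffness * (t - q) - damping * dq
            for t, q, dq in zip(target, angle, velocity)]


base_action = [0.50, -0.20]      # base control signal (toy values)
residual = [0.10, 0.05]          # adapter correction (toy values)
target = fuse_actions(base_action, residual, fusion_coeff=1.0)
torque = pd_torque(target, angle=[0.40, -0.10], velocity=[1.0, 0.0],
                   stiffness=100.0, damping=10.0)
```

For the first joint the position error is 0.2 rad, giving 20 from the stiffness term and -10 from the damping term, for a torque of about 10; the second joint is at rest with an error of -0.05 rad, giving about -5.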
[0102] Example 4
[0103] The following detailed description of step 103 in Example 1, using specific calculation formulas, focuses on how to prepare the feature data stream for discrimination.
[0104] In this embodiment of the invention, the input to the discriminator is not the original state of the agent, but rather the difference vector between the agent's generated action and the reference action.
[0105] First, obtain the current physical simulation state output by Example 3 at the previous moment and the reference trajectory data output in Example 2, and calculate the difference between the two at the level of physical features. To prevent high-dimensional pose noise from obscuring key low-dimensional physical features, the system splits the difference vector into two independent streams:
[0106] The first data stream is the root node feature stream (Root Stream). It includes root node height (1D), root node rotation (6D representation), root node linear velocity (3D), and root node angular velocity (3D), for a total dimension of approximately 13 to 15. This stream mainly characterizes the agent's overall movement trend, balance state, and whether sliding occurs.
[0107] The second data stream is the pose feature stream (Pose Stream). It includes the local rotations (6D representation) of all 23 joints except the root node, the joint angular velocities (3D), and the relative positions of key end effectors (such as hands and feet), for a total dimension greater than 200. This stream mainly characterizes the stylistic details, posture, and coordination of movements.
[0108] The two feature streams obtained above are passed to Example 5 as input data for the discriminators.
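The stated dimensions can be checked by adding up the listed components. In the sketch below the number of end effectors (four: two hands and two feet), the use of 3-D relative positions for them, and the root-first ordering of the flat vector are assumptions; the per-feature sizes are those listed above.

```python
NUM_JOINTS = 23            # non-root joints, as stated in the text
ROT6D, VEC3, HEIGHT = 6, 3, 1
NUM_END_EFFECTORS = 4      # assumption: two hands and two feet

ROOT_LAYOUT = [("height", HEIGHT), ("rotation", ROT6D),
               ("linear_velocity", VEC3), ("angular_velocity", VEC3)]
POSE_LAYOUT = [("joint_rotation", NUM_JOINTS * ROT6D),
               ("joint_angular_velocity", NUM_JOINTS * VEC3),
               ("end_effector_position", NUM_END_EFFECTORS * VEC3)]

ROOT_DIM = sum(size for _, size in ROOT_LAYOUT)
POSE_DIM = sum(size for _, size in POSE_LAYOUT)


def feature_difference(sim_features, ref_features):
    """Element-wise difference between simulated and reference features."""
    return [s - r for s, r in zip(sim_features, ref_features)]


def split_streams(diff):
    """Split a flat difference vector into (root stream, pose stream)."""
    if len(diff) != ROOT_DIM + POSE_DIM:
        raise ValueError("unexpected feature length")
    return diff[:ROOT_DIM], diff[ROOT_DIM:]


sim = [0.1] * (ROOT_DIM + POSE_DIM)
ref = [0.0] * (ROOT_DIM + POSE_DIM)
root_stream, pose_stream = split_streams(feature_difference(sim, ref))
```

Under these assumptions the root stream has 1 + 6 + 3 + 3 = 13 dimensions and the pose stream has 138 + 69 + 12 = 219, consistent with the stated "approximately 13 to 15" and "greater than 200".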
[0109] Example 5
[0110] The following describes steps 104 and 105 in Example 1 in further detail with specific calculation formulas. This example illustrates how to process the input data to preserve sparse intent signals and how to construct the adaptive loss and reward functions.
[0111] 5.1 Intention-Aware Physical Enhancement
[0112] Simply relying on uncertainty mechanisms may cause the model to incorrectly treat "high-difficulty jumps" as noise and ignore them. Therefore, this embodiment introduces a physics-based enhancement mechanism.
[0113] Specifically, in each simulation step, the vertical velocity and the angular velocity of the root node of the reference trajectory are calculated in real time, and the dynamic enhancement coefficient is calculated from them:
[0114]
[0115] In this embodiment of the invention, the energy sensitivity coefficient is set to 10.0 and the angular velocity balance coefficient ω is set to 1.0.
[0116] Before the root node feature error is input to the discriminator, the system multiplies its height error term by the dynamic enhancement coefficient. When the reference action is a jump (i.e., when the vertical velocity is relatively large), the coefficient spikes to between 5.0 and 10.0, drastically amplifying the error signal input to the discriminator and yielding the enhanced root node feature stream.
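The weighting step can be illustrated as follows. The exact formula for the coefficient is not reproduced in this text, so the functional form in `enhancement_coefficient` (one plus the sensitivity times a quadratic energy index) is a hypothetical stand-in chosen only so that a jump lands in the stated 5.0 to 10.0 range; the two constants are the values given in this embodiment, and placing the height error at index 0 is also an assumption.

```python
ENERGY_SENSITIVITY = 10.0   # value given in the text
ANGULAR_BALANCE = 1.0       # value given in the text


def enhancement_coefficient(vertical_velocity, angular_velocity):
    """Hypothetical form: 1 + k * (v_z^2 + w * |omega|^2). Not the patent's formula."""
    energy = vertical_velocity ** 2 + ANGULAR_BALANCE * sum(
        c * c for c in angular_velocity)
    return 1.0 + ENERGY_SENSITIVITY * energy


def enhance_root_stream(root_error, coefficient, height_index=0):
    """Scale only the height error term; other root features pass through."""
    enhanced = list(root_error)
    enhanced[height_index] *= coefficient
    return enhanced


standing = enhancement_coefficient(0.0, (0.0, 0.0, 0.0))
jumping = enhancement_coefficient(0.7, (0.0, 0.0, 0.0))
enhanced = enhance_root_stream([0.05, 0.01, 0.02], jumping)
```

With this stand-in, a stationary reference leaves the error unchanged (coefficient 1.0), while a reference take-off speed of 0.7 m/s multiplies the height error by roughly 5.9.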
[0117] 5.2 Construction of Independent Dual-Stream Discriminator Architecture
[0118] Before calculating the loss, we first construct two independent neural networks with completely different parameters as discriminators:
[0119] Root node discriminator: receives the enhanced root node feature stream as input and uses a small network configuration. The specific structure is a fully connected layer FC(256) followed by a ReLU activation, then FC(128) followed by ReLU, and finally a 1-dimensional scalar output.
[0120] Pose discriminator: receives the raw pose feature stream as input. Given the high dimensionality of the pose features and the complex nonlinear relationships they contain, a large network configuration is adopted. The specific structure is FC(1024) followed by ReLU, then FC(512) followed by ReLU, and finally a 1-dimensional scalar output.
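The two layer stacks can be sketched in dependency-free Python to make the size difference concrete. The layer widths follow the text; everything else is an assumption for illustration: the root feature size `ROOT_DIM = 13`, the pose size computed for 4 end effectors with 3D positions, the He-style initialisation, and the helper names `make_mlp`, `forward`, and `n_params`.

```python
import random


def make_mlp(sizes, seed=0):
    """Create weights for a fully connected net with the given layer sizes."""
    rng = random.Random(seed)
    layers = []
    for n_in, n_out in zip(sizes[:-1], sizes[1:]):
        scale = (2.0 / n_in) ** 0.5  # He-style initialisation (an assumption)
        w = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
        b = [0.0] * n_out
        layers.append((w, b))
    return layers


def forward(layers, x):
    """FC + ReLU for hidden layers, plain linear 1-D output for the last layer."""
    for i, (w, b) in enumerate(layers):
        x = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
        if i < len(layers) - 1:
            x = [max(0.0, v) for v in x]
    return x[0]


def n_params(layers):
    """Count weights and biases."""
    return sum(len(w) * len(w[0]) + len(b) for w, b in layers)


ROOT_DIM = 13                        # hypothetical root feature size
POSE_DIM = 23 * 6 + 23 * 3 + 4 * 3   # 219, assuming 4 end effectors with 3-D positions

d_root = make_mlp([ROOT_DIM, 256, 128, 1], seed=1)    # small network
d_pose = make_mlp([POSE_DIM, 1024, 512, 1], seed=2)   # large network

score_root = forward(d_root, [0.0] * ROOT_DIM)
score_pose = forward(d_pose, [0.1] * POSE_DIM)
print(n_params(d_root), n_params(d_pose))
```

Under these assumed input sizes the pose discriminator has roughly twenty times as many parameters as the root node discriminator, and the two networks share no weights.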
[0121] 5.3 Bayesian Uncertainty Modeling and Loss and Reward Calculation
[0122] In the Multi-Task Adversarial Learning (MTL-ADD) model, two learnable parameters are introduced. They are, respectively, the log-variances of the root node discrimination task and the pose discrimination task (i.e., each parameter is the logarithm of the noise variance of its task).
[0123] During the initialization phase, the root node parameter is set to a value corresponding to a smaller variance, which means that its initial weight is relatively high, while the pose parameter is set to a value corresponding to the standard weight. This initialization strategy aims to give the root node task a higher fitting priority during the early stages of training.
[0124] Based on the principle of maximum likelihood estimation, the following loss function is constructed:
[0125]
[0126] where the root node discrimination loss and the pose discrimination loss are each the standard least-squares GAN loss.
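A minimal sketch of how such a loss could be assembled, assuming the widely used homoscedastic-uncertainty weighting in which each task loss L is scaled by exp(-s) and regularised by +s, where s is the log-variance of the task. The exact coefficients of the formula in [0125] may differ (for example by factors of 1/2). The function names, the least-squares targets (1 for real, 0 for fake), and the numeric values are illustrative assumptions.

```python
import math


def lsgan_d_loss(d_real, d_fake):
    """Least-squares GAN discriminator loss with targets 1 (real) and 0 (fake)."""
    real_term = sum((d - 1.0) ** 2 for d in d_real) / len(d_real)
    fake_term = sum(d ** 2 for d in d_fake) / len(d_fake)
    return 0.5 * (real_term + fake_term)


def uncertainty_weighted_loss(loss_root, loss_pose, s_root, s_pose):
    """Assumed form: sum over tasks of exp(-s) * L + s, with s = log(sigma^2)."""
    return (math.exp(-s_root) * loss_root + s_root
            + math.exp(-s_pose) * loss_pose + s_pose)


# Hypothetical initialisation: a negative log-variance gives the root task a
# weight above 1, while 0.0 gives the pose task the standard weight of 1.
s_root, s_pose = -1.0, 0.0
total = uncertainty_weighted_loss(0.3, 0.4, s_root, s_pose)
print(math.exp(-s_root), math.exp(-s_pose), total)
```

The `+ s` terms keep the optimizer from driving both weights to zero: raising a log-variance lowers the weighted loss term but is charged linearly.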
[0127] The adaptive denoising principle of this mechanism is as follows: During training, the optimizer strives to minimize the total loss. When the reference action generated by MDM contains severe non-physical sliding, the agent constrained by the physics engine cannot reproduce that sliding, which makes it easy for the discriminator to distinguish real samples from fake ones and causes the corresponding task loss to increase significantly.
[0128] To reduce the total loss, the model automatically tends to increase the log-variance parameter of that task (i.e., to increase its uncertainty), which is mathematically equivalent to reducing the weighting coefficient of that task. Through this mechanism, the model automatically learns to "ignore" sliding noise that is physically unreproducible, thus achieving soft-threshold denoising.
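The down-weighting effect can be made explicit with a short worked derivation. It assumes a per-task term of the common form $e^{-s} L + s$, where $L > 0$ is the task loss and $s$ its log-variance; the symbols are illustrative and the coefficients of the source formula may differ.

```latex
\ell(s) = e^{-s} L + s
\quad\Longrightarrow\quad
\frac{\partial \ell}{\partial s} = -e^{-s} L + 1 = 0
\quad\Longrightarrow\quad
s^{*} = \ln L, \qquad e^{-s^{*}} = \frac{1}{L}.
```

Under this assumption the equilibrium weight of a task is inversely proportional to its loss, so a task whose loss stays large because the reference cannot be reproduced physically is attenuated smoothly rather than cut off, which is consistent with the soft-threshold behaviour described above.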
[0129] To drive the residual adapter to generate actions that conform to physical laws, the reward signal is calculated from the logit output by the discriminator. The calculation formula is as follows:
[0130]
[0131] Finally, a weighted adversarial reward signal is calculated from the discriminator outputs. This serves as the basis for the policy update in Example 6.
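One way such a reward could be computed is sketched below. The formula of [0130] is not available here, so both ingredients are assumptions: the per-stream reward uses the form -log(max(1 - sigmoid(logit), eps)), which is common in adversarial imitation learning, and the two streams are combined with weights exp(-s) normalised to sum to 1. All function and parameter names are hypothetical.

```python
import math


def style_reward(logit, eps=1e-4):
    """Reward from a discriminator logit: -log(max(1 - sigmoid(logit), eps))."""
    logit = max(-30.0, min(30.0, logit))  # clamp to avoid overflow in exp
    prob = 1.0 / (1.0 + math.exp(-logit))
    return -math.log(max(1.0 - prob, eps))


def weighted_adversarial_reward(logit_root, logit_pose, s_root, s_pose):
    """Combine both stream rewards with weights exp(-s), normalised to sum to 1."""
    w_root, w_pose = math.exp(-s_root), math.exp(-s_pose)
    r_root, r_pose = style_reward(logit_root), style_reward(logit_pose)
    return (w_root * r_root + w_pose * r_pose) / (w_root + w_pose)


# Equal log-variances: the result is the plain mean of the two stream rewards.
r_equal = weighted_adversarial_reward(-5.0, 5.0, 0.0, 0.0)
# A larger root log-variance shrinks the influence of the poorly scored root stream.
r_downweighted = weighted_adversarial_reward(-5.0, 5.0, 3.0, 0.0)
print(r_equal, r_downweighted)
```

In this sketch, a stream whose uncertainty has grown contributes less to the reward, so the policy is not penalised heavily for failing to reproduce motion that the physics engine cannot realise.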
[0132] Example 6
[0133] The following describes step 106 (joint training process) in Example 1 in further detail with specific training parameters. This example illustrates how to use the reward signal generated in the aforementioned steps to update the model parameters in a closed loop.
[0134] 6.1 Policy Network Update
[0135] Input Acquisition: Acquire the weighted adversarial reward signal output from Example 5.
[0136] Algorithm application: The Proximal Policy Optimization (PPO) algorithm is used, with the weighted adversarial reward as the basis for estimating the advantage function, to fine-tune the residual adapter of Example 3.
[0137] Update goal: Maximize the cumulative reward, so that the residual correction output by the adapter both deceives the discriminator (satisfying physical realism) and retains the original action intent.
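For reference, the clipped surrogate objective at the core of PPO can be written in a few lines. This is a generic sketch of the standard PPO-Clip objective, not the specific implementation of this embodiment; the clip range `clip_eps=0.2` is a common default and is not taken from the source.

```python
def ppo_clip_objective(ratios, advantages, clip_eps=0.2):
    """Mean PPO clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratios:     new-policy / old-policy probability ratios per sample
    advantages: advantage estimates per sample (here derived from the
                weighted adversarial reward)
    """
    terms = []
    for r, a in zip(ratios, advantages):
        clipped = max(1.0 - clip_eps, min(1.0 + clip_eps, r))
        terms.append(min(r * a, clipped * a))
    return sum(terms) / len(terms)


# A large ratio with a positive advantage is capped at (1 + eps) * A.
print(ppo_clip_objective([1.5], [1.0]))
# A small ratio with a negative advantage is held at (1 - eps) * A.
print(ppo_clip_objective([0.5], [-1.0]))
```

The clipping limits how far a single update can move the adapter away from the policy that collected the samples, which suits a residual that is only meant to make small corrections.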
[0138] 6.2 Training parameter configuration:
[0139] Optimizer and learning rate: The Adam optimizer is used. The learning rate of the adapter policy network is set to ... to maintain training stability; the learning rate of the discriminator is set to ...
[0140] Gradient penalty: To prevent the discriminators from becoming overconfident in the independent dual-stream architecture, a gradient penalty coefficient is set. The penalty pushes each discriminator toward satisfying the 1-Lipschitz continuity constraint, so that it can still transmit effective gradient signals while maintaining high accuracy.
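A gradient penalty that targets the 1-Lipschitz constraint is often written in the following two-sided form. The source does not state which variant is used, so this form, the symbol $\lambda_{gp}$ for the penalty coefficient, and the choice of sample points $\hat{x}$ are assumptions; one-sided or zero-centred penalties are equally plausible.

```latex
\mathcal{L}_{gp} = \lambda_{gp}\,
\mathbb{E}_{\hat{x}}\!\left[\left(\left\lVert \nabla_{\hat{x}} D(\hat{x}) \right\rVert_{2} - 1\right)^{2}\right]
```

In a dual-stream setup, a term of this kind would be added to the loss of each discriminator separately, with $\hat{x}$ drawn from the inputs of that stream.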
[0141] Training scale and hardware: 4096 parallel environments were set up for sampling, with each batch containing 131,072 samples. Training was performed on an NVIDIA RTX 4070 Ti graphics card for approximately 1000 iterations, with a total training time of approximately 30 minutes.
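The stated figures imply a rollout length and a throughput that are easy to check. The short calculation below assumes one batch is collected per iteration, which the text does not state explicitly.

```python
num_envs = 4096
batch_size = 131_072
iterations = 1000          # approximate, per the text
train_seconds = 30 * 60    # approximate, per the text

# Simulation steps each environment contributes to one batch.
steps_per_env = batch_size // num_envs
# Total samples, assuming one batch per iteration (an assumption).
total_samples = batch_size * iterations
samples_per_second = total_samples / train_seconds

print(steps_per_env, total_samples, round(samples_per_second))
```

Under that assumption each environment contributes 32 steps per batch, and the run processes about 131 million samples in total.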
[0142] Example 7
[0143] The technical effect of the present invention is verified below with specific examples. In the embodiments of the present invention, 50 action sequences (including walking, running, jumping, backflipping and other actions) generated by MDM and containing different levels of physical noise were selected for comparative testing.
[0144] 1. Noise reduction effect evaluation (sliding elimination)
[0145] The Skate metric (foot sliding distance, unit: cm/frame) was used as the evaluation index. Test results showed that the Skate value of the original MDM output was 3.52 cm/frame. The Skate value of the baseline method (original ADD) was 1.89 cm/frame, and the agent produced unnatural jitter while trying to fit the sliding steps. In contrast, the Skate value of the present method was reduced to 0.42 cm/frame. The results indicate that the uncertainty mechanism successfully identified and filtered out the foot sliding noise, enabling the agent to generate movements with solid foot contact.
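The text does not define the Skate metric precisely. One plausible reading, sketched below, is the mean horizontal displacement of a foot per frame while it is in ground contact. The contact test (foot height below a threshold in two consecutive frames), the threshold value, and the function name are all assumptions.

```python
import math


def skate_cm_per_frame(foot_positions, contact_height=0.05):
    """Mean horizontal foot displacement (cm/frame) over frames in ground contact.

    foot_positions: list of (x, y, z) in metres, z up, one entry per frame.
    A frame pair counts as in contact when both heights are below contact_height.
    """
    slides = []
    for (x0, y0, z0), (x1, y1, z1) in zip(foot_positions, foot_positions[1:]):
        if z0 < contact_height and z1 < contact_height:
            slides.append(math.hypot(x1 - x0, y1 - y0) * 100.0)
    return sum(slides) / len(slides) if slides else 0.0


# The foot slides 5 cm horizontally while on the ground, then lifts off.
trajectory = [(0.00, 0.00, 0.01), (0.03, 0.04, 0.01), (0.03, 0.04, 0.30)]
print(skate_cm_per_frame(trajectory))
```

Only the first frame pair counts here, because the foot is airborne in the last frame.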
[0146] 2. Intent preservation assessment (jump height): In the "continuous jumping in place" action test, the maximum height error of the root node is evaluated. Test results show that with the baseline method (without intent enhancement), the agent fails to leave the ground, with a height error greater than 30 cm. With the present method, the agent jumps successfully, with a height error of less than 5 cm, and the trajectory follows a gravitational parabola. The results indicate that the intent-aware enhancement mechanism effectively preserves sparse high-energy action intent.
[0147] 3. Training Stability Recording: The discriminator accuracy and reward standard deviation were recorded during the training process. Experimental data showed that the Reward Std of this method remained in a relatively high range of 0.4 to 0.6. This indicates abundant gradient signals and that no gradient vanishing or mode collapse occurred during training.
[0148] Example 8
[0149] A physical denoising device for generative actions includes a processor and a memory, wherein the memory stores program instructions, and the processor invokes the program instructions stored in the memory to cause the device to perform the following method steps in Embodiment 1:
[0150] A physical control architecture based on a residual strategy is constructed, consisting of a frozen physical base model and a learnable residual adapter; the total control action is calculated and input into the physics simulation engine to execute the simulation and obtain a new physics simulation state;
[0151] Construct a multi-task adversarial differential discrimination system to calculate the feature difference between the physical simulation state and the reference trajectory;
[0152] An intent-aware physical enhancement mechanism is introduced to construct a dynamic enhancement coefficient; the coefficient is used to perform weighted amplification of the root node feature stream, yielding the enhanced feature stream;
[0153] A Bayesian uncertainty weighting mechanism is introduced to construct an adaptive denoising function, and a root node discriminator and a pose discriminator are built accordingly; the enhanced feature stream is then input into the discriminator;
[0154] Based on the generated weighted adversarial reward signal, the residual adapter parameters are updated using the policy gradient algorithm until an action sequence is generated that removes physical artifacts and retains the original style.
[0155] The physical base model serves as a manifold constraint. It is pre-trained on a large-scale motion dataset containing physical noise and is used to output basic control signals that maintain the character's fundamental balance, preliminarily projecting the noisy input onto the physically safe boundary;
[0156] The residual adapter, acting as a denoising operator, is a neural network whose initial output approaches zero, used to learn small corrections in the action space based on the current state and the reference trajectory data;
[0157] The final output action is formed by superimposing the base signal and the residual signal.
[0158] Here, calculating the feature difference between the physical simulation state and the reference trajectory comprises:
[0159] Obtaining the simulation state at the current moment and the reference trajectory data, and calculating the feature difference vector; decoupling the feature difference vector into a low-dimensional root node feature stream and a high-dimensional pose feature stream.
[0160] Here, learnable homoscedastic uncertainty parameters are introduced into the discriminator, respectively representing the inherent noise variances of the root node task and the pose task; a multi-task joint loss function based on maximum likelihood estimation is constructed.
[0161] Here, in the current denoising / adaptation phase, the network parameters of the physical base model are frozen, and the model outputs the basic control signal.
[0162] Here, the total control action fuses the basic signal and the residual signal to obtain the final control action received by the physics engine:
[0163]
[0164] where the terms of the formula are, respectively: the fusion coefficient; the gradient truncation (stop-gradient) operator, used to prevent gradients from being propagated back to the physical base model; the basic control signal; and the residual action correction.
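Based on the terms listed above, the fusion could take the following shape. The symbol names and the placement of the fusion coefficient on the residual term are assumptions, since only the roles of the terms are described here.

```latex
a_t^{\mathrm{total}} = \operatorname{sg}\!\left(a_t^{\mathrm{base}}\right) + \alpha \,\Delta a_t
```

Here $\operatorname{sg}(\cdot)$ denotes the gradient truncation operator, $a_t^{\mathrm{base}}$ the basic control signal, $\alpha$ the fusion coefficient, and $\Delta a_t$ the residual action correction. In common deep learning frameworks this operator corresponds to calls such as `detach()` in PyTorch or `jax.lax.stop_gradient` in JAX, so that only the residual adapter receives gradients.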
[0165] It should be noted that the device descriptions in the above embodiments correspond to the method descriptions in the embodiments, and the embodiments of the present invention will not be repeated here.
[0166] The execution entities of the aforementioned processor and memory can be devices with computing functions such as computers, microcontrollers, and single-chip microcomputers. In specific implementations, the embodiments of the present invention do not limit the execution entities and can select them according to the needs of actual applications.
[0167] Data signals are transmitted between the memory and the processor via a bus, which will not be elaborated upon in this embodiment of the invention.
[0168] Based on the same inventive concept, embodiments of the present invention also provide a computer-readable storage medium, the storage medium including a stored program, which, when the program is running, controls the device where the storage medium is located to execute the method steps in the above embodiments.
[0169] The computer-readable storage medium includes, but is not limited to, flash memory, hard disk, solid-state drive, etc.
[0170] It should be noted that the description of the readable storage medium in the above embodiments corresponds to the description of the method in the embodiments, and the embodiments of the present invention will not be repeated here.
[0171] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. A computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the flow or function according to the embodiments of the present invention is generated.
[0172] The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. Computer instructions can be stored in or transmitted through a computer-readable storage medium. The computer-readable storage medium can be any usable medium accessible to a computer or a data storage device such as a server or data center that integrates one or more usable media. The usable medium can be a magnetic medium or a semiconductor medium, etc. Unless otherwise specified, the model numbers of the devices in this embodiment of the invention are not limited; any device capable of performing the above functions is acceptable.
[0173] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of a preferred embodiment, and the sequence numbers of the above embodiments of the present invention are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.
[0174] The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.
Claims
1. A physical denoising method for generative actions, characterized in that the method includes: constructing a physical control architecture based on a residual strategy, consisting of a frozen physical base model and a learnable residual adapter, calculating the total control action, and inputting the total control action into the physics simulation engine to execute the simulation and obtain a new physics simulation state; constructing a multi-task adversarial differential discrimination system to calculate the feature difference between the physical simulation state and the reference trajectory; introducing an intent-aware physical enhancement mechanism to construct a dynamic enhancement coefficient, and using the coefficient to perform weighted amplification of the root node feature stream to obtain the enhanced feature stream; introducing a Bayesian uncertainty weighting mechanism to construct an adaptive denoising function, building a root node discriminator and a pose discriminator accordingly, and inputting the enhanced feature stream into the discriminator; and, based on the generated weighted adversarial reward signal, updating the residual adapter parameters using the policy gradient algorithm until an action sequence is generated that removes physical artifacts and retains the original style.
2. The physical denoising method for generative actions according to claim 1, characterized in that the physical base model, acting as a manifold constraint, is pre-trained on a large-scale motion dataset containing physical noise and is used to output basic control signals that maintain the character's fundamental balance, preliminarily projecting the noisy input onto the physically safe boundary; the residual adapter, acting as a denoising operator, is a neural network whose initial output approaches zero, used to learn small corrections in the action space based on the current state and the reference trajectory data; and the final output action is formed by superimposing the base signal and the residual signal.
3. The physical denoising method for generative actions according to claim 1, characterized in that calculating the feature difference between the physical simulation state and the reference trajectory comprises: obtaining the simulation state at the current moment and the reference trajectory data, and calculating the feature difference vector; and decoupling the feature difference vector into a low-dimensional root node feature stream and a high-dimensional pose feature stream.
4. The physical denoising method for generative actions according to claim 1, characterized in that learnable homoscedastic uncertainty parameters are introduced into the discriminator, respectively representing the inherent noise variances of the root node task and the pose task; and a multi-task joint loss function based on maximum likelihood estimation is constructed.
5. The physical denoising method for generative actions according to claim 1, characterized in that, during the current denoising / adaptation phase, the network parameters of the physical base model are frozen, and the model outputs the basic control signal.
6. The physical denoising method for generative actions according to claim 1, characterized in that the total control action fuses the basic signal and the residual signal to obtain the final control action received by the physics engine, where the terms of the fusion formula are, respectively: the fusion coefficient; the gradient truncation operator, used to prevent gradients from being propagated back to the physical base model; the basic control signal; and the residual action correction.
7. A physical denoising device for generative actions, characterized in that the device includes a processor and a memory, the memory storing program instructions, the processor invoking the program instructions stored in the memory to cause the device to perform the method according to any one of claims 1-6.
8. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program, the computer program including program instructions that, when executed by a processor, cause the processor to perform the method described in any one of claims 1-6.