A robot VLA model optimization method and storage medium based on Mean Flows

By constructing an average velocity field model and using single-step mapping technology, the inference latency and motion jitter problems of existing VLA models are solved, achieving efficient and smooth robot motion generation, which is suitable for resource-constrained edge devices.

CN121920252B · Active · Publication Date: 2026-06-19 · WUXI UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: WUXI UNIV
Filing Date: 2026-03-27
Publication Date: 2026-06-19

AI Technical Summary

Technical Problem

Existing VLA models suffer from high inference latency, high resource consumption, and motion jitter caused by accumulated errors in the motion generation process due to multi-step inference, making it difficult to meet the real-time control requirements of robots.

Method used

An optimization method based on Mean Flows is adopted, which replaces the instantaneous velocity field by constructing an average velocity field model, derives the single-step mapping using the core identity of Mean Flows, and combines a two-stage training strategy and edge optimization technology to achieve single-step generation from noise to action sequence.

Benefits of technology

Edge-side inference latency is reduced to 28ms, the task success rate reaches 95.8%, and the trajectory jitter index falls to 0.09, which improves motion smoothness and meets real-time control requirements.

✦ Generated by Eureka AI based on patent content.


Abstract

This application discloses a robot VLA model optimization method and storage medium based on Mean Flows. To address the high latency, high resource consumption, and motion jitter caused by multi-step inference in existing VLA models, the method acquires multimodal training data and constructs an average velocity field model that satisfies the Mean Flows definition. It derives a single-step mapping from the core identity of Mean Flows and uses a Mean Flows Transformer diffusion model architecture to generate actions in a single step. The model is optimized with a two-stage training strategy and combined with edge-side optimization techniques during the inference stage, which significantly reduces edge-side inference latency and the trajectory jitter index. This invention achieves high-quality single-step action generation, significantly reduces latency and resource consumption, and provides a feasible solution for real-time deployment of VLA models on edge devices.

Description

Technical Field

[0001] This invention relates to the field of robot learning and embodied intelligence technology, specifically to a robot VLA model optimization method and storage medium based on Mean Flows.

Background Technology

[0002] With the rapid development of embodied intelligence technology, the Vision-Language-Action (VLA) model has become a core technology for general-purpose intelligent robot control. Typical VLA models, such as Pi0.5 and NVIDIA's Project GROOT, enable robots to perform complex tasks based on multimodal instructions by integrating visual perception, natural language understanding, and motion generation capabilities.

[0003] In the motion generation stage, existing VLA models generally employ diffusion-based strategies (Diffusion Policy) or flow matching techniques to construct the Transformer Diffusion Model (DiT) architecture. Diffusion Policy models robot motion generation as a conditional denoising diffusion process, gradually recovering a reasonable motion sequence from Gaussian noise through multiple iterations (typically 20-50 steps). The Flow Matching method, on the other hand, learns the continuous trajectory between the data distribution and the noise distribution, and still requires multi-step integration using an Ordinary Differential Equation (ODE) solver during inference to generate the final motion.

[0004] However, the aforementioned multi-step inference mechanism faces significant challenges when deployed on edge devices. First, multi-step iterations significantly increase computational latency; a typical 20-step inference process takes 200-500ms on an edge computing device, which is insufficient to meet the real-time control requirements of robots (typically <50ms). Second, the intermediate state caching during multi-step inference consumes substantial memory bandwidth, exacerbating the resource bottleneck of edge devices. Furthermore, the accumulated errors introduced by multi-step sampling can lead to jitter in motion trajectories, affecting the smoothness and safety of robot operations.
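The latency figures above imply a tight step budget. The following is a minimal arithmetic sketch, assuming every inference step costs the same (an assumption, not a statement from the text); the function names are hypothetical.

```python
# Figures quoted in the text: a 20-step inference takes 200-500 ms,
# and real-time control typically needs less than 50 ms per inference.
STEPS = 20
TOTAL_MS = (200.0, 500.0)
BUDGET_MS = 50.0


def per_step_ms(total_ms: float, steps: int) -> float:
    """Average cost of one step, assuming all steps cost the same."""
    return total_ms / steps


def max_steps_within(budget_ms: float, step_ms: float) -> int:
    """Largest whole number of steps whose total cost stays strictly below the budget."""
    steps = int(budget_ms // step_ms)
    if steps * step_ms >= budget_ms:
        steps -= 1
    return max(steps, 0)


fast, slow = (per_step_ms(t, STEPS) for t in TOTAL_MS)
print(f"per-step cost: {fast:.0f}-{slow:.0f} ms")
print(
    f"steps that fit under {BUDGET_MS:.0f} ms: "
    f"{max_steps_within(BUDGET_MS, slow)}-{max_steps_within(BUDGET_MS, fast)}"
)
```

Under this equal-cost assumption only a handful of steps fit inside the budget, which is the motivation for single-step generation.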

[0005] While some studies have attempted to compress multi-step diffusion models into single-step generators using distillation techniques, these methods lack rigorous theoretical guarantees, often leading to a significant decrease in generation quality, especially in complex multimodal action distribution scenarios. Although Flow Matching theoretically supports single-step generation, its modeling method based on instantaneous velocity fields struggles to accurately capture long-range dependencies during single-step sampling, making it difficult to balance accuracy and diversity in action generation.

[0006] Therefore, there is an urgent need for a single-step action generation technology that combines theoretical rigor with engineering practicality, which can significantly reduce the inference latency of VLA models on edge devices while ensuring generation quality.

Summary of the Invention

[0007] To achieve the above objectives, this application provides a robot VLA model optimization method based on Mean Flows, the detailed technical solution of which is as follows:

[0008] A robot VLA model optimization method based on Mean Flows, characterized by the following steps:

[0009] Acquire multimodal training data for the robot, the training data including visual observation sequences, natural language commands, and corresponding motion trajectories;

[0010] Based on the training data, an average velocity field model is constructed. This model replaces the instantaneous velocity field in the Transformer diffusion model with an interval average velocity field and satisfies the Mean Flows definition:

[0011] $u(x_0, s, t) = \frac{1}{t - s}\int_{s}^{t} v(\phi_\tau(x_0), \tau)\,d\tau$;

[0012] where $x_0$ is the initial state; $s$ and $t$ are time parameters satisfying $0 \le s < t \le 1$; $u(x_0, s, t)$ is the average velocity field over the time interval $[s, t]$; $v$ is the instantaneous velocity field function in flow matching; $\tau$ is the integration variable; and $\phi_\tau(x_0)$ is the trajectory function of the flow field evolving from the initial state $x_0$ to time $\tau$;

[0013] A single-step action generation model is constructed based on the average velocity field model, the visual observation sequence, the natural language commands, and the robot state observations. This single-step action generation model adopts the Mean Flows Transformer diffusion model architecture, and its inference process satisfies single-step mapping.

[0014] $A = \epsilon + u(\epsilon, 0, 1 \mid f_v, f_l, o)$;

[0015] where $\epsilon$ is standard Gaussian noise; $f_v$ is the visual features obtained by encoding the visual observation sequence; $f_l$ is the language features obtained by encoding the natural language instructions; $o$ is the robot state observation; $u(\cdot \mid f_v, f_l, o)$ is the conditional average velocity field that takes $f_v$, $f_l$, and $o$ as input; and $A$ is the generated robot action sequence;

[0016] The single-step action generation model is optimized using a two-stage training strategy, which includes a pre-training stage based on a large-scale robot dataset and a fine-tuning stage for a specific task.

[0017] The optimized single-step action generation model is deployed on the edge device. During the inference phase, the robot action sequence is generated through single-step forward propagation based on real-time visual observations and language commands.

[0018] The generated action sequence is sent to the robot actuator to control the robot to complete the specified task.

[0019] In some embodiments, constructing the average velocity field model further includes:

[0020] Based on the definition of Mean Flows, derive the core identity of Mean Flows:

[0021] $\phi_t(x_0) = x_0 + t\,u(x_0, 0, t)$;

[0022] where $\phi_t(x_0)$ is the trajectory function of the flow field evolving from the initial state $x_0$ to time $t$; this identity shows that the displacement from the initial state to time $t$ can be represented directly by the average velocity field;

[0023] Taking $t = 1$, the core identity simplifies to the single-step mapping form:

[0024] $x_1 = x_0 + u(x_0, 0, 1)$;

[0025] where $x_0$ is standard Gaussian noise and $x_1$ is a sample from the true data distribution.

[0026] Based on the core identity, derive the explicit relationship between the average velocity field and the data distribution:

[0027] ;

[0028] where the left-hand side is the probability density function of the data distribution at time t; the expectation is taken over states sampled from the data distribution at time s; and δ denotes the Dirac delta function;

[0029] Differentiating both sides of the explicit relation with respect to t, and using the properties of the Dirac delta function, the general training objective of the average velocity field is derived:

[0030] ;

[0031] where the left-hand side is the general training loss function value; s and t are the sampling time points, each following the uniform distribution U[0,1] on the interval [0,1]; and the expectation term is a conditional expectation taken under the given condition;

[0032] To simplify training and focus on end-to-end single-step generation, s=0 and t=1 are fixed, and conditional information is incorporated into the average velocity field, resulting in a simplified training objective for the conditional average velocity field:

[0033] ;

[0034] where the left-hand side is the conditional training loss function value;

[0035] The conditional average velocity field is parameterized using the Transformer architecture.

[0036] In some embodiments, the Mean Flows Transformer diffusion model architecture includes:

[0037] The visual encoder uses a Vision Transformer to process multi-view image inputs and generate visual features $f_v$ of dimension $d_v$;

[0038] The language encoder uses a pre-trained language model to process natural language instructions and generate language features $f_l$ of dimension $d_l$;

[0039] The conditional fusion module fuses the visual features and the language features through a cross-attention mechanism to generate a multimodal conditional vector $c$ of dimension $d_c$;

[0040] The Mean Flows diffusion model takes the noisy action, the conditional vector $c$, and the robot state observation as input and, after processing through the Transformer backbone network, outputs the interval average velocity;

[0041] The action decoder adds the average velocity to the initial noise to generate the final action sequence $A \in \mathbb{R}^{T \times D}$, where $T$ is the length of the action sequence and $D$ is the single-step action dimension.

[0042] In some embodiments, the pre-training phase employs a conditional flow matching loss function to optimize the average velocity field:

[0043] ;

[0044] where the left-hand side is the pre-training loss function value;

[0045] and the average velocity field is implemented by a network with trainable parameters.

[0046] In some embodiments, the fine-tuning phase introduces a motion quality perception loss:

[0047] ;

[0048] where the left-hand side is the total loss function value of the fine-tuning phase, its terms include the task success rate loss and the action smoothness loss, and the weighting coefficients balance the contributions of the loss terms.

[0049] In some embodiments, the task success rate loss is estimated using reinforcement learning:

[0050] ;

[0051] where the indicator function takes the value 1 when the task succeeds and 0 otherwise, and the probability term is the probability that the model, with its current parameters, generates the action under the given conditions;

[0052] The smoothness loss of the action penalizes the second derivative of the action:

[0053] ;

[0054] where the action term denotes the action vector at time step t of the action sequence, and T is the total length of the action sequence.

[0055] In some embodiments, the inference phase further includes edge-side optimization of the Mean Flows Transformer diffusion model, wherein the edge-side optimization includes at least one of the following:

[0056] Operator fusion combines sequential operations into a single computational kernel to reduce kernel startup overhead and intermediate tensor memory allocation;

[0057] Quantization-aware training introduces pseudo-quantization operations during the fine-tuning stage to simulate numerical errors in integer inference. After training, the model weights and activation values are converted to INT8 format;

[0058] Hardware-aware pruning involves structurally pruning the number of attention heads and the hidden layer dimension of the multilayer perceptron in the Transformer diffusion model based on the computational characteristics of the target hardware.
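The pseudo-quantization step of quantization-aware training can be sketched as follows. This is a minimal illustration assuming symmetric per-tensor quantization with a scale derived from the largest absolute value; the text does not specify the scheme, and `fake_quantize_int8` is a hypothetical helper name.

```python
def fake_quantize_int8(values):
    """Quantize to INT8 levels and immediately dequantize (pseudo-quantization).

    Assumption: symmetric per-tensor scaling, scale = max|x| / 127.
    Returns the dequantized values and the scale that was used.
    """
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values), 1.0
    scale = max_abs / 127.0
    dequantized = []
    for v in values:
        q = max(-127, min(127, round(v / scale)))  # integer level in [-127, 127]
        dequantized.append(q * scale)
    return dequantized, scale


weights = [0.5, -1.27, 0.003, 1.0]
approx, scale = fake_quantize_int8(weights)
worst = max(abs(a - w) for a, w in zip(approx, weights))
print(f"scale={scale:.4f}, worst rounding error={worst:.5f}")
```

During fine-tuning the forward pass would see the dequantized values, so the model learns to tolerate a rounding error of at most half a quantization step.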

[0059] In some embodiments, after generating the action sequence, the method further includes a post-processing step of the action sequence: applying a Savitzky-Golay filter to smooth the action sequence in order to suppress high-frequency jitter.
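A Savitzky-Golay post-processing step could look like the sketch below. The window length (5) and polynomial order (2) are assumptions, since the text does not give them; the coefficients (-3, 12, 17, 12, -3)/35 are the standard ones for that setting. Leaving the first and last two steps unchanged is also a simplification chosen here.

```python
SG_COEFFS = (-3.0, 12.0, 17.0, 12.0, -3.0)  # 5-point window, quadratic fit
SG_NORM = 35.0


def savgol_smooth(actions):
    """Smooth each action dimension over time with a 5-point Savitzky-Golay filter.

    `actions` is a list of T action vectors (lists of D floats).
    The first and last two time steps are copied unchanged.
    """
    smoothed = [list(step) for step in actions]
    for t in range(2, len(actions) - 2):
        for d in range(len(actions[t])):
            acc = 0.0
            for k, coeff in enumerate(SG_COEFFS):
                acc += coeff * actions[t + k - 2][d]
            smoothed[t][d] = acc / SG_NORM
    return smoothed


# A single-sample spike in an otherwise flat one-dimensional trajectory.
spiky = [[0.0], [0.0], [0.0], [1.0], [0.0], [0.0], [0.0]]
print(savgol_smooth(spiky)[3][0] < 1.0)
```

Because the filter fits a local quadratic, smooth trajectories pass through almost unchanged while isolated spikes are attenuated.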

[0060] In some embodiments, the acquisition of robot multimodal training data further includes a data augmentation step: randomly cropping and color perturbing the images in the visual observation sequence, performing synonym substitution and sentence transformation on the natural language instructions, and applying small-amplitude Gaussian noise to the motion trajectory while maintaining kinematic constraints.

[0061] This application also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of any of the methods described above.

[0062] This application addresses the problems of high inference latency (200-500ms), high resource consumption, and motion jitter caused by accumulated errors in existing VLA models that employ multi-step inference (such as diffusion strategies and flow matching) during the motion generation stage. It proposes an optimization method based on Mean Flows. By introducing an average velocity field model to replace the instantaneous velocity field and deriving a single-step mapping using the core identities of Mean Flows, single-step generation from noise to motion sequences is achieved.

[0063] This application theoretically guarantees the quality equivalence of single-step generation, avoiding the performance loss caused by distillation. By combining a two-stage training strategy and edge optimization (operator fusion, quantization, and pruning), the edge inference latency is reduced to 28ms (more than 10 times better than the 20-step diffusion strategy), while maintaining a task success rate of up to 95.8% and reducing the trajectory jitter index to 0.09. This significantly improves motion smoothness and resource efficiency, providing a practical technical solution for the real-time deployment of VLA models on resource-constrained edge devices.

Attached Figure Description

[0064] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the accompanying drawings used in the embodiments will be briefly described below. The textual descriptions of these drawings are as follows, and actual drawing can be based on these descriptions:

[0065] Figure 1 is a flowchart illustrating the robot VLA model optimization method based on Mean Flows provided by this invention;

[0066] Figure 2 is a schematic diagram of the system composition of the Mean Flows DiT architecture provided by this invention;

[0067] Figure 3 is a flowchart illustrating the two-stage training strategy provided by this invention.

Detailed Implementation

[0068] It should be noted that the following detailed descriptions are exemplary and intended to provide indicative explanations of the content of this application. It should be noted that all technical and scientific terms used in this application have the same meaning as commonly understood by a person skilled in the art to which this application pertains.

[0069] The system architecture and prior art solutions in the embodiments of this application will be clearly and completely described below with reference to the accompanying drawings. It should be noted that the described embodiments are only for explanation and illustration of this application, and not the entirety of the content. All other embodiments obtained by those skilled in the art based on the embodiments provided in this application without creative effort are within the protection scope of this application.

[0070] Example 1:

[0071] As shown in Figure 1, this invention provides a robot VLA model optimization method based on Mean Flows, including the following steps:

[0072] S1. Obtain robot multimodal training data, the training data including visual observation sequences, natural language commands, robot state observations and corresponding motion trajectories.

[0073] The training data comes from various robot operation scenarios, including but not limited to: desktop operations (grasping, placing, and pushing objects), mobile operations (combined navigation and operation tasks), and dual-arm collaborative operations.

[0074] For example, each training sample contains a triple (I, L, A), where I is a sequence of K image views, preferably K=4, including forward-looking, backward-looking, and two wrist views; L is a natural language instruction text, such as "pick up the red cube and put it in the blue container"; and A is the corresponding action trajectory, whose element at time step t represents the joint position or end-effector pose, with D denoting the action dimension. For a 6-DOF robotic arm, D is usually 7: a 6-dimensional pose plus a 1-dimensional gripper opening and closing value.

[0075] To enhance data diversity, this embodiment employs data augmentation strategies: random cropping and color perturbation of images; synonym substitution and sentence transformation of language commands; and application of small-amplitude Gaussian noise to motion trajectories while maintaining kinematic constraints. The augmented dataset can be 3-5 times larger than the original data, effectively improving the model's generalization ability.

[0076] S2. Construct an average velocity field model based on the training data.

[0077] Traditional flow matching methods model an instantaneous velocity field $v(x, t)$ that satisfies the ordinary differential equation $\frac{d\phi_t(x_0)}{dt} = v(\phi_t(x_0), t)$, where $\phi_t(x_0)$ is the trajectory that evolves from the initial state $x_0$ to time $t$. Sampling requires numerical integration of this ODE, so accuracy is difficult to guarantee when only a single step is taken.

[0078] This embodiment introduces an interval-averaged velocity field to replace the instantaneous velocity field in the Transformer diffusion model, and satisfies the Mean Flows definition:

[0079] $u(x_0, s, t) = \frac{1}{t - s}\int_{s}^{t} v(\phi_\tau(x_0), \tau)\,d\tau$;

[0080] where $x_0$ is the initial state; $s$ and $t$ are time parameters satisfying $0 \le s < t \le 1$; $u(x_0, s, t)$ is the average velocity field over the time interval $[s, t]$; $v$ is the instantaneous velocity field function in flow matching; $\tau$ is the integration variable; and $\phi_\tau(x_0)$ is the trajectory function of the flow field evolving from the initial state $x_0$ to time $\tau$.
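The interval average velocity can be checked numerically on a toy flow. The sketch below assumes a one-dimensional velocity field dx/dt = -x, chosen only because its trajectory is known in closed form; it is not the learned field of this application. It integrates the instantaneous velocity along the trajectory over [s, t], divides by (t - s), and confirms that (t - s) times the result equals the displacement over the interval.

```python
import math


def v(x: float, tau: float) -> float:
    """Toy instantaneous velocity field (assumption): dx/dt = -x."""
    return -x


def phi(x0: float, t: float) -> float:
    """Closed-form trajectory of the toy flow started at x0."""
    return x0 * math.exp(-t)


def average_velocity(x0: float, s: float, t: float, n: int = 10_000) -> float:
    """Midpoint-rule estimate of the time average of v along the trajectory on [s, t]."""
    h = (t - s) / n
    total = 0.0
    for k in range(n):
        tau = s + (k + 0.5) * h
        total += v(phi(x0, tau), tau)
    return total * h / (t - s)


x0, s, t = 2.0, 0.2, 0.9
u = average_velocity(x0, s, t)
displacement = phi(x0, t) - phi(x0, s)
print(abs((t - s) * u - displacement) < 1e-6)
```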

[0081] S3. Construct a single-step action generation model based on the average velocity field model, visual observation sequence, and natural language instructions.

[0082] As shown in Figure 2, the single-step action generation model in this embodiment adopts the Mean Flows Transformer diffusion model architecture, and its inference process satisfies single-step mapping:

[0083] $A = \epsilon + u(\epsilon, 0, 1 \mid f_v, f_l, o)$;

[0084] where $\epsilon$ is noise sampled from a standard Gaussian distribution, satisfying $\epsilon \sim \mathcal{N}(0, I)$, with $\mathcal{N}(0, I)$ denoting a Gaussian distribution with mean 0 and identity covariance matrix $I$; $f_v$ is the visual features obtained by encoding the visual observation sequence; $f_l$ is the language features obtained by encoding the natural language instructions; $o$ is the robot state observation; $u(\cdot \mid f_v, f_l, o)$ is the conditional average velocity field that takes $f_v$, $f_l$, and $o$ as input; and $A$ is the generated robot action sequence.

[0085] S4. Optimize the single-step action generation model using a two-stage training strategy.

[0086] As shown in Figure 3, the two-stage training strategy includes a pre-training stage based on a large-scale robot dataset and a fine-tuning stage for a specific task.

[0087] S5. Deploy the optimized single-step action generation model on the edge device. During the inference phase, based on real-time acquired visual observations and language commands, generate robot action sequences through single-step forward propagation.

[0088] S6. Send the generated action sequence to the robot actuator to control the robot to complete the specified task.

[0089] This embodiment constructs a complete end-to-end deployment system: the optimized Mean Flows Transformer diffusion model is deployed on the robot's edge computing unit; visual sensors (RGB-D cameras) and language input (speech recognition or text interface) provide multimodal observations in real time; after the model generates an action sequence in a single step, the motion planner converts it into joint-level instructions that drive the actuator to complete the operation.

[0090] The system has a built-in quality monitoring module that records inference latency, action execution success rate, trajectory smoothness and other indicators in real time. When an abnormal performance is detected (such as three consecutive task failures), the data collection process is automatically triggered to store the observation-action pairs of the current scene into the playback buffer. The buffer data is periodically uploaded to the cloud for model iteration and optimization to form a continuous learning loop.
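The monitoring behaviour described above can be sketched as a small state machine. This is an illustrative outline only: the class name, the buffer size, and the decision to stop collecting after an upload are assumptions, while the three-consecutive-failures trigger comes from the text.

```python
from collections import deque


class QualityMonitor:
    """Tracks task outcomes and buffers observation-action pairs after repeated failures."""

    def __init__(self, failure_threshold: int = 3, buffer_size: int = 1000):
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.collecting = False
        self.latencies_ms = []
        self.replay_buffer = deque(maxlen=buffer_size)  # size is an assumption

    def record_task(self, success: bool, latency_ms: float) -> bool:
        """Log one task result; return True once data collection is active."""
        self.latencies_ms.append(latency_ms)
        self.consecutive_failures = 0 if success else self.consecutive_failures + 1
        if self.consecutive_failures >= self.failure_threshold:
            self.collecting = True
        return self.collecting

    def record_step(self, observation, action) -> None:
        """Store an observation-action pair, but only while collection is active."""
        if self.collecting:
            self.replay_buffer.append((observation, action))

    def drain_for_upload(self):
        """Return buffered pairs for cloud upload and reset the monitor."""
        batch = list(self.replay_buffer)
        self.replay_buffer.clear()
        self.collecting = False
        self.consecutive_failures = 0
        return batch


monitor = QualityMonitor()
for ok in (True, False, False, False):
    monitor.record_task(ok, latency_ms=28.0)
monitor.record_step({"camera": "frame-0"}, [0.0] * 7)
print(len(monitor.drain_for_upload()))
```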

[0091] This application addresses the problems of high inference latency (200-500ms), high resource consumption, and motion jitter caused by accumulated errors in existing VLA models that employ multi-step inference (such as diffusion strategies and flow matching) during the action generation stage. It proposes an optimization method based on Mean Flows. By introducing an average velocity field model to replace the instantaneous velocity field and deriving a single-step mapping using the core identities of Mean Flows, single-step generation from noise to action sequences is achieved. This fundamentally solves the problem of high inference latency on the edge device.

[0092] Example 2:

[0093] Based on Example 1, this example further defines the process of acquiring robot multimodal training data in step S1.

[0094] Optionally, the acquisition of robot multimodal training data further includes a data augmentation step: randomly cropping and color perturbing the images in the visual observation sequence, performing synonym substitution and sentence transformation on the natural language instructions, and applying small-amplitude Gaussian noise to the motion trajectory while maintaining kinematic constraints.

[0095] Specifically, image augmentation employs random cropping, with the cropped area being 80%-100% of the original image while maintaining the aspect ratio; color perturbation includes brightness adjustment (±20%), contrast adjustment (±20%), and saturation adjustment (±20%). Language augmentation uses synonym substitution, such as replacing "pick up" with "grab"; sentence transformation includes changing active sentences to passive sentences, such as changing "put the red square into the blue container" to "the red square is put into the blue container." The standard deviation of the Gaussian noise applied for motion augmentation is set to 5% of the motion amplitude, while inverse kinematics checks ensure that the augmented motion still meets joint constraint requirements.
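The augmentation parameters above can be sketched as sampling routines. The following is a minimal outline under stated assumptions: "motion amplitude" is interpreted as the largest absolute value per action dimension, and clamping to joint limits stands in for the inverse kinematics check, which is not implemented here. All function names are hypothetical.

```python
import random


def random_crop_box(width, height, rng, min_area=0.8):
    """Return (left, top, w, h) for a crop covering 80%-100% of the area at the same aspect ratio."""
    scale = rng.uniform(min_area, 1.0) ** 0.5  # equal scaling on both sides keeps the aspect ratio
    w, h = width * scale, height * scale
    return rng.uniform(0.0, width - w), rng.uniform(0.0, height - h), w, h


def color_jitter_factors(rng, max_delta=0.2):
    """Sample multiplicative factors within ±20% for brightness, contrast, and saturation."""
    names = ("brightness", "contrast", "saturation")
    return {name: 1.0 + rng.uniform(-max_delta, max_delta) for name in names}


def perturb_actions(actions, limits, rng, rel_std=0.05):
    """Add Gaussian noise (std = 5% of per-dimension amplitude), then clamp to joint limits.

    Clamping is a simple stand-in (assumption) for the inverse kinematics check.
    """
    amplitude = [max(abs(step[d]) for step in actions) for d in range(len(limits))]
    noisy = []
    for step in actions:
        row = []
        for d, (low, high) in enumerate(limits):
            value = step[d] + rng.gauss(0.0, rel_std * amplitude[d])
            row.append(min(max(value, low), high))
        noisy.append(row)
    return noisy


rng = random.Random(0)
left, top, w, h = random_crop_box(640, 480, rng)
print(0.8 <= (w * h) / (640 * 480) <= 1.0)
```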

[0096] By introducing diverse data augmentation strategies, the original dataset size was expanded by 3-5 times, effectively alleviating the problems of high data acquisition costs and annotation difficulties in the robotics field. Image augmentation improved the model's robustness to changes in lighting and viewpoint; language augmentation enabled the model to understand multiple expressions of the same instruction, enhancing the flexibility of human-computer interaction; and motion augmentation enriched the diversity of motion trajectories while maintaining physical constraints, allowing the model to better adapt to unseen scenes during the fine-tuning stage.

[0097] Example 3:

[0098] Based on Example 1, this example further refines the process of constructing the average velocity field model in step S2.

[0099] Optionally, the construction of the average velocity field model further includes:

[0100] Based on the definition of Mean Flows, derive the core identity of Mean Flows:

[0101] $\phi_t(x_0) = x_0 + t\,u(x_0, 0, t)$;

[0102] where $\phi_t(x_0)$ is the trajectory function of the flow field evolving from the initial state $x_0$ to time $t$; this identity shows that the displacement from the initial state to time $t$ can be represented directly by the average velocity field.

[0103] Taking $t = 1$, the core identity simplifies to the single-step mapping form:

[0104] $x_1 = x_0 + u(x_0, 0, 1)$;

[0105] where $x_0$ is standard Gaussian noise and $x_1$ is a sample from the actual data distribution.
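The difference between stepping with an average velocity and stepping with an instantaneous velocity can be illustrated on a toy flow. The sketch below assumes the one-dimensional field dx/dt = -x and uses its exact average velocity as a stand-in for a trained network, so it only illustrates the principle and says nothing about learned accuracy.

```python
import math


def v(x: float, t: float) -> float:
    """Toy instantaneous velocity field (assumption): dx/dt = -x."""
    return -x


def phi(x0: float, t: float) -> float:
    """Exact trajectory of the toy flow."""
    return x0 * math.exp(-t)


def u(x0: float, s: float, t: float) -> float:
    """Exact average velocity of the toy flow on [s, t]; stands in for a trained network."""
    return (phi(x0, t) - phi(x0, s)) / (t - s)


def euler(x0: float, steps: int) -> float:
    """Integrate the instantaneous field from time 0 to 1 with fixed-size Euler steps."""
    x, h = x0, 1.0 / steps
    for k in range(steps):
        x += h * v(x, k * h)
    return x


x0 = 2.0
target = phi(x0, 1.0)
err_mean_one_step = abs(x0 + u(x0, 0.0, 1.0) - target)
err_euler_one_step = abs(euler(x0, 1) - target)
err_euler_twenty = abs(euler(x0, 20) - target)
print(err_mean_one_step < err_euler_twenty < err_euler_one_step)
```

In this toy case a single step with the average velocity lands on the endpoint, while a single Euler step with the instantaneous velocity overshoots and 20 Euler steps still leave a residual error.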

[0106] Based on the core identity, derive the explicit relationship between the average velocity field and the data distribution:

[0107] ;

[0108] where the left-hand side is the probability density function of the data distribution at time t; the expectation is taken over states sampled from the data distribution at time s; and δ denotes the Dirac delta function. This equation shows that the data distribution at any time t can be obtained from the distribution at the initial time s through a linear transformation of the average velocity field.

[0109] Differentiating both sides of the explicit relation with respect to t, and using the properties of the Dirac delta function, the general training objective of the average velocity field is derived:

[0110] ;

[0111] where the left-hand side is the general training loss function value; s and t are the sampling time points, each following the uniform distribution U[0,1] on the interval [0,1]; and the expectation term is a conditional expectation taken under the given condition.

[0112] To simplify training and focus on end-to-end single-step generation, s=0 and t=1 are fixed, and conditional information is incorporated into the average velocity field, resulting in a simplified training objective for the conditional average velocity field:

[0113] ;

[0114] where the left-hand side is the conditional training loss function value.

[0115] The conditional average velocity field is parameterized using the Transformer architecture. Specifically, the noisy action is flattened into a token sequence and fed into the Transformer together with the conditional vector c (obtained by fusing the visual features and the language features). The Transformer models the dependencies between elements of the action sequence through a self-attention mechanism, injects conditional information through a cross-attention mechanism, and finally outputs the average velocity field.

[0116] Through complete mathematical derivation, starting from the definition of Mean Flows, the core identities, single-step mapping forms, distribution relationships, general training objectives, and simplified training objectives are gradually derived, forming a complete logical chain from theory to practice. This derivation process provides a rigorous theoretical guarantee for single-step generation, avoiding the problem of traditional distillation methods lacking theoretical basis. The Transformer architecture is used to parameterize the conditional average velocity field, fully utilizing the advantages of Transformer in sequence modeling and long-range dependency capture, enabling the model to effectively model temporal correlations in action sequences.

[0117] Example 4:

[0118] Based on Example 1, this example provides a detailed description of the specific components of the Mean Flows Transformer diffusion model architecture in step S3.

[0119] Specifically, as shown in Figure 2, the Mean Flows diffusion architecture mainly includes:

[0120] The visual encoder uses a Vision Transformer to process multi-view image input. First, each image is divided into 16×16 pixel patches, and a token sequence is obtained through linear projection. Then, features are extracted using a 12-layer Transformer encoder. Finally, multi-view information is fused through cross-view attention to output the visual features.

[0121] The language encoder uses a frozen pre-trained BERT-based model to process natural language instructions. The input text is tokenized with WordPiece and then fed into BERT; the output of the [CLS] token is used as the language features. To reduce computational overhead, the dimension is reduced to 256 using linear projection.

[0122] The conditional fusion module fuses visual and language features through a cross-attention mechanism to generate the multimodal conditional vector c, which serves as the conditional input for DiT.

[0123] The Mean Flows Transformer diffusion model is a modification of the standard DiT architecture. The input is the noisy action, with a preferred action sequence length of T=16 and an action dimension of D=7, which is converted into a token sequence via patch embedding. The core innovation lies in replacing the noise or velocity prediction output by DiT with an average velocity prediction. The DiT backbone network consists of 16 Transformer blocks, each integrating an AdaLN conditional normalization mechanism to inject the conditional vector c into the attention and MLP modules. The final output is an average velocity field with the same dimension as the input.

[0124] The action decoder performs the single-step mapping, adding the average velocity to the initial noise to generate the final action sequence.
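The component sizes listed above can be collected into a small configuration sketch. The input image resolution is not given in the text, so the 224×224 value below is an assumption used only to show how the patch count follows from the 16×16 patch size; the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MeanFlowsDiTConfig:
    patch_size: int = 16     # 16x16 pixel patches (from the text)
    vit_layers: int = 12     # visual encoder depth (from the text)
    language_dim: int = 256  # projected language feature size (from the text)
    dit_blocks: int = 16     # DiT backbone depth (from the text)
    horizon: int = 16        # action sequence length T (from the text)
    action_dim: int = 7      # action dimension D (from the text)
    num_views: int = 4       # K image views (from Example 1)
    image_size: int = 224    # assumption: resolution is not stated in the text

    def patches_per_view(self) -> int:
        return (self.image_size // self.patch_size) ** 2

    def visual_tokens(self) -> int:
        return self.num_views * self.patches_per_view()


def decode_action(noise, mean_velocity):
    """Action decoder: add the predicted average velocity to the initial noise, element-wise."""
    return [[n + u for n, u in zip(n_row, u_row)] for n_row, u_row in zip(noise, mean_velocity)]


cfg = MeanFlowsDiTConfig()
noise = [[0.5] * cfg.action_dim for _ in range(cfg.horizon)]
velocity = [[-0.5] * cfg.action_dim for _ in range(cfg.horizon)]
actions = decode_action(noise, velocity)
print(cfg.visual_tokens(), len(actions), len(actions[0]))
```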

[0125] Through a meticulously designed Mean Flows Transformer diffusion model architecture, effective fusion of multimodal information and high-quality action generation are achieved. The visual encoder employs the ViT architecture, capable of extracting rich spatial features from multi-view images; the language encoder leverages the semantic understanding capabilities of pre-trained BERT to accurately parse natural language instructions; and the cross-attention fusion module enables visual features to guide language understanding, generating more task-specific multimodal conditional vectors. Replacing the DiT output with average velocity prediction allows the model to directly learn the single-step mapping from noise to action, laying the foundation for subsequent single-step inference.

[0126] Example 5:

[0127] The pre-training phase of this embodiment uses the simplified training objective of the conditional average velocity field from Example 3 as the pre-training loss function.

[0128] Specifically, the pre-training phase uses a conditional flow matching loss function to optimize the average velocity field:

[0129] ;

[0130] where the left-hand side is the pre-training loss function value; t is the sampling time point, which follows a uniform distribution over its sampling interval; the real action sequence is sampled from the actual action distribution; the norm is the Euclidean norm; and the average velocity field is implemented by a network with trainable parameters.

[0131] The physical meaning of this loss function is to bring the average velocity predicted by the model as close as possible to the real action residual, that is, the displacement from the noise to the actual action.

[0132] During pre-training, a curriculum learning strategy is adopted: initially, a short action sequence length T=4 is used so that the model first learns to generate short-horizon actions; as training progresses, T is gradually increased to 8, 12, and 16, allowing the model to learn long-horizon dependencies step by step. At the same time, augmentation of the visual input is introduced progressively, moving from the original images alone to augmentation operations such as random cropping and color perturbation.
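The sequence-length curriculum can be expressed as a simple schedule. The sketch below assumes four stages of equal length, which the text does not specify; only the lengths 4, 8, 12, and 16 come from the description, and `horizon_for_step` is a hypothetical helper.

```python
HORIZONS = (4, 8, 12, 16)  # action sequence lengths named in the text


def horizon_for_step(step: int, total_steps: int, horizons=HORIZONS) -> int:
    """Return the action sequence length T for a training step.

    Assumption: training is split into equally long stages, one per length.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    stage = step * len(horizons) // total_steps
    stage = min(max(stage, 0), len(horizons) - 1)
    return horizons[stage]


print([horizon_for_step(s, 1000) for s in (0, 250, 500, 750, 999)])
```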

[0133] Example 6:

[0134] Based on Example 5, this example further refines the fine-tuning stage in step S4.

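[0135] Optionally, the fine-tuning phase introduces a motion quality perception loss:

[0136] $$\mathcal{L}_{\text{finetune}} = \lambda_1 \mathcal{L}_{\text{success}} + \lambda_2 \mathcal{L}_{\text{smooth}};$$

[0137] where $\mathcal{L}_{\text{finetune}}$ is the total loss function value during the fine-tuning phase, $\mathcal{L}_{\text{success}}$ is the task success rate loss, $\mathcal{L}_{\text{smooth}}$ is the motion smoothness loss, and $\lambda_1$ and $\lambda_2$ are weighting coefficients that balance the contributions of the loss terms.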

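[0138] Specifically, the task success rate loss is estimated using reinforcement learning:

[0139] $$\mathcal{L}_{\text{success}} = -\mathbb{E}_{a \sim p_\theta}\left[ \mathbb{1}(\text{success}) \, \log p_\theta(a \mid c) \right];$$

[0140] where $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 when the task is successful and 0 otherwise, and $p_\theta(a \mid c)$ is the probability that the model with parameters $\theta$ generates the action $a$ under the given conditions $c$.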

[0141] The motion smoothness loss penalizes the second derivative of the action:

[0142] $$\mathcal{L}_{\text{smooth}} = \sum_{t=2}^{T-1} \left\| a_{t+1} - 2a_t + a_{t-1} \right\|^2;$$

[0143] where $a_t$ is the action vector of the action sequence $a$ at time step $t$, and $T$ is the total length of the action sequence.
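A second-derivative penalty of this kind is commonly computed with a discrete second difference. The sketch below assumes an unnormalized sum of squared second differences over a list of action vectors; whether the patent divides by the sequence length is not recoverable from the text, and the function names are illustrative.

```python
def second_differences(actions):
    """Discrete second derivative a[t+1] - 2*a[t] + a[t-1] at each interior time step."""
    return [
        [nxt - 2.0 * cur + prev
         for prev, cur, nxt in zip(actions[t - 1], actions[t], actions[t + 1])]
        for t in range(1, len(actions) - 1)
    ]

def smoothness_loss(actions):
    """Sum of squared second differences over all interior steps and action dimensions."""
    return sum(v * v for row in second_differences(actions) for v in row)

# A constant-velocity trajectory has zero second difference; a zigzag is penalized.
straight = [[0.0, 0.0], [1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
zigzag = [[0.0], [1.0], [0.0], [1.0]]
print(smoothness_loss(straight), smoothness_loss(zigzag))
```

Because a straight-line trajectory incurs no penalty, the term discourages jitter without discouraging motion itself.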

[0144] The default hyperparameters are set to: A small learning rate is used during fine-tuning. An early stopping strategy is used to prevent overfitting. At the same time, the low-level parameters of the visual encoder (such as the first 6 layers) are frozen, and only the high-level parameters (such as the last 6 layers) and the DiT part are fine-tuned to improve sample efficiency.

[0145] By introducing multi-task fine-tuning loss, the model can be finely optimized for the target task while maintaining pre-training knowledge. The task success rate loss introduces the feedback signal of reinforcement learning into the training process, enabling the model to improve itself based on the task execution results; the motion smoothness loss effectively suppresses high-frequency jitter in the motion trajectory, improving the safety and stability of robot motion.

[0146] Example 7:

[0147] Based on Example 1, this example further refines the edge-side optimization in step S5.

[0148] Optionally, the edge-side inference further includes edge-side optimization of the Mean Flows Transformer diffusion model, wherein the edge-side optimization includes at least one of the following:

[0149] Operator fusion: Sequential operations in the Transformer diffusion model, such as LayerNorm, Linear, and GELU, are fused into a single CUDA kernel. This reduces the overhead of multiple kernel launches and avoids memory allocation for intermediate tensors.

[0150] Quantization-aware training: Pseudo-quantization is introduced during the fine-tuning phase to simulate the numerical errors of INT8 inference. Specifically, pseudo-quantization nodes are inserted in the forward pass to quantize the weights and activation values to INT8 and then dequantize them back to FP32 for computation, while the gradients are used to update the FP32 weights. After training, the model weights are converted to INT8 format, and INT8 inference is used during deployment.

[0151] Hardware-aware pruning: Structured pruning is performed based on the computational characteristics of the target hardware. For example, NVIDIA Tensor Cores are optimized for channel counts that are multiples of 4, so the number of attention heads is pruned to a multiple of 4, and the hidden dimension of the MLP is pruned to a multiple of 4, such as 256, 512, or 1024. The pruning ratio is kept within 20%, and accuracy is restored through retraining.

[0152] Through multi-level edge optimization, the model can run efficiently on resource-constrained edge devices.
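The pseudo-quantization step described for quantization-aware training (quantize to INT8, then dequantize back to floating point) can be sketched as follows. This is a minimal illustration that assumes symmetric per-tensor scaling over a flat list of values; the patent does not state the quantization scheme, and the gradient handling during training is not shown.

```python
def fake_quantize(values, num_bits=8):
    """Quantize values to signed integer codes and dequantize them back to floats.

    Assumptions: symmetric quantization with one scale for the whole tensor,
    chosen so that the largest magnitude maps to the largest integer code.
    Returns the dequantized values and the scale that was used.
    """
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for INT8
    max_abs = max((abs(v) for v in values), default=0.0)
    if max_abs == 0.0:
        return list(values), 1.0                        # nothing to quantize
    scale = max_abs / qmax
    dequantized = []
    for v in values:
        code = max(-qmax, min(qmax, round(v / scale)))  # integer code in [-127, 127]
        dequantized.append(code * scale)
    return dequantized, scale

weights = [0.731, -0.052, 0.0004, -1.27, 0.333]
approx, scale = fake_quantize(weights)
errors = [abs(a - w) for a, w in zip(approx, weights)]
print(scale, max(errors))  # the rounding error is bounded by half a quantization step
```

During fine-tuning, the forward pass would use `approx` in place of `weights`, so the loss already reflects the rounding error that INT8 inference introduces.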

[0153] Example 8:

[0154] Based on Example 1, this example further refines the post-processing steps after generating the action sequence in step S3.

[0155] Optionally, after generating the action sequence, the method further includes a post-processing step: applying a Savitzky-Golay filter to smooth the action sequence to suppress high-frequency jitter.

[0156] The Savitzky-Golay filter is a smoothing filter based on local polynomial fitting that filters out high-frequency noise while preserving the shape and characteristics of the signal. In this embodiment, Savitzky-Golay filtering is applied independently to each dimension of the generated action sequence, with a window size of 5 and a polynomial order of 2. The filtered action sequence is smoother and suitable for direct transmission to the robot actuator.
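For a window size of 5 and a polynomial order of 2, the Savitzky-Golay smoothing weights are the standard coefficients (-3, 12, 17, 12, -3)/35. The sketch below applies them to each action dimension independently. Leaving the first and last two samples unchanged is an edge-handling assumption, since the text does not say how boundaries are treated, and the function names are illustrative.

```python
# Standard Savitzky-Golay smoothing weights for window size 5, polynomial order 2.
SG_COEFFS = (-3 / 35, 12 / 35, 17 / 35, 12 / 35, -3 / 35)

def savgol_5_2(signal):
    """Smooth one action dimension; the two samples at each end are left unchanged."""
    out = list(signal)
    for i in range(2, len(signal) - 2):
        out[i] = sum(c * signal[i + k - 2] for k, c in enumerate(SG_COEFFS))
    return out

def smooth_actions(actions):
    """Apply the filter independently to each dimension of a T x D action sequence."""
    columns = [savgol_5_2(list(col)) for col in zip(*actions)]
    return [list(row) for row in zip(*columns)]

# A single-sample spike is attenuated, while a quadratic trend passes through unchanged.
spike = savgol_5_2([0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0])
quadratic = [float(t * t) for t in range(8)]
print(spike[3], savgol_5_2(quadratic))
```

With a 16-step, 7-dimensional action sequence this amounts to a few hundred multiply-adds, which is in line with the low overhead the embodiment attributes to the post-processing module.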

[0157] Savitzky-Golay filtering post-processing effectively suppresses high-frequency jitter that may be caused by single-step generation while preserving the original features of the action sequence. Compared with unfiltered actions, the trajectory jitter index is significantly reduced after filtering, while the task success rate remains unchanged. This lightweight post-processing module has extremely low computational overhead (<1ms) and can be used as a standard component for model inference, significantly improving the smoothness and safety of robot motion.

[0158] Example 9:

[0159] To fully verify the technical effects of the present invention, a systematic comparative experiment was conducted on the Franka Emika Panda robot platform in this embodiment.

[0160] 1. Experimental setup:

[0161] Dataset: The experiment used a custom dataset containing five complex operation tasks: "grasping and placing," "opening a door to retrieve an object," "stacking blocks," "pouring water," and "tool use." Each task contained 5000 successful demonstration trajectories, divided into training and test sets in a 7:3 ratio. Each trajectory included four-view images (224×224), natural language instructions, and a 16-step 7D action sequence.

[0162] Comparison models: The following representative models were selected for comparison: (1) Diffusion Policy (20-step inference), (2) Flow Matching (single step), and (3) the method of this invention (Mean Flows DiT).

[0163] Evaluation metrics: (1) edge-side inference latency (ms), (2) task success rate (%), and (3) trajectory jitter index (root mean square of the second derivative of the action).

[0164] The table below shows a performance comparison of each model on the Jetson Orin NX platform.

[0165] ;

[0166] Experimental results show that the present invention reduces the inference latency to 28ms while maintaining a task success rate comparable to that of multi-step Diffusion Policy (95.8% vs 96.2%), thus meeting the requirements of real-time control. The trajectory jitter index is significantly lower than that of the baseline method, proving that single-step generation effectively eliminates the cumulative error of multi-step sampling.

[0167] Example 10:

[0168] This embodiment also provides a computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the steps of the method described in any one of embodiments 1 to 9.

[0169] The method of this invention is solidified into a distributable and deployable software product in the form of a computer-readable storage medium. Those skilled in the art can implement the technical solution of this invention on any compatible computing device by loading the computer program on this storage medium, greatly facilitating the promotion and application of the technology.

[0170] The foregoing has provided a sufficiently detailed and specific description of this application. Those skilled in the art should understand that the descriptions in the embodiments are merely exemplary, and all changes made without departing from the true spirit and scope of this application should fall within the protection scope of this application. The scope of protection claimed in this application is defined by the claims, and not by the above descriptions in the embodiments.

Claims

1. A robot VLA model optimization method based on Mean Flows, characterized in that it includes the following steps: acquiring multimodal training data for the robot, the training data including visual observation sequences, robot state observations, natural language instructions, and corresponding motion trajectories; constructing, based on the training data, an average velocity field model that replaces the instantaneous velocity field in the Transformer diffusion model with an interval average velocity field and satisfies the Mean Flows definition $u(x_0, s, t) = \frac{1}{t - s} \int_s^t v(\phi_\tau(x_0), \tau)\, d\tau$; where $x_0$ is the initial state, $s$ and $t$ are time parameters satisfying $0 \le s < t \le 1$, $u(x_0, s, t)$ is the average velocity field over the time interval $[s, t]$, $v$ is the instantaneous velocity field function in flow matching, $\tau$ is the integration variable, and $\phi_\tau(x_0)$ is the trajectory function of the flow field evolving from the initial state $x_0$ to time $\tau$; constructing a single-step action generation model based on the average velocity field model, the visual observation sequence, and the natural language instructions, the single-step action generation model adopting the Mean Flows Transformer diffusion model architecture, with an inference process that satisfies the single-step mapping $a = \epsilon + u_\theta(\epsilon, 0, 1 \mid f_v, f_l, q)$; where $\epsilon$ is standard Gaussian noise, $f_v$ is the visual features obtained by encoding the visual observation sequence, $f_l$ is the language features obtained by encoding the natural language instructions, $q$ is the robot's state observation, $u_\theta$ is the conditional average velocity field with $f_v$, $f_l$, and $q$ as input, and $a$ is the generated robot action sequence; optimizing the single-step action generation model using a two-stage training strategy, which includes a pre-training stage based on a large-scale robot dataset and a fine-tuning stage for a specific task; deploying the optimized single-step action generation model on the edge device; and, during the inference phase, generating the robot action sequence through single-step forward propagation based on real-time acquired visual observations, robot state observations, and language instructions, and sending the generated action sequence to the robot actuator to control the robot to complete the specified task.

2. The robot VLA model optimization method based on Mean Flows as described in claim 1, characterized in that the construction of the average velocity field model also includes: deriving, based on the definition of Mean Flows, the core identity of Mean Flows $\phi_t(x_0) = x_0 + t \cdot u(x_0, 0, t)$; where $\phi_t(x_0)$ is the trajectory function of the flow field evolving from the initial state $x_0$ to time $t$, this identity showing that the displacement from the initial state to time $t$ can be directly represented by the average velocity field; when taking $t = 1$, the core identity simplifies to the single-step mapping form $x_1 = \epsilon + u(\epsilon, 0, 1)$; where $\epsilon \sim \mathcal{N}(0, I)$ is standard Gaussian noise and $x_1 \sim p_{\text{data}}$ follows the true data distribution; deriving, based on the core identity, the explicit relationship between the average velocity field and the data distribution: ; where the symbols denote, respectively, the probability density function of the data distribution at a given time, the data distribution at that time, a state sampled at that time, the expectation operation, and the Dirac delta function; differentiating both sides of the explicit relationship with respect to $t$ and using the properties of the Dirac delta function to derive the general training objective of the average velocity field: ; where $s$ and $t$ are the sampling time points, following the uniform distribution $U[0,1]$ on the interval $[0,1]$, and the regression target is a conditional expectation under the given condition; to simplify training and focus on end-to-end single-step generation, fixing $s=0$ and $t=1$ and incorporating conditional information into the average velocity field, resulting in the simplified training objective of the conditional average velocity field $\mathcal{L}_{\text{cond}} = \mathbb{E}_{x_1 \sim p_{\text{data}},\; \epsilon \sim \mathcal{N}(0, I)} \left\| u_\theta(\epsilon, 0, 1 \mid c) - (x_1 - \epsilon) \right\|^2$; where $\mathcal{L}_{\text{cond}}$ is the conditional training loss function value and $c$ denotes the conditional information; and performing parametric modeling of the conditional average velocity field $u_\theta$ using the Transformer architecture.

3. The robot VLA model optimization method based on Mean Flows as described in claim 1, characterized in that the Mean Flows Transformer diffusion model architecture includes: a visual encoder that uses a Vision Transformer to process multi-view image inputs and generate visual features $f_v \in \mathbb{R}^{d_v}$, where $d_v$ is the visual feature dimension; a language encoder that uses a pre-trained language model to process natural language instructions and generate language features $f_l \in \mathbb{R}^{d_l}$, where $d_l$ is the language feature dimension; a conditional fusion module that fuses the visual features $f_v$ with the language features $f_l$ through a cross-attention mechanism to generate a multimodal conditional vector $c \in \mathbb{R}^{d_c}$, where $d_c$ is the dimension of the conditional vector; a Mean Flows Transformer diffusion model that takes the noise action $\epsilon$, the robot state observation $q$, and the conditional vector $c$ as input and, through the Transformer backbone network, outputs the average velocity $u_\theta$ over the interval $[0, 1]$; and an action decoder that adds the average velocity to the initial noise to generate the final action sequence $a = \epsilon + u_\theta \in \mathbb{R}^{T \times d_a}$, where $T$ is the length of the action sequence and $d_a$ is the single-step action dimension.

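4. The robot VLA model optimization method based on Mean Flows as described in claim 2, characterized in that the pre-training phase employs a conditional flow matching loss function to optimize the average velocity field: $\mathcal{L}_{\text{pretrain}} = \mathbb{E}_{t \sim U[0,1],\; x_1 \sim p_{\text{data}},\; \epsilon \sim \mathcal{N}(0, I)} \left\| u_\theta - (x_1 - \epsilon) \right\|^2$; where $\mathcal{L}_{\text{pretrain}}$ is the value of the pre-training loss function, and $u_\theta$ is the average velocity field implemented with network parameters $\theta$.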

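5. The robot VLA model optimization method based on Mean Flows as described in claim 4, characterized in that the fine-tuning phase introduces an action quality perception loss: $\mathcal{L}_{\text{finetune}} = \lambda_1 \mathcal{L}_{\text{success}} + \lambda_2 \mathcal{L}_{\text{smooth}}$; where $\mathcal{L}_{\text{finetune}}$ is the total loss function value during the fine-tuning phase, $\mathcal{L}_{\text{success}}$ is the task success rate loss, $\mathcal{L}_{\text{smooth}}$ is the motion smoothness loss, and $\lambda_1$ and $\lambda_2$ are weighting coefficients that balance the contributions of the loss terms.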

6. The robot VLA model optimization method based on Mean Flows as described in claim 5, characterized in that the task success rate loss is estimated using reinforcement learning: $\mathcal{L}_{\text{success}} = -\mathbb{E}_{a \sim p_\theta}\left[ \mathbb{1}(\text{success}) \, \log p_\theta(a \mid c) \right]$; where $\mathbb{1}(\cdot)$ is an indicator function that takes the value 1 when the task is successful and 0 otherwise, and $p_\theta(a \mid c)$ is the probability that the model with parameters $\theta$ generates the action $a$ under the given conditions $c$; and the motion smoothness loss penalizes the second derivative of the action: $\mathcal{L}_{\text{smooth}} = \sum_{t=2}^{T-1} \left\| a_{t+1} - 2a_t + a_{t-1} \right\|^2$; where $a_t$ is the action vector of the action sequence $a$ at time step $t$, and $T$ is the total length of the action sequence.

7. The robot VLA model optimization method based on Mean Flows as described in claim 1, characterized in that the inference phase also includes edge-side optimization of the Mean Flows Transformer diffusion model, wherein the edge-side optimization includes at least one of the following: operator fusion, which combines sequential operations into a single computational kernel to reduce kernel launch overhead and intermediate tensor memory allocation; quantization-aware training, which introduces pseudo-quantization operations during the fine-tuning stage to simulate numerical errors in integer inference, the model weights and activation values being converted to INT8 format after training; and hardware-aware pruning, which structurally prunes the number of attention heads and the hidden layer dimension of the multilayer perceptron in the Transformer diffusion model based on the computational characteristics of the target hardware.

8. The robot VLA model optimization method based on Mean Flows as described in claim 1, characterized in that, after generating the action sequence, the method also includes a post-processing step: applying a Savitzky-Golay filter to smooth the action sequence in order to suppress high-frequency jitter.

9. The robot VLA model optimization method based on Mean Flows as described in claim 1, characterized in that the acquisition of robot multimodal training data also includes a data augmentation step: randomly cropping and color perturbing the images in the visual observation sequence, performing synonym substitution and sentence transformation on the natural language instructions, and applying small-amplitude Gaussian noise to the motion trajectory while maintaining kinematic constraints.

10. A computer-readable storage medium having a computer program stored thereon, characterized in that, when the computer program is executed by a processor, it implements the steps of the method according to any one of claims 1 to 9.