Robot action model training method, computer device and computer-readable storage medium

By constructing cross-ontology and cross-perspective models and combining them with end-to-end generative models, the problems of limited data scale and error accumulation in existing technologies are solved. This enables efficient mapping from third-person videos to robot first-person data, improving the training effect and accuracy of robot motion models.

CN122244595A · Pending · Publication Date: 2026-06-19 · ZHONGKE YUNGU TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
ZHONGKE YUNGU TECH
Filing Date
2026-04-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies rely on expensive real robot data when training robot motion models, which limits the data scale and lacks a system-cascaded cross-ontology transfer and cross-viewpoint conversion scheme. As a result, the converted data cannot simultaneously satisfy the requirements of motion executability and viewpoint availability, and there are problems of error accumulation and loss of semantic consistency.

Method used

We construct cross-ontology and cross-perspective models, and achieve end-to-end mapping from third-person human videos to robot first-person data through video-to-video diffusion model architecture and reconstruction generation model. We utilize a large amount of third-person operation data for efficient transfer, and train the generation model on this basis. We then fine-tune it by combining it with real robot data to form a complete technical framework.

Benefits of technology

It achieves a direct mapping from raw human videos to robot first-person data, breaking through the bottleneck of scarce real data, improving the training effect of robot motion models, ensuring a balance between data scale and quality, suppressing cascade errors, and improving the model's generalization ability and operational accuracy.

✦ Generated by Eureka AI based on patent content.

Abstract

This application discloses a robot motion model training method, a computer device, and a computer-readable storage medium. The robot motion model training method includes the following steps: constructing and training a cross-ontology model and a cross-viewpoint model; constructing and training a generative model based on the trained cross-ontology model and the trained cross-viewpoint model; generating a first video using the trained generative model; acquiring a second video and combining it with the first video to form a training set for training the motion model. Therefore, this application achieves the transformation from raw human video to robot first-person data by constructing and training a cross-ontology model and a cross-viewpoint model, and on this basis, training a third-person to first-person generative model; finally, the generated video data is mixed with real robot data to train a high-performance motion model, thereby utilizing the large amount of existing third-person operation data for efficient transfer of robot ontology, overcoming the bottleneck of scarce real data.

Description

Technical Field

[0001] This application belongs to the field of robot training technology, and in particular relates to a robot motion model training method, a computer device, and a computer-readable storage medium.

Background Technology

[0002] Current mainstream technologies primarily rely on two types of data sources for training a VLA (Vision-Language-Action) model. The first is real robot first-person perspective data collected via teleoperation devices. This data is highly precise and directly executable, but its acquisition cost is extremely high, making it difficult to scale. The second is publicly available human operation video data from the internet. This data is vast in scale and rich in scenarios, but suffers from two core gaps: an ontology gap (differences in kinematic structure between humans and robots) and a perspective gap (differences between the third-person acquisition perspective and the robot's first-person execution perspective). Existing methods typically employ motion retargeting techniques for ontology transfer or image generation techniques for perspective transformation, but lack a scheme that cascades the two and optimizes the end-to-end mapping. As a result, the transformed data fails to satisfy both action executability and perspective usability at once, and the cascaded process suffers from error accumulation. Therefore, how to utilize more real-world human data for training is a technical problem that those skilled in the art urgently need to solve.

[0003] The preceding description is intended to provide general background information and does not necessarily constitute prior art.

Summary of the Invention

[0004] The purpose of this application is to provide a robot motion model training method, computer device, and computer-readable storage medium that can effectively utilize existing real human operation videos to convert them into training videos for training motion models.

[0005] To achieve the above objectives: In a first aspect, embodiments of this application provide a robot motion model training method, comprising the following steps: constructing and training a cross-ontology model and a cross-view model; constructing and training a generative model based on the trained cross-ontology model and the trained cross-view model; generating a first video using the trained generative model; acquiring a second video and forming a training set with the first video to train the motion model.

[0006] In an optional embodiment of this application, constructing and training a cross-ontology model includes: acquiring first video data, including third-person human operation videos and robot operation videos; extracting and aligning skeletal key points from the third-person human operation videos and robot operation videos to establish a first training sample set; constructing an initial cross-ontology model, wherein the initial cross-ontology model adopts a video-to-video diffusion model architecture; inputting the first training sample set into the initial cross-ontology model for inference to obtain a predicted velocity field; obtaining the true velocity field of the first training sample set, and completing the initial cross-ontology model training by minimizing the loss between the predicted velocity field and the true velocity field.

[0007] In an optional embodiment of this application, constructing and training a cross-view model includes: acquiring second video data, including multi-view anchor point views and corresponding anchor point semantics; constructing an initial cross-view model, which includes a reconstruction sub-model and a generation sub-model; inputting the multi-view anchor point views as training samples into the reconstruction sub-model for processing, and outputting wrist pose and condition map; inputting the condition map and anchor point semantics as joint input into the generation sub-model for processing, and outputting wrist view video; constructing a loss function based on the difference between the wrist view video and the real wrist view video in the second video data, and iteratively optimizing the parameters of the reconstruction sub-model and the generation sub-model until the training conditions are met.

[0008] In an optional embodiment of this application, inputting the multi-view anchor point views as training samples into the reconstruction sub-model for processing and outputting the wrist pose and condition map includes: encoding the multi-view anchor point views into aggregated visual features through a visual encoder in the reconstruction sub-model; performing cross-attention interaction on the aggregated visual features through the wrist-head module to extract wrist camera pose parameters; recovering the wrist view projection from the multi-view anchor point views through a preset multi-view geometric reconstruction method; and reconstructing the condition map based on the wrist camera pose parameters and the wrist view projection.

[0009] In an optional embodiment of this application, inputting the condition map and anchor point semantics as joint input into the generation sub-model for processing and outputting the wrist view video includes: encoding the condition map into latent variables using a variational autoencoder; performing spatiotemporal alignment and fusion of the latent variables and anchor semantic embeddings, and concatenating them with preset global semantic features and text encodings to obtain semantic conditional embeddings; starting from preset noise, using the semantic conditional embeddings and latent variables as guiding conditions, and gradually reconstructing through a diffusion denoising process to obtain reconstructed latent variables; and sending the reconstructed latent variables into the decoder of the variational autoencoder to decode them into a wrist view video.

[0010] In an optional embodiment of this application, a generative model is constructed and trained based on a trained cross-ontology model and a trained cross-viewpoint model, including: acquiring third video data, which is a third-person human operation video; jointly constructing an initial generative model with the cross-ontology model and the cross-viewpoint model; inputting the third video data as training samples into the initial generative model and outputting a wrist-view video with pseudo-labels; calculating reconstruction loss and perceptual loss based on the pseudo-labels; calculating the intermediate layer feature alignment loss of the cross-ontology and cross-viewpoint models; calculating the semantic consistency loss based on the wrist-view video and the third video data; weighted summing the reconstruction loss, perceptual loss, intermediate layer feature alignment loss, and semantic consistency loss to obtain a total loss value; updating the parameters of the initial generative model through backpropagation with the total loss value as the optimization objective until the training conditions are met; and determining the initial generative model that meets the training conditions as the generative model and outputting it.
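The weighted summation of the four loss terms can be sketched as follows. This is a minimal illustration only: the weight values, the use of mean squared error for the reconstruction and feature alignment terms, and the use of cosine distance for the semantic consistency term are assumptions, not choices stated in the application, and the perceptual term is passed in as a plain number because its feature extractor is not specified here.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_distance(a, b):
    """1 - cosine similarity; assumed here as a stand-in semantic consistency measure."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def total_loss(terms, weights):
    """Weighted sum of named loss terms; raises KeyError if a weight is missing."""
    return sum(weights[name] * value for name, value in terms.items())

# Hypothetical weights and toy feature vectors, for illustration only.
weights = {"reconstruction": 1.0, "perceptual": 0.5,
           "feature_alignment": 0.1, "semantic_consistency": 0.2}
generated = [0.2, 0.4, 0.6]       # features of the generated wrist-view video
pseudo_label = [0.0, 0.5, 0.5]    # features of the two-stage pipeline output
terms = {
    "reconstruction": mse(generated, pseudo_label),
    "perceptual": 0.3,            # placeholder value from an unspecified feature extractor
    "feature_alignment": mse([1.0, 2.0], [1.5, 2.5]),
    "semantic_consistency": cosine_distance([1.0, 0.0], [0.8, 0.6]),
}
loss_value = total_loss(terms, weights)
```

The scalar `loss_value` is what would be backpropagated in an actual training framework; only the bookkeeping of the weighted sum is shown here.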

[0011] In an optional embodiment of this application, obtaining a second video and forming a training set with the first video to train an action model includes: labeling samples corresponding to the first video in the training set as first samples and samples corresponding to the second video as second samples; obtaining an initial action model and designing an action prediction loss function; pre-training the initial action model based on the action prediction loss function and the first samples; fine-tuning the pre-trained action model based on the action prediction loss function and the second samples after the pre-training converges until the initial action model meets the training conditions; and labeling the trained initial action model as the target action model and outputting it.

[0012] In an optional embodiment of this application, obtaining a second video and forming a training set with the first video to train the action model further includes: labeling the samples corresponding to the first video in the training set as first samples and the samples corresponding to the second video as second samples; obtaining an initial action model and designing an action prediction loss function; labeling the loss value calculated by the action prediction loss function and the first sample as a first loss value; labeling the loss value calculated by the action prediction loss function and the second sample as a second loss value; weighting and summing the first loss value and the second loss value as a total loss value, optimizing and training the initial action model based on the total loss value; and labeling the trained initial action model as a target action model and outputting it.
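The two training strategies described above (pre-training on first samples followed by fine-tuning on second samples, and joint training on a weighted sum of the two losses) can be contrasted on a toy problem. The sketch below is an assumption-laden illustration: the "action model" is a single scalar weight `w` in `y = w * x`, the loss is mean squared error, the first samples carry deliberately biased labels to mimic imperfect generated data, and the loss weights 0.2 and 0.8 are hypothetical.

```python
def mse_and_grad(w, samples):
    """Mean squared error of the toy model y = w * x and its gradient w.r.t. w."""
    n = len(samples)
    loss = sum((w * x - y) ** 2 for x, y in samples) / n
    grad = sum(2.0 * (w * x - y) * x for x, y in samples) / n
    return loss, grad

def pretrain_then_finetune(first, second, lr=0.1, steps=200):
    """Strategy 1: gradient descent on first samples, then on second samples."""
    w = 0.0
    for _ in range(steps):            # pre-training on generated data
        _, grad = mse_and_grad(w, first)
        w -= lr * grad
    for _ in range(steps):            # fine-tuning on real robot data
        _, grad = mse_and_grad(w, second)
        w -= lr * grad
    return w

def weighted_joint_training(first, second, w_first=0.2, w_second=0.8,
                            lr=0.1, steps=200):
    """Strategy 2: gradient descent on w_first * loss_1 + w_second * loss_2."""
    w = 0.0
    for _ in range(steps):
        _, grad_1 = mse_and_grad(w, first)
        _, grad_2 = mse_and_grad(w, second)
        w -= lr * (w_first * grad_1 + w_second * grad_2)
    return w

xs = [0.5, 1.0, 1.5]
first_samples = [(x, 2.2 * x) for x in xs]    # generated data, slightly biased labels
second_samples = [(x, 2.0 * x) for x in xs]   # real data, exact labels

w_sequential = pretrain_then_finetune(first_samples, second_samples)
w_joint = weighted_joint_training(first_samples, second_samples)
```

In this toy setting the sequential schedule ends at the minimizer of the real-data loss, while the joint objective settles at a weighted compromise between the two label sets, which is one way to see why the relative loss weights matter.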

[0013] In a second aspect, embodiments of this application provide a computer device, including: a processor and a memory storing a computer program, wherein when the processor runs the computer program, the steps of the above-described method are implemented.

[0014] In a third aspect, embodiments of this application provide a computer-readable storage medium storing a computer program that, when executed by a processor, implements the steps of the above-described method.

[0015] The embodiments of this application have the following beneficial effects: The robot motion model training method provided in this application includes the following steps: constructing and training a cross-ontology model and a cross-viewpoint model; constructing and training a generative model based on the trained cross-ontology model and the trained cross-viewpoint model; generating a first video using the trained generative model; acquiring a second video and combining it with the first video to form a training set for training the motion model. Therefore, this application can efficiently transfer the robot ontology using a large amount of existing third-person action data, overcoming the bottleneck of scarce real data. It constructs a complete technical framework of "two-stage transformation + end-to-end generation + hybrid training," which uses massive amounts of third-person human video data to construct and train cross-ontology and cross-viewpoint models, and then trains an end-to-end third-person to first-person generative model on this basis, achieving a direct mapping from the original human video to the robot's first-person data; finally, the generated large-scale data is mixed with a small amount of real robot data to train a high-performance motion model. Specifically, firstly, a cross-ontology model and a cross-viewpoint model are constructed and cascaded, and then an end-to-end third-person to first-person generative model is trained based on the two models, enabling it to learn direct mapping and forming a complete transformation link from the original data to the target data. The output of the two-stage pipeline is used as a pseudo-label, and a consistency constraint is introduced in the feature space to train the generative model to achieve direct prediction from the original input to the target output, effectively suppressing cascade errors. 
The massive amount of first-person data generated by the generative model is used for VLA model pre-training, and then high-precision data from real robots is used for fine-tuning, achieving a balance between data scale and data quality.

[0016] The above description is merely an overview of the technical solution of this application. In order to better understand the technical means of this application and to implement it according to the contents of the specification, and to make the above and other objects, features and advantages of this application more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings. It should be understood that the above general description and the following detailed description are merely exemplary and explanatory, and do not limit this application.

Attached Figure Description

[0017] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0018] Figure 1 is a flowchart illustrating a robot motion model training method provided in one embodiment.

[0019] Figure 2 is a schematic block diagram of the structure of a computer device provided in one embodiment.

Detailed Implementation

[0020] Exemplary embodiments will now be described in detail, examples of which are illustrated in the accompanying drawings. In the following description relating to the drawings, unless otherwise indicated, the same numerals in different drawings denote the same or similar elements.

[0021] It should be noted that, in this document, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes that element. Furthermore, components, features, and elements with the same names in different embodiments of this application may have the same meaning or different meanings, the specific meaning of which must be determined by its interpretation in that specific embodiment or further in conjunction with the context of that specific embodiment.

[0022] It should be understood that although the terms first, second, third, etc., may be used herein to describe various information, such information should not be limited to these terms. These terms are used only to distinguish information of the same type from one another. For example, without departing from the scope of this document, first information may also be referred to as second information, and similarly, second information may also be referred to as first information. Depending on the context, the word "if," as used herein, can be interpreted as "at the time of," "when," or "in response to a determination." Furthermore, as used herein, the singular forms "a," "an," and "the" are intended to also include the plural forms unless the context indicates otherwise. It should be further understood that the terms "comprising" and "including" indicate the presence of the stated feature, step, operation, element, component, item, kind, and/or group, but do not exclude the presence, occurrence, or addition of one or more other features, steps, operations, elements, components, items, kinds, and/or groups. The terms "or" and "and/or" as used herein are to be interpreted as inclusive, meaning any one or any combination thereof. Therefore, "A, B, or C" or "A, B, and/or C" means "any one of the following: A; B; C; A and B; A and C; B and C; A, B, and C". Exceptions to this definition will only occur if the combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.

[0023] It should be understood that although the steps in the flowcharts of this application's embodiments are shown sequentially according to the arrows, these steps are not necessarily executed in the order indicated by the arrows. Unless explicitly stated herein, there is no strict order restriction on the execution of these steps, and they can be executed in other orders. Moreover, at least some of the steps in the figures may include multiple sub-steps or multiple stages. These sub-steps or stages are not necessarily completed at the same time, but can be executed at different times, and their execution order is not necessarily sequential, but can be performed alternately or in turn with other steps or at least a portion of the sub-steps or stages of other steps.

[0024] It should be noted that step designations such as S110 and S120 are used in this document for the purpose of more clearly and concisely describing the corresponding content, and do not constitute a substantial limitation on the order. In specific implementation, those skilled in the art may execute S120 first and then S110, etc., but these should all be within the protection scope of this application.

[0025] It should be understood that the specific embodiments described herein are merely illustrative of this application and are not intended to limit this application.

[0026] In the following description, suffixes such as "module," "part," or "unit" are used to denote elements solely for ease of description and have no specific meaning in themselves. Therefore, "module," "part," or "unit" may be used interchangeably.

[0027] To facilitate understanding of this application, the following explanations are provided for terms that may be used in this application: VLA model: Vision-Language-Action model, a foundation model that jointly processes visual observations and language commands and generates robot joint-space actions.

[0028] Cross-ontology transfer (also rendered as cross-entity transfer): a technique that maps the motion patterns of a source entity (human) to a target entity (robot), solving the problem of motions that cannot be executed because of differences in kinematic structure.

[0029] Cross-perspective conversion: a technique that converts an image from an observational perspective (third person) to an execution perspective (first person) to bridge the perspective gap.

[0030] Third-person to first-person generation model: The core model proposed in this invention can convert human third-person videos into robot first-person data end-to-end.

[0031] Flow Matching: A generative modeling framework that learns a velocity field to describe the continuous transformation path from a noise distribution to a target data distribution.

[0032] Velocity Field: In the flow matching framework, a vector field function that describes the instantaneous direction and magnitude of intermediate states as they change over time.

[0033] There are two types of data sources for training embodied intelligent robots: one is directly collected real robot execution data, and the other is human demonstration videos. The former is costly and has poor generalization ability; the latter, although easily obtained, is difficult to use directly for robot training due to differences in ontological structure and perspective. Therefore, existing technologies have the following shortcomings:
1. Single data source: Existing methods rely excessively on expensive real robot data and fail to fully explore the value of massive and readily available human third-person video data, resulting in limited training data scale and insufficient model generalization ability.
2. Broken transformation chain and error accumulation: Existing research lacks a complete technical solution for systematically cascading and end-to-end optimizing cross-ontology transfer and cross-perspective transformation. Errors are amplified step by step during cascading processing, and the quality of the transformed data deteriorates.
3. Loss of semantic consistency: In the process of cross-perspective transformation, existing methods are prone to losing object-level semantic consistency, causing the robot to be unable to accurately understand the spatial position and interaction relationship of the target object, affecting operational accuracy.
4. Low training efficiency: Directly using raw human videos for VLA pre-training introduces a large amount of task-irrelevant dynamic noise, reducing the model's convergence speed and decision quality, and existing methods lack training strategies that effectively integrate with real robot data.

[0034] How to utilize the abundant resource of third-person human video for robot model training and improve the training effect of robot motion models has become a problem that needs to be solved. This application proposes a robot motion model training method to overcome the shortcomings of the prior art. For a clear description of the method provided in this embodiment, please refer to Figure 1. The method includes steps S110 to S130.

[0035] The method provided in this application can be divided into three stages: Stage 1: two-stage conversion; Stage 2: end-to-end generation; Stage 3: hybrid training. These correspond to steps S110 to S130, respectively. In Step S110, joint pre-training of cross-ontology transfer and cross-viewpoint conversion is completed, achieving data preparation and pre-trained model construction, and the parameters of the two models are fixed. In Step S120, based on the parameters fixed in S110, an end-to-end third-person to first-person generation model is constructed. Finally, in Step S130, the generated large-scale first-person video data and a small amount of real robot video data are used to train the motion model, significantly improving the motion model's ability to realistically perceive and generalize operations in the physical world. Each stage and step will be described in detail below.

[0036] Step S110: Construct and train cross-ontology and cross-perspective models.

[0037] In one implementation, a large amount of existing third-person videos cannot be directly used for robot training, requiring the resolution of two major issues: the ontology gap and the viewpoint gap. The ontology gap refers to the fundamental differences between humans and robots in terms of form, degrees of freedom, and kinematic constraints. The viewpoint gap refers to the essential inconsistency between the human third-person observation perspective and the robot's first-person execution perspective in terms of geometric configuration, spatial reference frame, and visual semantic representation. To achieve the conversion from human third-person videos to robot first-person data, two models can be initially built and trained to address the aforementioned issues, with their parameters fixed to provide training targets for subsequent end-to-end generative models. It is noteworthy that the construction and training of the two models are implemented independently and do not interfere with each other.

[0038] In one embodiment, constructing and training a cross-ontology model includes: acquiring first video data, including third-person human operation videos and robot operation videos; extracting and aligning skeletal keypoints in the third-person human operation videos and robot operation videos to establish a first training sample set; constructing an initial cross-ontology model, which adopts a video-to-video diffusion model architecture; inputting the first training sample set into the initial cross-ontology model for inference to obtain a predicted velocity field; obtaining the true velocity field of the first training sample set, and completing the initial cross-ontology model training by minimizing the loss between the predicted velocity field and the true velocity field.

[0039] In one embodiment, to address the transition from human to robot form, a cross-ontology model can be constructed and trained. This model can accurately map human action semantics to the robot joint space, thereby taking a third-person human operation video as input and outputting a joint trajectory sequence or robot operation video that conforms to robot kinematic constraints.

[0040] Training this model requires acquiring the first video data, including a large amount of third-person human operation videos and a small amount of robot operation videos for alignment and learning. The robot operation videos can be real-world robot execution videos, which may also directly include robot joint trajectory data, or allow such data to be extracted, as geometric anchors for cross-ontology mapping.

[0041] First, data synthesis and alignment are performed on third-person human and robot operation videos. This process can leverage a game engine to build a scalable data generation pipeline. Inverse kinematics (IK) skeleton retargeting technology is used to address the skeletal incompatibility between humans and the target robot (such as Unitree G1), ensuring that the same actions can be reused across characters. A single motion animation is baked onto all human and robot characters, thereby extracting and aligning the skeletal keypoints of the third-person human and robot operation videos and generating "human-robot" paired videos with fully consistent motion. Videos are recorded in diverse virtual scenes using the same camera parameters and motion trajectories, intentionally incorporating challenging conditions such as occlusion. Each resulting "human-robot" paired video is treated as one sample, and the samples are aggregated to obtain the first training sample set.

[0042] An initial cross-ontology model is constructed, which adopts a video-to-video diffusion model architecture. The core of the model is a video-to-video diffusion Transformer (DiT) model. Its working mechanism is based on the Flow Matching framework, which learns a velocity field to describe the continuous transformation path from random noise to the target video.

[0043] The first training sample set is input into the initial cross-ontology model for inference to obtain the predicted velocity field. Specifically, as mentioned above, each sample in the first training sample set is a "human-robot" paired video. This video is input into the human activity video encoder in the model for encoding, resulting in a human video latent vector representation, labeled as... Human video latent vector representation through stream matching. The processing is used to generate latent vector representations of the robot video, labeled as This process involves learning a velocity field. It is implemented. From a physical perspective, the velocity field can be viewed as a "motion guidance function"—it tells the system how to adjust its current intermediate state at each time step of the diffusion process. This process gradually brings the video closer to the target robot. It's similar to gradually "carving" clear video frames from a chaotic mass of noise. First, the velocity field needs to be defined. Define a latent vector representation from noise to the target robot video. The linear probability path is defined by the following mathematical expression.

[0044] (1) In the above formula, Indicates diffusion time step The intermediate state; For noise. The velocity field is defined as... It is trained to predict the instantaneous rate of change on the path, i.e., the target velocity, as expressed by the following formula.

[0045] (2) During inference, the latent vector representation of human videos is first performed. Encoded as a condition token , and the generated token sampled from Gaussian noise The data are then concatenated and input into the model for processing to obtain the predicted velocity field. The calculation method for the predicted velocity field is shown in the following formula.

[0046] (3) in For diffusion time step, For Transformer models, the output is used for updating. The speed. Based on the predicted velocity field, the latent vectors of human videos can be represented. Processed into latent vector representation of the target robot video The specific processing method involves processing pure noise. Initially, the intermediate states are iteratively updated by integrating backward along the predicted velocity field using the Euler method. The update process can be referenced in the following formula.

[0047] (4) go through Step iteration (for a preset integer, usually) (Up to 50 steps), finally obtained The decoder then reconstructs the robot video. This design ensures that every frame of the generated video is strictly aligned with the input video, while fully preserving the original background information—thanks to the condition token. Protected by a one-way mask, it is not affected by the denoising process.

[0048] It is evident that the reasoning process depends on the accuracy of the velocity field, and the velocity field itself carries the physical consistency constraint of cross-body motion mapping. (Velocity field) It serves as a bridge connecting "input human video" and "output robot video," encoding complete mapping rules from human actions to robot actions. This includes geometric transformations from human joint motion to robot joint motion, strict alignment constraints for motion timing, and complete preservation of the background scene. Through a stream matching mechanism, the model can learn this complex cross-ontology mapping, thereby achieving high-quality, high-fidelity video conversion. Therefore, the training of the pre-defined cross-ontology model primarily focuses on the velocity field. Specifically, this application designs a loss function for the velocity field, which can be found in the following formula.

[0049] The loss function in formula (5) minimizes the mean squared error between the predicted velocity and the target velocity. The initial cross-ontology model is trained by minimizing this loss between the predicted velocity field and the true velocity field: its parameters are iteratively optimized until the training conditions are met. The initial cross-ontology model that has met the training conditions is designated the cross-ontology model, and its parameters are frozen in preparation for subsequent processing.
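A velocity-field loss of this kind can be sketched as below. The straight-line interpolation between the clean robot-video latent and Gaussian noise is the usual flow matching construction and is an assumption here, since the original formula is not reproduced in this text; `flow_matching_loss`, `zero_predictor`, and `oracle_predictor` are hypothetical names.

```python
import random

def flow_matching_loss(predict_velocity, x_robot, cond, rng):
    """Mean squared error between predicted and target velocity for one sample."""
    t = rng.random()                                    # diffusion time in [0, 1)
    noise = [rng.gauss(0.0, 1.0) for _ in x_robot]
    # Straight-line interpolation: clean latent at t = 0, pure noise at t = 1.
    z_t = [(1.0 - t) * x + t * n for x, n in zip(x_robot, noise)]
    v_target = [n - x for x, n in zip(x_robot, noise)]  # derivative of z_t w.r.t. t
    v_pred = predict_velocity(z_t, cond, t)
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(x_robot)

x_robot = [0.2, -0.4, 1.0]   # toy robot-video latent
cond = [0.1, 0.3, 0.5]       # toy human-video condition tokens (unused by the toy predictors)

def zero_predictor(z_t, cond, t):
    return [0.0 for _ in z_t]

def oracle_predictor(z_t, cond, t):
    # Knows the clean latent, so it can recover the exact target velocity.
    return [(z - x) / t for z, x in zip(z_t, x_robot)]

loss_zero = flow_matching_loss(zero_predictor, x_robot, cond, random.Random(0))
loss_oracle = flow_matching_loss(oracle_predictor, x_robot, cond, random.Random(0))
```

The oracle drives the loss to (numerically) zero while the uninformed predictor does not, which is the signal that training uses to fit the velocity field.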

[0050] In one embodiment, constructing and training a cross-view model includes: acquiring second video data, including multi-view anchor point views and corresponding anchor point semantics; constructing an initial cross-view model, which includes a reconstruction sub-model and a generation sub-model; processing the multi-view anchor point views as training samples in the reconstruction sub-model to output wrist pose and condition map; processing the condition map and anchor point semantics as joint inputs in the generation sub-model to output wrist view video; constructing a loss function based on the difference between the wrist view video and the real wrist view video in the second video data, and iteratively optimizing the parameters of the reconstruction sub-model and the generation sub-model until the training conditions are met.

[0051] In one implementation, in addition to the cross-ontology model, a cross-viewpoint transformation model is trained at the same time. This model transforms the original third-person video into a first-person video. First-person video, typically filmed from the wrist's perspective, directly reflects the robot's visual perception during task execution. The viewpoint transformation can therefore be decomposed into two core tasks, "geometric reconstruction" and "visual generation": the former addresses "where things are in space," and the latter addresses "what it looks like." Accordingly, the model comprises two stages, each handled by one sub-model: the reconstruction sub-model focuses on geometric reconstruction, and the generation sub-model focuses on visual generation. The construction and training of the two sub-models are explained below.

[0052] In this embodiment, the training data is the second video data, specifically including multi-view anchor point views and corresponding anchor point semantics. The multi-view anchor point views are captured by multiple fixed camera positions simultaneously, covering the omnidirectional geometric information of the operation area; the anchor point semantics are generated by manual annotation or weak supervision methods, accurately describing key wrist action nodes and interactive object attributes.

[0053] In one embodiment, inputting the multi-view anchor point views as training samples into the reconstruction sub-model to output a wrist pose and a condition map includes: encoding the multi-view anchor point views into aggregated visual features through a visual encoder in the reconstruction sub-model; performing cross-attention interaction on the aggregated visual features through a wrist-head module to extract wrist camera pose parameters; recovering the wrist view projection from the multi-view anchor point views using a preset multi-view geometric reconstruction method; and reconstructing the condition map based on the wrist camera pose parameters and the wrist view projection.

[0054] In one implementation, the goal of the reconstruction sub-model is to recover the 3D geometry of the scene from multiple anchor point (third-person) views and to estimate the camera pose of the target wrist viewpoint, providing precise spatial guidance for subsequent view generation. The anchor point views are denoted as a frame sequence whose length is the number of video frames.

[0055] For the inference process of the reconstruction sub-model, visual features of the multi-view anchor point views are first extracted through a shared-weight visual encoder. The mathematical expression for the extraction is shown in the following formula.

[0056] In formula (6), the extracted feature map is characterized by its spatial dimensions and its feature dimension. To obtain a global scene understanding, features extracted from multiple frames are aggregated along the time dimension to form a unified scene representation. The aggregation method is shown in the following formula.

[0057] In formula (7), the output is the aggregated visual feature. The aggregation can employ temporal attention, or simple feature concatenation followed by projection, ensuring that the model acquires consistent scene information from consecutive frames.

[0058] Furthermore, a specially designed wrist-head module is added on top of the visual geometry model and operates on the aggregated visual features. The wrist-head module regresses the pose of the wrist camera, achieving wrist camera pose estimation. It does so through a set of learnable wrist queries, which perform cross-attention interaction with the aggregated visual features to extract geometric information related to the wrist viewpoint. The extraction method is shown in the following formula.

[0059] After the extraction in formula (8), the wrist camera pose parameters for each frame are regressed using a multilayer perceptron (MLP). The processing method is shown in the following formula.

[0060] In formula (9), the regressed pose parameters consist of a rotation matrix and a translation vector.
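The cross-attention step of the wrist-head module can be illustrated with a single-head attention written in plain Python. This is a simplified sketch: the learned query, key, and value projections and the MLP pose head are omitted, and the names `cross_attention`, `wrist_queries`, and `scene_features` are assumptions for the example.

```python
import math

def cross_attention(queries, features):
    """Single-head attention: each query attends over all feature tokens."""
    dim = len(features[0])
    outputs = []
    for q in queries:
        scores = [sum(qi * fi for qi, fi in zip(q, f)) / math.sqrt(dim) for f in features]
        peak = max(scores)                               # subtract max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        norm = sum(weights)
        weights = [w / norm for w in weights]            # softmax over feature tokens
        outputs.append([sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)])
    return outputs

wrist_queries = [[1.0, 0.0], [0.0, 1.0]]                  # toy learnable queries
scene_features = [[4.0, 0.0], [0.0, 4.0], [0.0, 0.0]]     # toy aggregated visual features
attended = cross_attention(wrist_queries, scene_features)
```

Each query pulls out the feature tokens most aligned with it; in the described module, the attended vectors would then be passed to the MLP that regresses the rotation and translation.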

[0061] Since ground truth labels for wrist pose are difficult to obtain, this application innovatively proposes Spatial Projection Consistency (SPC) loss, which achieves self-supervised training through geometric constraints between multiple views without the need for labeled data.

[0062] Specifically, the 3D point cloud of the scene is first recovered from the anchor point views using a multi-view geometric reconstruction method (such as structure from motion or multi-view stereo). At the same time, dense 2D-2D correspondences between the anchor point views and the wrist view are established using a feature matching method.

[0063] Each anchor pixel is linked, through its corresponding 3D point, to the matched wrist-view pixel to form a 3D-2D pair. For each pair, the projection of the 3D point under the predicted wrist pose is calculated as follows.

[0064] In formula (10), the projection operator is the camera projection function, and K is the camera intrinsic parameter matrix.
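A standard pinhole form consistent with this description is given below. Apart from K, the symbols are assumptions rather than the original notation: X_i is a reconstructed 3D point, R and t are the predicted wrist rotation and translation, π is the perspective division, and û_i is the projected pixel.

```latex
\hat{u}_i = \pi\big(K\,(R\,X_i + t)\big),
\qquad
\pi\big([x,\ y,\ z]^{\top}\big) = \big[\,x/z,\ \ y/z\,\big]^{\top}
```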

[0065] The SPC loss is defined as shown in the following equation.

[0066] In formula (11), the first set contains the 3D points that fall within the image plane after projection, and the second set contains the points located behind the camera after projection (with negative depth); a hyperparameter balances the contributions of the foreground and background terms. This loss function forces the wrist pose estimate to be consistent with the 3D scene geometry: the projected positions of foreground points should align with their ground-truth correspondences, while background points should be projected behind the camera (with negative depth), thus avoiding geometrically unreasonable pose estimates. On this basis, the wrist view projection is recovered from the multi-view anchor point views.
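A loss with this two-term structure can be sketched as follows. The exact form of formula (11) is not recoverable from this text, so everything here is one possible reading: points with positive depth contribute a squared reprojection error against their matched wrist-view pixels (no image-bounds check is applied), points with negative depth contribute a squared-depth term weighted by `lam`, and the names `to_camera` and `spc_loss` are hypothetical.

```python
def to_camera(point, R, t):
    """Transform a world point into the camera frame: R @ point + t."""
    return [sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)]

def spc_loss(points_3d, pixels_2d, R, t, K, lam=0.1):
    """Reprojection term for points in front of the camera plus a depth term for points behind it."""
    reproj_errors, depth_terms = [], []
    for point, (u_ref, v_ref) in zip(points_3d, pixels_2d):
        x, y, z = to_camera(point, R, t)
        if z > 0.0:
            u = K[0][0] * x / z + K[0][2]      # pinhole projection with intrinsics K
            v = K[1][1] * y / z + K[1][2]
            reproj_errors.append((u - u_ref) ** 2 + (v - v_ref) ** 2)
        else:
            depth_terms.append(z * z)          # assumed form of the negative-depth term
    reproj = sum(reproj_errors) / len(reproj_errors) if reproj_errors else 0.0
    depth = sum(depth_terms) / len(depth_terms) if depth_terms else 0.0
    return reproj + lam * depth

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
points = [[0.0, 0.0, 2.0], [1.0, 0.0, 4.0]]
pixels = [(50.0, 50.0), (75.0, 50.0)]          # matched wrist-view pixels for the true pose
loss_true_pose = spc_loss(points, pixels, identity, [0.0, 0.0, 0.0], K)
loss_wrong_pose = spc_loss(points, pixels, identity, [0.0, 0.0, -3.0], K)
```

The loss needs only the 3D-2D pairs and the predicted pose, which is what allows the wrist pose to be trained without pose labels.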

[0067] Using the estimated wrist pose of each frame, the reconstructed 3D point cloud is projected onto the wrist-view image plane to form a time-aligned condition map sequence. The projection method is shown in the following formula.

[0068] In formula (12), the image dimensions represent the resolution of the wrist-view image. The condition map encodes the spatial geometry of each frame, including the outlines, depths, and relative positions of objects, providing precise geometric guidance for the subsequent generation stage.
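The projection of the point cloud into a per-frame condition map can be sketched as a simple z-buffered rasterization. This is an illustrative assumption about the map's content: it stores only the nearest depth per pixel, whereas the described condition map may carry further channels; `render_condition_map` and the toy intrinsics are hypothetical.

```python
def render_condition_map(points_3d, R, t, K, height, width):
    """Project points into the wrist view and keep the nearest depth per pixel."""
    depth_map = [[0.0] * width for _ in range(height)]    # 0.0 marks "no point"
    for point in points_3d:
        x, y, z = (sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3))
        if z <= 0.0:
            continue                                      # behind the camera
        col = int(round(K[0][0] * x / z + K[0][2]))
        row = int(round(K[1][1] * y / z + K[1][2]))
        if 0 <= row < height and 0 <= col < width:
            if depth_map[row][col] == 0.0 or z < depth_map[row][col]:
                depth_map[row][col] = z                   # nearer point wins
    return depth_map

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
K = [[10.0, 0.0, 2.0], [0.0, 10.0, 2.0], [0.0, 0.0, 1.0]]
cloud = [[0.0, 0.0, 2.0], [0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [5.0, 0.0, 1.0], [0.0, 0.0, -1.0]]
cond_map = render_condition_map(cloud, identity, [0.0, 0.0, 0.0], K, height=5, width=5)
```

Running this once per frame with that frame's estimated wrist pose yields the time-aligned sequence that conditions the generation stage.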

[0069] The core value of the reconstruction sub-model lies in transforming abstract pixel-level information into explicit geometric representations, solving spatial positioning problems such as "where is the camera" and "where is the object," providing a physical basis for viewpoint switching, and ensuring the geometric consistency of the generated video.

[0070] In one embodiment, inputting the condition map and the anchor point semantics jointly into the generation sub-model to output a wrist-view video includes: encoding the condition map into latent variables using a variational autoencoder; spatiotemporally aligning and fusing the latent variables with the anchor semantic embeddings, and concatenating the result with preset global semantic features and text encodings to obtain a semantic conditional embedding; starting from preset noise, using the semantic conditional embedding and the latent variables as guiding conditions, and gradually reconstructing through a denoising diffusion process to obtain reconstructed latent variables; and feeding the reconstructed latent variables into the decoder of the variational autoencoder to decode them into the wrist-view video.

[0071] In one implementation, the goal of the generation sub-model is to synthesize a realistic, temporally coherent wrist-view video based on the sequence of geometric condition maps provided by the reconstruction sub-model. This stage is essentially a conditional video generation task: it takes the condition map sequence and the semantic features of the anchor point views as input, and outputs a high-quality wrist-view video.

[0072] The generation sub-model can employ a conditional Diffusion Transformer (DiT) as its core architecture. The diffusion model iteratively denoises from pure noise to gradually generate the target video. Specifically, the generation sub-model takes as input the condition map obtained from the reconstruction sub-model and the anchor point semantics in the second video data, and generates, under the guidance of these preset conditions, a first-person video that conforms to physical constraints.

[0073] Specifically, first, the condition map sequence is encoded into a latent representation sequence by a variational autoencoder (VAE). Second, the latent variables and the anchor semantic embeddings are spatiotemporally aligned and fused, and then concatenated with preset global semantic features and text encodings to obtain the semantic conditional embedding. The alignment begins by concatenating the latent representation sequence with the initial latent variables sampled from Gaussian noise along the channel dimension, as follows.

[0074] Beyond the concatenation in formula (13), because the condition map may overlook details of small or blurry objects, this application introduces an external semantic path for enhancement. Specifically, global semantic features are extracted from the anchor point views using a pre-trained CLIP image encoder and are merged with the text encoding of the task instruction to form the semantic conditional embedding.

[0075] To enhance temporal consistency, a temporal position encoding and a viewpoint identifier encoding can also be added to the semantic embedding of formula (14), as shown in the following formula.

[0076] The semantic condition of formula (15) serves as a guiding condition: together with the geometric condition map, it guides the generation process, ensuring that the synthesized video is not only geometrically correct but also visually realistic.

[0077] During inference, the concatenated latent variables and the encoding of the diffusion time step are used as joint input, and the DiT architecture of the generation sub-model predicts the noise component at the current time step. The mathematical expression for this inference process is shown below.

[0078] With formula (16), the generation sub-model starts from pure noise during the inference phase and, guided by the condition map sequence and the semantic conditions, gradually denoises to generate a clear wrist-view video. Starting from the initial latent variables, at each diffusion time step the noise component is predicted using the above formula and the latent variables are updated, thereby obtaining the reconstructed latent variables. The reconstructed latent variables are fed into the decoder of the variational autoencoder and decoded into the wrist-view video.

[0079] Accordingly, the generation sub-model is trained by minimizing the mean squared error between the predicted noise and the actual noise. The loss function used for training is shown in the following formula.
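A standard noise-prediction objective consistent with this description is given below. All symbols are assumptions rather than the original notation: z_0 is the clean wrist-view latent, z_τ is its noised version at diffusion step τ, z_c is the condition map latent concatenated along the channel dimension, e_sem is the semantic conditional embedding, and ε_θ is the DiT noise predictor.

```latex
\mathcal{L}_{\text{gen}}
= \mathbb{E}_{z_0,\ \epsilon \sim \mathcal{N}(0, I),\ \tau}
\Big[\, \big\lVert \epsilon - \epsilon_\theta\big([z_\tau ;\, z_c],\ \tau,\ e_{\text{sem}}\big) \big\rVert_2^2 \,\Big]
```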

[0080] Trained with the loss in formula (17), the generation sub-model "renders" the geometric condition map into visually realistic video frames while ensuring temporal coherence through spatiotemporal modeling, thus solving the visual quality problem of "what it looks like." Furthermore, the spatiotemporal attention mechanism in the DiT model can learn the motion continuity between adjacent frames. Specifically, the model treats video as spatiotemporally unified 3D data, modeling spatial structure and temporal evolution simultaneously through 3D attention. During training, the model learns natural motion priors from a large amount of video data, enabling it to generate smooth and coherent action sequences.

[0081] In summary, the cross-view model forms a complete "geometry-vision" collaborative pipeline through its two sub-models, reconstruction and generation. The reconstruction sub-model takes the multi-view anchor point views as input, solves the problem of spatial location, and outputs the wrist pose and the condition map. The generation sub-model takes the condition map and the anchor point semantics as input, solves the problem of "what it looks like," and outputs the wrist-view video. The parameters of the two trained sub-models are frozen and are not updated in the subsequent end-to-end training. The two sub-models are then cascaded and bound together to obtain the cross-view model.

[0082] This two-stage "reconstruct first, generate later" design decomposes the difficult cross-viewpoint conversion problem into two relatively simple sub-problems: geometric estimation (which can be self-supervised using geometric constraints) and conditional video generation (which can be efficiently solved using a generative model). Through this decoupling, this application can generate high-quality, geometrically consistent wrist view videos from anchor point views without requiring ground truth labels for wrist views, providing high-quality paired data for subsequent end-to-end generative model training.

[0083] The cross-view model can receive a "pseudo-robot" third-person video as input and output a robot first-person wrist view video and corresponding semantic alignment labels.

[0084] Step S120: Construct and train a generative model based on the trained cross-ontology model and the trained cross-viewpoint model.

[0085] In one implementation, although the cross-ontology model and the cross-viewpoint model trained above can together produce a large number of simulated first-person robot videos, errors may accumulate when the two models are cascaded, and training directly on the produced videos may introduce unavoidable biases. To ensure the quality of the generated first-person robot videos, a new end-to-end generative model can be trained on the basis of the two frozen models. This model directly predicts robot first-person data from the original third-person human videos, thereby learning to compensate for errors and correct features introduced by the cascade, and improving the conversion quality.

[0086] In one embodiment, a generative model is constructed and trained based on a trained cross-ontology model and a trained cross-viewpoint model, including: acquiring third video data, which is a third-person human operation video; jointly constructing an initial generative model using the cross-ontology model and the cross-viewpoint model; inputting the third video data as training samples into the initial generative model and outputting a wrist-view video with pseudo-labels; calculating reconstruction loss and perceptual loss based on the pseudo-labels; calculating the intermediate layer feature alignment loss of the cross-ontology and cross-viewpoint models; calculating the semantic consistency loss based on the wrist-view video and the third video data; weighted summing the reconstruction loss, perceptual loss, intermediate layer feature alignment loss, and semantic consistency loss to obtain a total loss value; updating the parameters of the initial generative model through backpropagation with the total loss value as the optimization objective until the training conditions are met; and determining the initial generative model that meets the training conditions as the generative model and outputting it.

[0087] In one implementation, the generative model is trained using the third video data, which consists of a large amount of raw, real third-person videos of human operation. The purpose of building and training this generative model is to enable it to take raw video frames as input and to output first-person view video frames of the target robot together with the corresponding semantic alignment labels.

[0088] The generative model can be obtained in several ways: by jointly constructing an initial generative model from the cross-ontology model and the cross-viewpoint model; by distillation from the cross-ontology model and the cross-viewpoint model; or by building a student model and training it with the outputs of the cross-ontology model and the cross-viewpoint model as the teacher. The construction and training methods are not specifically limited. For ease of understanding, this application takes as an example the joint construction of an initial generative model from the cross-ontology model and the cross-viewpoint model. The third video data is input into the initial generative model as training samples, the output is a wrist-view video with pseudo-labels, and supervised pre-training is performed on the generative model. The loss function used to train the generative model in this application is a composite loss comprising a reconstruction loss, a perceptual loss, an intermediate layer feature alignment loss, and a semantic consistency loss, which are explained one by one below.

[0089] The "raw video-target data" pairs obtained from the cross-ontology model and the cross-viewpoint model are used as pseudo-labels, against which the pixel-level reconstruction loss (L1 loss) is calculated. The loss function is shown in the following formula.

[0090] In addition to the reconstruction loss of formula (18), a perceptual loss (LPIPS) is used; its loss function is shown below.

[0091] In addition to the perceptual loss of formula (19), a consistency loss is introduced into the composite loss function to constrain the intermediate features of the generative model to align, in the feature space, with the intermediate features of the two frozen models in the two-stage pipeline. This can be achieved through knowledge distillation, and the loss function is shown in the following formula.

[0092] In formula (20), one term denotes the intermediate layer features of the generative model, and the other denotes the intermediate layer features of a frozen model (such as the cross-view conversion model). This loss forces the end-to-end model to learn the same effective intermediate representations as the cascaded pipeline, while compensating for accumulated errors through joint optimization at the end-to-end level.

[0093] At the same time, a semantic alignment loss is introduced to ensure that the category and position of objects in the generated wrist-view video are consistent with those in the original video. The loss function is shown in the following formula.

[0094] With the semantic consistency loss given by formula (21), the reconstruction loss, the perceptual loss, the intermediate layer feature alignment loss, and the semantic consistency loss are finally combined by weighted summation to obtain the total loss value. The total loss function is shown in the following formula.

[0095] In formula (22), the left-hand side is the total loss value, and the weighting coefficients are preset hyperparameters used to balance the contributions of the individual losses. With the total loss value as the optimization objective, the parameters of the initial generative model are updated through backpropagation until the training conditions are met.
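The weighted summation of the four losses can be sketched as follows. The weight values, the use of mean squared error for the feature alignment term, and the placeholder numbers passed in for the perceptual and semantic terms are assumptions made for illustration; `l1_loss`, `mse_loss`, and `total_loss` are hypothetical helper names.

```python
def l1_loss(pred, target):
    """Pixel-level reconstruction loss against the pseudo-label."""
    return sum(abs(p - q) for p, q in zip(pred, target)) / len(pred)

def mse_loss(pred, target):
    """Assumed form of the intermediate layer feature alignment loss."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)

def total_loss(rec, perc, feat, sem, weights=(1.0, 0.5, 0.1, 0.1)):
    """Weighted sum of the four component losses; the weights are preset hyperparameters."""
    w_rec, w_perc, w_feat, w_sem = weights
    return w_rec * rec + w_perc * perc + w_feat * feat + w_sem * sem

pred_pixels, pseudo_label = [0.2, 0.4, 0.9], [0.0, 0.5, 1.0]
student_feat, teacher_feat = [1.0, 2.0], [1.0, 4.0]   # generative model vs frozen model features
rec = l1_loss(pred_pixels, pseudo_label)
feat = mse_loss(student_feat, teacher_feat)
# The perceptual and semantic terms would come from their own networks; fixed numbers stand in here.
loss = total_loss(rec, perc=0.3, feat=feat, sem=0.2)
```

In a real training loop the returned scalar would be the quantity that backpropagation differentiates with respect to the parameters of the initial generative model only, since the two pipeline models are frozen.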

[0096] An initial generative model that meets the training conditions is identified as the final generative model and output. The trained generative model can efficiently convert any third-person human video into usable first-person data for the robot in real time. Based on the two pre-trained models, the end-to-end generative model learns a direct mapping, avoiding error accumulation in cascaded processing, significantly improving the quality of the converted data, and effectively suppressing cascade errors.

[0097] Step S130: Generate the first video using the trained generative model; obtain the second video and combine it with the first video to form a training set, and train the action model.

[0098] In one implementation, the scheme described above ultimately yields a trained generative model that takes any third-person human video as input and converts it into first-person data usable by the robot in real time. Through this end-to-end generative model, massive amounts of readily available third-person human video are efficiently converted into first-person robot training data, breaking the dependence on expensive teleoperation data and enabling an exponential increase in the scale of pre-training data for the VLA model. The generated video data can then be used to train the motion model. It should be clarified that the first video and the second video differ only in their acquisition channel: the former is obtained by the generative model from first-person or third-person human operation videos, or can be a first-person robot video generated from text or image prompts, while the latter is a pre-provided first-person robot operation video. The content is of the same kind in both cases: first-person videos of a robot performing a specific task, which may include annotation information such as motion trajectory, task objectives, and environmental interference.

[0099] Two alternative action model training methods are subsequently provided: two-stage training and joint training. The former uses a large amount of the first video for large-scale pre-training and then uses the second video for fine-tuning; the latter does not directly distinguish between the two types of training data, but directly trains them together and balances the data scale and quality based on different weights.

[0100] In one implementation, regardless of the training method, the first video must be generated by the generative model. A large number of third-person videos of real human operation can be acquired as input, for example from specific databases or by web crawlers, and processed by the generative model to obtain the first video, which converts the original third-person human operation video into a first-person robot operation video. In a preferred implementation, the first video includes semantic labels, enabling the motion model to directly understand motion trajectories, environmental states, task instructions, and so on, so that it can more efficiently learn the physical laws and constraints of the real world. Correspondingly, a small number of second videos are also needed. These are real first-person robot operation videos, likewise with semantic labels. The motion model is trained using the first and second videos as samples.

[0101] In one embodiment, acquiring a second video and forming a training set with the first video to train an action model includes: labeling samples corresponding to the first video in the training set as first samples and samples corresponding to the second video as second samples; acquiring an initial action model and designing an action prediction loss function; pre-training the initial action model based on the action prediction loss function and the first samples; fine-tuning the pre-trained action model based on the action prediction loss function and the second samples after the pre-training converges until the initial action model meets the training conditions; and labeling the trained initial action model as the target action model and outputting it.

[0102] In one implementation, an initial action model is constructed and designed, which can be a pre-defined VLA base model (such as Prismatic-7B). For two-stage training, the first video and the second video need to be distinguished. The sample corresponding to the first video is labeled as the first sample, and the sample corresponding to the second video is labeled as the second sample.

[0103] In the first stage of training, the model is first trained using the first sample. The loss function for optimizing the action model is designed as follows.

[0104] In formula (23), the prediction target is the latent action token and the predictor is the VLA model. The first sample is input into the initial action model for inference to obtain the predicted action. The predicted action and the original action corresponding to the first sample are input into the above loss function to calculate the loss value, and the initial action model is pre-trained using this loss value. Training iterates until the initial action model meets the preset conditions, after which the second stage of training begins.

[0105] The second training stage uses the second samples, that is, the samples corresponding to the second video, for fine-tuning. Fine-tuning can use the same loss function as the first stage, calculating the loss value and using it as the optimization objective to fine-tune the pre-trained action model. In this stage, parameter-efficient fine-tuning techniques such as LoRA can be used, updating only a small number of parameters to adapt the model to the real physical world.
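The two-stage schedule can be illustrated with a deliberately tiny stand-in for the action model: a single scalar weight fitted by gradient descent. Everything in this sketch is an assumption for illustration, including the slightly biased generated data, the toy data sizes, and the helper names `mse_grad` and `train`; it does not model LoRA or a VLA architecture.

```python
def mse_grad(w, xs, ys):
    """Gradient of mean((w * x - y)^2) with respect to the scalar weight w."""
    return sum(2.0 * x * (w * x - y) for x, y in zip(xs, ys)) / len(xs)

def train(w, xs, ys, lr=0.1, steps=100):
    """Plain gradient descent using the same loss in both stages."""
    for _ in range(steps):
        w -= lr * mse_grad(w, xs, ys)
    return w

# First samples: many generated examples whose action labels are slightly biased.
xs_generated = [0.1 * k for k in range(1, 21)]
ys_generated = [1.8 * x for x in xs_generated]
# Second samples: a few real-robot examples with accurate action labels.
xs_real = [0.5, 1.0, 1.5]
ys_real = [2.0 * x for x in xs_real]

w_pretrained = train(0.0, xs_generated, ys_generated)        # stage 1: large-scale pre-training
w_finetuned = train(w_pretrained, xs_real, ys_real)          # stage 2: fine-tuning on real data
```

Pre-training brings the weight close to the right answer from abundant data, and the small real set then corrects the residual bias, which mirrors the division of labor between the first and second samples.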

[0106] Finally, the initial motion model that has completed training is labeled as the target motion model and output. The trained robot motion decision model can be directly deployed on real robots to achieve zero-shot or few-shot execution of various operational tasks.

[0107] In one embodiment, acquiring a second video and forming a training set with the first video to train the action model further includes: labeling samples corresponding to the first video in the training set as first samples and samples corresponding to the second video as second samples; acquiring an initial action model and designing an action prediction loss function; labeling the loss value calculated by the action prediction loss function and the first sample as a first loss value; labeling the loss value calculated by the action prediction loss function and the second sample as a second loss value; weighting and summing the first loss value and the second loss value as a total loss value, and optimizing the initial action model based on the total loss value; labeling the trained initial action model as the target action model and outputting it.

[0108] In one embodiment, in addition to two-stage training, this application also provides an alternative joint training method. This training method no longer distinguishes between videos, using all samples within the training set as unified training data to train the initial action model. The architecture of the constructed initial action model and the designed loss function can refer to the methods described above, and will not be repeated here.

[0109] However, the loss values calculated for the two kinds of video are labeled separately to distinguish them. The loss value calculated by the action prediction loss function on the first samples is labeled as the first loss value, and the loss value calculated by the action prediction loss function on the second samples is labeled as the second loss value. The total loss value is the weighted sum of the first loss value and the second loss value, calculated as follows.

[0110] In formula (24), the left-hand side is the total loss value, and the two coefficients are preset sampling weights. These sampling weights can be dynamically adjusted based on data quality and scale, thus balancing the influence of data scale and data quality. The total loss value is then used as the optimization objective to iteratively train the initial action model. Once the initial action model meets the preset conditions, it is marked as the target action model and output. The trained robot motion decision model can be directly deployed on a real robot to achieve zero-shot or few-shot execution of various operational tasks.
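The weighted combination of the two loss values can be sketched as below. The weight values and the per-sample losses are made-up numbers for illustration, and `joint_loss` is a hypothetical helper name.

```python
def joint_loss(first_losses, second_losses, w_first=0.3, w_second=0.7):
    """Weighted sum of the mean loss on generated samples and on real-robot samples."""
    first = sum(first_losses) / len(first_losses)      # first loss value
    second = sum(second_losses) / len(second_losses)   # second loss value
    return w_first * first + w_second * second

# Per-sample action prediction losses within one mixed batch (toy numbers).
total = joint_loss(first_losses=[0.8, 1.2, 1.0], second_losses=[0.4, 0.6])
```

Raising `w_second` makes the scarce real-robot samples count for more per example, which is one way to trade data scale against data quality as described above.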

[0111] In addition, the solutions provided in this application also include several alternative solutions, and the embodiments provided above are alternative solutions.

[0112] Alternative Solution 1: Direct end-to-end generative model, without relying on a two-stage pre-trained model. That is, starting directly from step S120. Step S120 directly trains the end-to-end generative model, mapping from third-person human video to robot first-person data, without using any pre-trained cross-ontology transfer or cross-viewpoint transformation modules. This solution requires a massive amount of paired "raw video-target data" samples, which are scarce, making the directly obtained generative model difficult to converge and exhibiting poor generalization ability. This application significantly reduces the training difficulty of the end-to-end model and improves the conversion quality by constructing two pre-trained models to provide prior knowledge.

[0113] Alternative Solution 2: Use only a two-stage pipeline without training the generative model. That is, ignore step S120; after step S110, directly process the original video using the cross-ontology model and the cross-viewpoint model to generate the first video for training the action model. Use the generated data for subsequent training without end-to-end optimization. This solution's first video generation method requires processing by two models, resulting in cascading errors and slow inference speed (requiring the serial running of two models). This application achieves single-pass forward inference by training the generative model, which is faster, and compensates for cascading errors through end-to-end optimization.

[0114] Alternative Solution 3: Train the action model using only one type of data, that is, either only the large-scale generated first video data or only a small amount of real-robot second video data. Using only generated data may leave the model insufficiently accurate in the real physical world, while using only real-robot data limits its generalization ability. This application combines the advantages of both by training with a mixture of the two types of data.

[0115] It is understood that the above alternative solutions are examples of feasible solutions in this application; they are not the preferred implementations.

[0116] The robot motion model training method provided in this application includes the following steps: constructing and training a cross-ontology model and a cross-viewpoint model; constructing and training a generative model based on the trained cross-ontology model and the trained cross-viewpoint model; generating a first video using the trained generative model; acquiring a second video and combining it with the first video to form a training set for training the motion model. Therefore, this application can efficiently transfer the robot ontology using a large amount of existing third-person action data, overcoming the bottleneck of scarce real data. It constructs a complete technical framework of "two-stage transformation + end-to-end generation + hybrid training," which uses massive amounts of third-person human video data to construct and train cross-ontology and cross-viewpoint models, and on this basis trains an end-to-end third-person to first-person generative model, achieving a direct mapping from original human videos to robot first-person data; finally, it mixes the generated large-scale first video data with a small amount of real robot second video data to train a high-performance motion model.

[0117] Specifically, a cross-ontology model and a cross-perspective model are first constructed and cascaded for transformation. An end-to-end third-person to first-person generative model is then trained on the basis of these two pre-trained models to learn the direct mapping, forming a complete transformation chain from raw data to target data. Each module can be optimized and upgraded independently, so the framework can evolve as the underlying techniques improve. During training of the generative model, the output of the two-stage pipeline serves as a pseudo-label, and consistency constraints are introduced in the feature space, so that the generative model predicts the target output directly from the raw input. This avoids error accumulation in cascaded processing and significantly improves the quality of the transformed data. Through the semantic alignment mechanism introduced in the cross-perspective transformation stage and the end-to-end prediction of semantic labels by the generative model, the transformed data possesses both high-fidelity visual details and accurate object semantic information, enhancing the spatial reasoning capabilities of downstream models.
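
The training objective of the generative model can be illustrated with the following minimal Python sketch, which forms a total loss as the weighted sum of a reconstruction loss and a perceptual loss against the pseudo-label, a feature alignment loss, and a semantic consistency loss. The class and function names, the use of a mean absolute difference as the reconstruction term, and all weight values are assumptions for illustration only and are not specified by this application.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class LossWeights:
    """Hypothetical weights for the four loss terms (values are assumptions)."""
    reconstruction: float = 1.0
    perceptual: float = 1.0
    feature_alignment: float = 1.0
    semantic_consistency: float = 1.0


def l1_distance(output: Sequence[float], pseudo_label: Sequence[float]) -> float:
    """Toy reconstruction loss: mean absolute difference to the pseudo-label."""
    if len(output) != len(pseudo_label) or len(output) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(o - p) for o, p in zip(output, pseudo_label)) / len(output)


def total_generative_loss(
    reconstruction: float,
    perceptual: float,
    feature_alignment: float,
    semantic_consistency: float,
    weights: LossWeights = LossWeights(),
) -> float:
    """Weighted sum of the four loss terms used as the optimization objective."""
    return (
        weights.reconstruction * reconstruction
        + weights.perceptual * perceptual
        + weights.feature_alignment * feature_alignment
        + weights.semantic_consistency * semantic_consistency
    )


# Example: the pseudo-label stands in for the output of the two-stage pipeline.
recon = l1_distance([0.0, 1.0], [1.0, 3.0])
loss = total_generative_loss(recon, 2.0, 3.0, 4.0, LossWeights(1.0, 0.5, 0.25, 0.25))
```

In a real system each term would be computed on video tensors and intermediate features, and the weighted total would be minimized by backpropagation; the scalar inputs here only show how the terms are combined.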

[0118] A hybrid training strategy is employed: the massive first-person data produced by the generative model is used to pre-train the VLA model so that it learns generalization priors, and a small amount of high-precision real-robot data is then used for fine-tuning to adapt the model to the real physical world. This balances data scale against data quality and reduces data acquisition costs while ensuring the final execution accuracy of the model. Through the end-to-end generative model, massive amounts of readily available third-person human video are efficiently converted into robot first-person training data, breaking the dependence on expensive teleoperation data and enabling an exponential increase in the scale of VLA model pre-training data.
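
The pre-training followed by fine-tuning schedule described above can be sketched in Python as follows. The sketch uses a deliberately tiny one-parameter linear model in place of a VLA model, and the sample values, learning rates, and epoch counts are all assumptions chosen for illustration; it only shows the order of the two stages, with a larger generated data set used first and a smaller real data set used afterwards.

```python
from typing import List, Tuple

Sample = Tuple[float, float]  # (input, target action), toy stand-in for video/action pairs


def train_epochs(w: float, samples: List[Sample], lr: float, epochs: int) -> float:
    """Run plain stochastic gradient descent on squared error for y = w * x."""
    for _ in range(epochs):
        for x, y in samples:
            grad = 2.0 * (w * x - y) * x  # derivative of (w*x - y)^2 w.r.t. w
            w -= lr * grad
    return w


def pretrain_then_finetune(
    generated: List[Sample],
    real: List[Sample],
    lr_pretrain: float = 0.05,  # assumed learning rate for pre-training
    lr_finetune: float = 0.01,  # assumed smaller learning rate for fine-tuning
    pretrain_epochs: int = 20,
    finetune_epochs: int = 5,
) -> Tuple[float, float]:
    """Return the parameter after pre-training and after fine-tuning."""
    w_pre = train_epochs(0.0, generated, lr_pretrain, pretrain_epochs)
    w_ft = train_epochs(w_pre, real, lr_finetune, finetune_epochs)
    return w_pre, w_ft


# Generated data is plentiful but slightly biased (slope 1.8);
# real-device data is scarce but accurate (slope 2.0).
generated_samples = [(1.0, 1.8), (2.0, 3.6), (3.0, 5.4)]
real_samples = [(1.0, 2.0), (2.0, 4.0)]
w_pre, w_ft = pretrain_then_finetune(generated_samples, real_samples)
```

In this toy setting, pre-training brings the parameter close to the slope of the generated data, and fine-tuning then moves it toward the slope of the real data, mirroring the intended roles of the two data sources.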

[0119] Figure 2 shows an internal structural diagram of a computer device in one embodiment. This computer device can specifically be a terminal or a server. As shown in Figure 2, the device includes a processor 310 and a memory 311 storing a computer program. The processor 310 shown in Figure 2 does not indicate that there is only one processor 310; it only indicates the positional relationship of the processor 310 relative to other components, and in practical applications there can be one or more processors 310. Similarly, the memory 311 shown in Figure 2 only indicates the positional relationship of the memory 311 relative to other components, and in practical applications there can be one or more memories 311. When the processor 310 runs the computer program, the above method, as applied to the device, is implemented.

[0120] The device may also include at least one network interface 312. The various components of the device are coupled together via a bus system 313. It is understood that the bus system 313 is used to implement communication between these components. In addition to a data bus, the bus system 313 also includes a power bus, a control bus, and a status signal bus. However, for clarity, all buses are designated as bus system 313 in Figure 2.

[0121] The memory 311 can be volatile memory or non-volatile memory, or both. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), ferromagnetic random access memory (FRAM), flash memory, magnetic surface memory, optical disc, or compact disc read-only memory (CD-ROM); magnetic surface memory can be disk storage or magnetic tape storage. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static Random Access Memory (SRAM), Synchronous Static Random Access Memory (SSRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), Double Data Rate Synchronous Dynamic Random Access Memory (DDRSDRAM), Enhanced Synchronous Dynamic Random Access Memory (ESDRAM), SyncLink Dynamic Random Access Memory (SLDRAM), and Direct Rambus Random Access Memory (DRRAM). The memory 311 described in the embodiments of the present invention is intended to include, but is not limited to, these and any other suitable types of memory.

[0122] The memory 311 in this embodiment of the invention is used to store various types of data to support the operation of the device. Examples of this data include: any computer programs used to operate on the device, such as operating systems and applications; contact data; phonebook data; messages; pictures; videos, etc. The operating system includes various system programs, such as the framework layer, core library layer, driver layer, etc., used to implement various basic services and handle hardware-based tasks. Applications can include various applications, such as media players, browsers, etc., used to implement various application services. Here, the program implementing the method of this embodiment of the invention can be included in the application.

[0123] Based on the same inventive concept as the foregoing embodiments, this embodiment also provides a computer-readable storage medium storing a computer program. The computer-readable storage medium can be a ferromagnetic random access memory (FRAM), a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM), a flash memory, a magnetic surface memory, an optical disc, or a compact disc read-only memory (CD-ROM), etc.; it can also be any of various devices that include one or any combination of the above-mentioned memories, such as mobile phones, computers, tablet devices, personal digital assistants, etc. When the computer program stored in the computer-readable storage medium is run by a processor, it implements the above method. For the specific steps implemented when the computer program is executed by the processor, please refer to the description of the embodiment shown in Figure 1, which will not be repeated here.

[0124] The technical features of the above embodiments can be combined in any way. For the sake of brevity, not all possible combinations of the technical features in the above embodiments are described. However, as long as there is no contradiction in the combination of these technical features, they should be considered to be within the scope of this specification.

[0125] In this document, the terms “comprising,” “including,” or any other variations thereof are intended to cover non-exclusive inclusion, which includes not only the elements listed but also other elements not expressly listed.

[0126] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A method for training a robot motion model, characterized in that it includes the following steps: constructing and training a cross-ontology model and a cross-perspective model; constructing and training a generative model based on the trained cross-ontology model and the trained cross-perspective model; generating a first video using the trained generative model; and acquiring a second video, combining it with the first video to form a training set, and training the action model.

2. The robot motion model training method as described in claim 1, characterized in that the construction and training of the cross-ontology model includes: acquiring first video data, the first video data including third-person human operation videos and robot operation videos; extracting and aligning the skeletal key points of the third-person human operation videos and the robot operation videos to establish a first training sample set; constructing an initial cross-ontology model that adopts a video-to-video diffusion model architecture; inputting the first training sample set into the initial cross-ontology model for inference to obtain a predicted velocity field; and obtaining the true velocity field of the first training sample set, and completing the training of the initial cross-ontology model by minimizing the loss between the predicted velocity field and the true velocity field.

3. The robot motion model training method as described in claim 1, characterized in that the construction and training of the cross-perspective model includes: acquiring second video data, the second video data including multi-view anchor point views and corresponding anchor point semantics; constructing an initial cross-perspective model that includes a reconstruction sub-model and a generation sub-model; inputting the multi-view anchor point views as training samples into the reconstruction sub-model for processing to output a wrist pose and a condition map; inputting the condition map and the anchor point semantics as joint inputs into the generation sub-model for processing to output a wrist-view video; constructing a loss function based on the difference between the wrist-view video and the real wrist-view video in the second video data; and iteratively optimizing the parameters of the reconstruction sub-model and the generation sub-model until the training conditions are met.

4. The robot motion model training method as described in claim 3, characterized in that the step of inputting the multi-view anchor point views as training samples into the reconstruction sub-model for processing to output the wrist pose and the condition map includes: encoding the multi-view anchor point views into aggregated visual features using the visual encoder in the reconstruction sub-model; extracting wrist camera pose parameters by performing cross-attention interaction on the aggregated visual features through the wrist-head module; recovering the wrist view projection from the multi-view anchor point views using a preset multi-view geometric reconstruction method; and obtaining the condition map based on the wrist view projection and the reconstruction of the wrist view projection.

5. The robot motion model training method as described in claim 3, characterized in that the step of inputting the condition map and the anchor point semantics as joint inputs into the generation sub-model for processing to output the wrist-view video includes: encoding the condition map into latent variables using a variational autoencoder; spatiotemporally aligning and fusing the latent variables with the anchor semantic embeddings, and then concatenating the result with preset global semantic features and text encodings to obtain semantic conditional embeddings; starting from preset noise, using the semantic conditional embeddings as guiding conditions and gradually obtaining reconstructed latent variables through a denoising diffusion process; and feeding the reconstructed latent variables into the decoder of the variational autoencoder to decode the wrist-view video.

6. The robot motion model training method as described in claim 1, characterized in that the constructing and training of a generative model based on the trained cross-ontology model and the trained cross-perspective model includes: acquiring third video data, the third video data being third-person human operation videos; constructing an initial generative model by combining the cross-ontology model and the cross-perspective model; inputting the third video data as training samples into the initial generative model to output a wrist-view video with pseudo-labels; calculating a reconstruction loss and a perceptual loss based on the pseudo-labels; calculating an intermediate layer feature alignment loss between the cross-ontology model and the cross-perspective model; calculating a semantic consistency loss based on the wrist-view video and the third video data; obtaining a total loss value as the weighted sum of the reconstruction loss, the perceptual loss, the intermediate layer feature alignment loss, and the semantic consistency loss; using the total loss value as the optimization objective, updating the parameters of the initial generative model through backpropagation until the training conditions are met; and determining the initial generative model that meets the training conditions as the generative model and outputting it.

7. The robot motion model training method as described in claim 1, characterized in that the step of acquiring the second video and forming a training set with the first video to train the action model includes: within the training set, denoting the samples corresponding to the first video as first samples and the samples corresponding to the second video as second samples; obtaining an initial action model and designing an action prediction loss function; pre-training the initial action model based on the action prediction loss function and the first samples; after the pre-training converges, fine-tuning the pre-trained action model based on the action prediction loss function and the second samples until the training conditions are met; and taking the action model that has completed training as the target action model and outputting it.

8. The robot motion model training method as described in claim 1, characterized in that the step of acquiring the second video and forming a training set with the first video to train the action model further includes: within the training set, denoting the samples corresponding to the first video as first samples and the samples corresponding to the second video as second samples; obtaining an initial action model and designing an action prediction loss function; denoting the loss value calculated using the action prediction loss function and the first samples as a first loss value, and the loss value calculated using the action prediction loss function and the second samples as a second loss value; obtaining a total loss value as the weighted sum of the first loss value and the second loss value; optimizing and training the initial action model based on the total loss value; and taking the initial action model that has completed training as the target action model and outputting it.

9. A computer device, characterized in that it includes a processor and a memory; the processor is configured to execute a computer program stored in the memory to implement the method as described in any one of claims 1 to 8.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program that, when executed by a processor, implements the method as described in any one of claims 1 to 8.