Multimodal knowledge generation and annotation, robotic control methods, systems, and media
By using digital gene methods to parametrically represent and annotate objects, the problems of data scarcity and low annotation efficiency in understanding the availability of object functions are solved, and efficient object knowledge annotation and robot operation generalization are achieved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI JIAOTONG UNIV
- Filing Date
- 2026-03-12
- Publication Date
- 2026-06-19
Figure CN122242640A_ABST
Abstract
Description
[0001] This application claims priority to the following patent application: Application No.: CN202510597606.6; Application Date: 2025-05-09; Application Title: Method, System and Medium for Automatic Generation and Annotation of Multimodal Knowledge of 3D Objects.
Technical Field
[0002] This invention relates to the field of artificial intelligence, specifically to a multimodal knowledge generation and annotation method, a robot control method, a system, and a medium.
Background Technology
[0003] Affordance understanding concerns the interaction between humans and the environment. In robotics, affordance refers to the possible actions that can be performed, or the potential ways in which objects in the environment can interact with a robot. Examples include grasping a cup or picking up a backpack. Understanding the affordance of objects from visual information is crucial for robots to perform operations in dynamic and complex environments. Object affordance understanding has wide applications, such as behavior prediction and prediction of effective object functions. In the field of computer vision, existing work has focused on object affordance based on visual information, known as visual affordance understanding. Many of these works rely on deep neural network algorithms, thus requiring large amounts of labeled data for network training and performance testing.
[0004] However, the current problems encountered in network training and performance testing are as follows: 1. Data scarcity: There are few training data samples available, and acquiring them is costly. For example, data on objects with movable joints is expensive to obtain, and existing datasets are limited in size. For knowledge such as manipulability, manual annotation is extremely complex due to the diversity of object shapes, resulting in a scarcity of relevant labeled datasets.
[0005] 2. Low annotation efficiency: Currently, data sample annotation mainly relies on manual annotation, typically annotating only one type of knowledge for one object at a time. Researchers need to develop different annotation platforms for different knowledge types, and annotators also need to participate in annotation multiple times, consuming a lot of time and manpower.
[0006] 3. Poor generalization: Annotation rules cannot be reused across object categories, and existing methods struggle to support complex tasks.
[0007] In acquiring digital assets, traditional methods mainly include scanning with 3D scanners and manual 3D modeling. Both methods consume significant human and financial resources, making them difficult to promote and implement on a large scale. Furthermore, digital assets acquired through these two methods also face the difficulties mentioned earlier when used for training deep learning-based models.
[0008] In robotic manipulation of objects, most existing methods operate at the level of whole objects or object components and rely heavily on high-quality 3D data for training. Acquiring and labeling large amounts of high-quality 3D data is costly and labor-intensive, and such data can hardly cover the diverse object categories found in the real world. The effectiveness of these methods is therefore severely limited.
[0009] VLA model: Vision-Language-Action model, a class of models designed to handle multiple modalities. It combines robot vision and human language in a large end-to-end model that directly outputs a sequence of robot actions.
[0010] Current robot control schemes based on VLA models generally require massive amounts of training data, which makes data collection and training prohibitively expensive. When the data scale is limited, VLA models cannot achieve sufficient generalization and universality, making it difficult for them to complete tasks efficiently in unfamiliar scenarios.
Summary of the Invention
[0011] To address the shortcomings of existing technologies, the purpose of this invention is to provide a multimodal knowledge generation and annotation method, robot control method, system, and medium.
[0012] A method for automatic generation and annotation of multimodal knowledge according to the present invention includes:
Step 1: Parameterize the different structures of the object using digital genes to obtain a digital gene object representation;
Step 2: Transfer the knowledge annotations defined in the digital genes to the object.
The digital gene includes: a parameterized template with knowledge annotations defined by mathematical rules, written in computer-executable code, which describes a predefined connection structure in the real world through spatial pose constraints and describes the general characteristics of the predefined structure in the real world through a combination of basic geometric shapes.
[0013] Further, step 1 includes: Step 101: Display the object in the interactive interface, establish a three-dimensional coordinate axis with the center of the object as the origin and the normal pose as the reference, and determine the category to which the object belongs. Step 102: Based on the determined category, read the structural composition of the object and prompt the user to select the corresponding digital gene from the digital gene repository based on the structural composition. Then, input and adjust the parameters corresponding to each selected digital gene.
[0014] Furthermore, the knowledge annotation includes regional annotation. Transferring the regional annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Sample a large number of discrete points on the surfaces of both the digital gene object representation and the object; 3) For each point on the surface of the object, select the closest of the points sampled on the surface of the digital gene object representation and establish a correspondence; 4) Using the correspondence, transfer the regional knowledge defined in the digital gene and contained in the digital gene object representation to the corresponding position on the object.
[0015] Furthermore, the knowledge annotation includes pose annotation; Transferring the pose annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Directly label the pose knowledge defined in the digital gene, which is contained in the digital gene object representation, onto the object.
[0016] According to the present invention, an automatic multimodal knowledge generation and annotation system is provided, comprising: Module M1: Parameterizes different structures of an object using digital genes to obtain a digital gene object representation; Module M2: Transfers the knowledge annotations defined in the digital gene to the object; The digital gene includes: a parameterized template with knowledge annotation defined by mathematical rules, written in computer-executable code, which describes a pre-defined connection structure in the real world through spatial pose constraints and describes the general characteristics of the pre-defined structure in the real world through a combination of basic geometric shapes.
[0017] Furthermore, the module M1 includes: The object is displayed in the interactive interface, and a three-dimensional coordinate axis is established with the center of the object as the origin and the normal pose as the reference, and the category to which the object belongs is determined. Based on the determined category, the system reads the structural composition of the object and prompts the user to select the corresponding digital gene from the digital gene repository. The user then inputs and adjusts the parameters corresponding to each selected digital gene.
[0018] Furthermore, the knowledge annotation includes regional annotation. Transferring the regional annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Sample a large number of discrete points on the surfaces of both the digital gene object representation and the object; 3) For each point on the surface of the object, select the closest of the points sampled on the surface of the digital gene object representation and establish a correspondence; 4) Using the correspondence, transfer the regional knowledge defined in the digital gene and contained in the digital gene object representation to the corresponding position on the object.
[0019] Furthermore, the knowledge annotation includes pose annotation; Transferring the pose annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Directly label the pose knowledge defined in the digital gene, which is contained in the digital gene object representation, onto the object.
[0020] According to the present invention, a computer-readable storage medium storing a computer program is provided, wherein when the computer program is executed by a processor, the steps of the aforementioned method for automatic generation and annotation of multimodal knowledge are implemented.
[0021] A robot control method according to the present invention includes the following steps:
Point cloud acquisition step: Acquire visual observation information of the current environment, natural language task instructions, and robot state information; identify and segment the target object based on the visual observation information and the task instructions; and generate three-dimensional point cloud data of the target object.
Parameterization step: Based on the point cloud data of the target object, determine the preset digital gene template ID that matches the target object and the corresponding template parameters, which together constitute the parameterized structural representation of the target object.
Digital gene representation step: Based on the parameterized structural representation, generate a digital gene representation including at least one of a structural information representation and a knowledge information representation, using the method described above.
Fusion step: Fuse the digital gene representation, through a VLA model, with at least one feature selected from the following to generate fused features: visual features generated from the visual observation information, language features generated from the task instructions, and robot state features.
Decoding step: Input the fused features into an action decoder to generate a sequence of action instructions that the robot can execute.
[0022] Furthermore, the step of determining the parameterized structural representation includes: processing the point cloud data with a point cloud encoding network to extract high-dimensional features; determining the digital gene template ID from the high-dimensional features through classification; and obtaining the template parameters from the high-dimensional features through a parameter estimation network.
[0023] Furthermore, the step of generating the 3D point cloud data of the target object includes: analyzing the task instructions with a large language model to generate prompts; and generating the three-dimensional point cloud data with object recognition and semantic segmentation models based on the prompts and the visual observation information.
[0024] Furthermore, the knowledge information representation includes: a regional knowledge representation generated by encoding predefined functional area information; and a pose-type knowledge representation generated by encoding predefined interactive pose information.
[0025] Furthermore, when the knowledge information representation includes the regional knowledge representation, the step of generating the regional knowledge representation includes: the functional region information bound to the parameterized structural representation is matched and transferred, according to the nearest-point principle, to the points constituting the target object's point cloud. As a result, for each type of functional region information, each point in the target object's point cloud carries a binary (0 or 1) label indicating whether it belongs to that functional region. This label can be treated as the color of the point (e.g., 0 for black, 1 for white), so the colored point cloud can be reprojected into image space using the camera's intrinsic and extrinsic parameters to obtain a two-dimensional mask image, which is then encoded using an image encoding neural network.
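The reprojection of a binary-labelled point cloud into a two-dimensional mask can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: it assumes an ideal pinhole camera without lens distortion, ignores occlusion, and the names `project_mask`, `K`, `R`, and `t` are hypothetical.

```python
def project_mask(points, labels, K, R, t, width, height):
    """Project points whose label is 1 into a binary image mask.

    K = (fx, fy, cx, cy) are assumed pinhole intrinsics;
    R (3x3 nested lists) and t (3-vector) are assumed world-to-camera extrinsics.
    """
    fx, fy, cx, cy = K
    mask = [[0] * width for _ in range(height)]
    for p, label in zip(points, labels):
        if label != 1:
            continue  # only points inside the functional region are drawn
        # World -> camera coordinates: X_c = R * p + t
        xc, yc, zc = (sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3))
        if zc <= 0:
            continue  # point is behind the camera
        u = int(round(fx * xc / zc + cx))  # pixel column
        v = int(round(fy * yc / zc + cy))  # pixel row
        if 0 <= u < width and 0 <= v < height:
            mask[v][u] = 1
    return mask


identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
points = [(0.0, 0.0, 1.0), (0.02, 0.01, 1.0), (-0.03, 0.0, 1.0)]
labels = [1, 1, 0]
mask = project_mask(points, labels, (100.0, 100.0, 4.0, 3.0), identity,
                    (0.0, 0.0, 0.0), width=8, height=6)
for row in mask:
    print("".join(str(v) for v in row))
```

In this sketch one mask would be produced per type of functional region; a real pipeline would also need depth testing so that occluded points do not leak into the mask.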
[0026] When the knowledge information representation includes the pose-type knowledge representation, the pose is generally represented by a 7-dimensional vector (a three-dimensional position plus a quaternion), so a multilayer perceptron (MLP) is used directly to extract its features.
[0027] Furthermore, the digital gene representation step employs a diffusion model, which fine-tunes the digital gene representation at the current moment based on the digital gene representation at the previous moment and the visual observation information at the current moment.
[0028] Furthermore, the VLA model is a model obtained by model distillation.
[0029] A robot control system according to the present invention includes:
Point cloud acquisition module: Acquires visual observation information of the current environment, natural language task instructions, and robot state information; identifies and segments the target object based on the visual observation information and the task instructions; and generates three-dimensional point cloud data of the target object;
Parameterization module: Based on the point cloud data of the target object, determines the preset digital gene template ID that matches the target object and the corresponding template parameters, which together constitute the parameterized structural representation of the target object;
Digital gene representation module: Based on the parameterized structural representation, generates a digital gene representation including at least one of a structural information representation and a knowledge information representation, using the method described above;
Fusion module: Fuses the digital gene representation with at least one feature selected from the following to generate fused features: visual features generated from the visual observation information, language features generated from the task instructions, and robot state features;
Decoding module: Inputs the fused features into an action decoder to generate a sequence of action instructions that the robot can execute.
[0030] Compared with the prior art, the present invention has the following beneficial effects: By using a digital gene-based object representation method, this invention increases the efficiency of knowledge annotation for existing 3D objects by more than 60%.
[0031] This invention uses a digital gene-based object representation method to rapidly synthesize an unlimited number of new, knowledge-annotated objects from existing 3D objects, greatly reducing the cost of acquiring digital assets. Training neural networks on these newly generated objects can improve their effectiveness at knowledge annotation in fields such as computer vision and robotics.
[0032] This invention applies object knowledge representation based on digital genes to large-scale robot models, enabling current robot VLA models to use the digital gene knowledge of objects to output better robot actions. Because digital genes are parameterized, scalable, and programmable, neural network models can encode them and integrate object information into model computation more effectively, significantly improving the generalization and versatility of current large-scale robot models.
Attached Figure Description
[0033] Other features, objects, and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings:
Figure 1 is a flowchart of the method of the present invention;
Figure 2 is a schematic diagram showing the combination of basic geometric shapes for a door panel;
Figure 3 is a diagram illustrating the knowledge annotations for handles defined by mathematical rules;
Figure 4 is a schematic diagram of a code template for a digital gene;
Figure 5 is a schematic diagram showing how an object is displayed in the interactive interface;
Figure 6 is a schematic diagram of a differentiable rendering method;
Figure 7 is a comparison of the digital gene representation (right) and the object (left);
Figure 8 is a flowchart of the robot control method provided by this invention;
Figure 9 is a schematic diagram of the timing interaction of the real-time fine-tuning mechanism in this invention.
Detailed Implementation
[0034] The present invention will now be described in detail with reference to specific embodiments. These embodiments will help those skilled in the art to further understand the present invention, but do not limit the invention in any way. It should be noted that those skilled in the art can make several changes and improvements without departing from the concept of the present invention. These all fall within the protection scope of the present invention.
[0035] Embodiment 1
As shown in Figure 1, this embodiment provides a method for automatic generation and annotation of multimodal knowledge of 3D objects, including: Step 1: Parameterize the different structures and joints of the object using digital genes to obtain a digital gene object representation.
[0036] A digital gene is a parameterizable template with knowledge annotations defined by mathematical rules, written in computer-executable code. The template describes a connection structure in the real world through spatial pose constraints, or describes the general characteristics of a structure in the real world through a combination of basic geometric shapes. It is defined as a "class" in computer code and can be extended through inheritance.
[0037] For example, regarding “describing a certain connection structure in the real world through spatial pose constraints”, taking the horizontal ground as the XY plane in spatial coordinates, the pose of a drawer contained in a piece of furniture is subject to the following constraints relative to the pose of the furniture itself: the difference between their Z coordinates remains unchanged, and their relative motion is restricted to a certain fixed straight line on the XY plane.
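The drawer constraint described above can be sketched as a sliding joint with a single degree of freedom. This is only an illustrative sketch: the class name `SlidingJoint`, its fields, and the travel limit `max_travel` are assumptions introduced here and do not appear in the original text.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class SlidingJoint:
    """Hypothetical drawer-style constraint: motion along one fixed line in the XY plane."""
    axis_xy: tuple     # direction of the fixed line in the XY plane
    z_offset: float    # constant Z difference between drawer and furniture
    max_travel: float  # assumed travel limit (not specified in the text)

    def drawer_position(self, furniture_pos, travel):
        """Return the drawer position for a given slide distance along the line."""
        t = min(max(travel, 0.0), self.max_travel)  # clamp to the allowed range
        ax, ay = self.axis_xy
        norm = math.hypot(ax, ay)                   # normalise the direction
        fx, fy, fz = furniture_pos
        return (fx + t * ax / norm, fy + t * ay / norm, fz + self.z_offset)


joint = SlidingJoint(axis_xy=(0.0, 1.0), z_offset=0.3, max_travel=0.4)
closed = joint.drawer_position((1.0, 2.0, 0.0), 0.0)
opened = joint.drawer_position((1.0, 2.0, 0.0), 0.25)
print(closed, opened)
```

Whatever the slide distance, the Z coordinate of the drawer relative to the furniture stays fixed and the displacement stays on the chosen line, which is the content of the constraint.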
[0038] "Basic geometric shapes" include, but are not limited to, cuboids, cylinders, spheres, prisms, pyramids, and tori. As an example of "describing the general characteristics of a structure in the real world through a combination of basic geometric shapes", as shown in Figure 2, the structure of a door panel with a "Π"-shaped handle can be composed of four cuboids assembled in a specific manner: one represents the door panel, and the other three are assembled to form the handle.
[0039] Attributes of these cuboids, such as length, width, height, position, and orientation, can all be expressed as parameters, so that one template represents a family of individuals with the same structural description but different actual shapes.
[0040] Regarding "defining knowledge annotations through mathematical rules": for the door panel structure described above, one can define that 1) the "handle" of this type of structure is the region occupied by the three specified cuboids; and 2) the part that the robot can grasp when performing the door-opening operation is the region corresponding to the horizontal cuboid in the middle of the "Π"-shaped handle. In another example, for a door structure with an L-shaped handle, the definable knowledge, as shown in Figure 3, includes the handle region, the region suitable for applying force, the grasping pose, and the overall handle pose.
[0041] "Computer-executable code" includes, but is not limited to, code written in languages such as C, C++, and Python. For example, Figure 4 shows Python code describing a generalized cuboid (including frustums). The collection of all digital genes is called the digital gene repository. In actual use, users can select the digital genes they need from the repository; when a required digital gene is not present in the repository, users can define and write it themselves and add it to the repository for subsequent operations.
[0042] "Extension through inheritance" means that, when writing new digital gene code, existing digital gene code templates can be reused to achieve rapid expansion. For example, the code template shown in Figure 4 can be called in any digital gene containing a cuboid without having to be rewritten.
[0043] In step 1, the basic geometric structure and joint representation of the object are completed by parameterizing a series of digital genes corresponding to different structures and joints. The following steps can be followed to obtain the digital gene representation of an object: Step 101: For a given object, display the object in the interactive interface (Figure 5 shows the display result for a 3D object) and establish a three-dimensional coordinate axis with the center of the object as the origin and its normal pose as the reference. The user selects the category to which the object belongs in the system; categories include, but are not limited to, furniture such as chairs and tables, and tools such as scissors and pliers.
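A parameterized door panel with a "Π"-shaped handle, together with its rule-defined region annotations, might look like the sketch below. The class `PiHandleDoor`, its parameter names, and the axis conventions are hypothetical choices made for illustration; the actual template of the invention is the one referred to in the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cuboid:
    center: tuple  # (x, y, z) of the box centre
    size: tuple    # extents along x, y, z

    def contains(self, p, eps=1e-9):
        return all(abs(p[i] - self.center[i]) <= self.size[i] / 2 + eps for i in range(3))


class PiHandleDoor:
    """Hypothetical digital gene: one panel cuboid plus three handle cuboids."""

    def __init__(self, door_w, door_h, door_t, handle_w, handle_depth, bar_t):
        y0 = door_t / 2                  # front face of the panel
        x_post = (handle_w - bar_t) / 2  # posts sit at both ends of the bar
        self.parts = {
            "panel": Cuboid((0, 0, 0), (door_w, door_t, door_h)),
            "post_left": Cuboid((-x_post, y0 + handle_depth / 2, 0), (bar_t, handle_depth, bar_t)),
            "post_right": Cuboid((x_post, y0 + handle_depth / 2, 0), (bar_t, handle_depth, bar_t)),
            "grip_bar": Cuboid((0, y0 + handle_depth + bar_t / 2, 0), (handle_w, bar_t, bar_t)),
        }
        # Rule-defined knowledge: each region is a named set of parts.
        self.regions = {
            "handle": ("post_left", "post_right", "grip_bar"),
            "graspable": ("grip_bar",),
        }

    def labels_at(self, point):
        """Return the set of region names whose parts contain the point."""
        return {name for name, parts in self.regions.items()
                if any(self.parts[p].contains(point) for p in parts)}


door = PiHandleDoor(door_w=0.8, door_h=2.0, door_t=0.04,
                    handle_w=0.2, handle_depth=0.05, bar_t=0.02)
print(door.labels_at((0.0, 0.08, 0.0)))   # point inside the horizontal bar
print(door.labels_at((0.09, 0.04, 0.0)))  # point inside the right post
print(door.labels_at((0.3, 0.0, 0.5)))    # point inside the panel only
```

Because the regions are expressed in terms of the parts rather than fixed coordinates, changing the six parameters yields a different door whose annotations follow automatically.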
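Extension through inheritance can be illustrated with two small classes. The names `CuboidGene` and `PlateGene` and the way regions are stored are assumptions made for this sketch; they are not the code template of Figure 4.

```python
class CuboidGene:
    """Hypothetical base gene: an axis-aligned cuboid with one region rule."""

    def __init__(self, length, width, height):
        self.length, self.width, self.height = length, width, height

    def regions(self):
        # The region is defined by a rule on the gene's own parameters.
        return {"top_surface": {"z": self.height / 2}}


class PlateGene(CuboidGene):
    """A thin cuboid that reuses the parent geometry and adds new knowledge."""

    def __init__(self, length, width, thickness=0.02):
        super().__init__(length, width, thickness)

    def regions(self):
        regions = super().regions()  # inherit the parent's annotations
        regions["supportable_surface"] = regions["top_surface"]
        return regions


plate = PlateGene(length=1.0, width=0.5)
print(plate.regions())
```

The subclass does not restate the cuboid geometry; it only narrows the parameters and adds an annotation, which is the kind of reuse the paragraph above describes.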
[0044] Step 102: Based on the selected category, the system reads the structure / component composition of the object from a predefined database. According to the structural composition requirements, the user selects appropriate digital genes from the digital gene repository, then inputs and adjusts the parameters (including pose parameters, i.e., simultaneously adjusting the pose of the digital gene parameter instances). During the user's parameter adjustment, the system synchronously displays the joints / shapes determined by the digital gene code template and the current parameters for the user's reference. The user adjusts the parameters until a series of selected digital gene instances effectively represent the structure of the current object as a whole.
[0045] Step 103 (optional): After the digital genes have been selected and their parameters adjusted, if the user considers that the accuracy of the parameters needs further improvement, the parameters can be fine-tuned through the provided parameter optimization system.
[0046] Taking the code template shown in Figure 4 as an example: according to its code definition, parameters such as height and top_length are passed into the template, and a geometric shape (a digital gene instance), whose shape is determined by the digital gene itself and the parameters, is instantiated.
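Since the code of Figure 4 is not reproduced in the text, the following is only a guess at what instantiating a generalized cuboid could involve. The parameters `height` and `top_length` come from the description; `top_width`, `bottom_length`, `bottom_width`, and the function name are hypothetical.

```python
def generalized_cuboid(height, top_length, top_width, bottom_length, bottom_width):
    """Return the 8 vertices of a cuboid whose top and bottom rectangles may differ.

    Equal top and bottom rectangles give an ordinary cuboid; a smaller top gives a frustum.
    """
    vertices = []
    for z, length, width in ((-height / 2, bottom_length, bottom_width),
                             (height / 2, top_length, top_width)):
        for sx, sy in ((-1, -1), (1, -1), (1, 1), (-1, 1)):
            vertices.append((sx * length / 2, sy * width / 2, z))
    return vertices


box = generalized_cuboid(height=1.0, top_length=2.0, top_width=1.0,
                         bottom_length=2.0, bottom_width=1.0)
frustum = generalized_cuboid(height=1.0, top_length=1.0, top_width=0.5,
                             bottom_length=2.0, bottom_width=1.0)
print(box[4:], frustum[4:])  # the four top vertices of each instance
```

Both instances come from the same template; only the parameter values differ.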
[0047] "Displaying in the interactive interface" is done by rendering images or 3D object models.
[0048] "Normal pose" refers to the most common standard pose of an object in daily life.
[0049] The "parameter optimization system" exploits the differentiable rendering property built into the digital gene design, which ties the parameters of a digital gene directly to the shape of the digital gene instance. The difference between the shape of the digital gene instance and the shape of the current object is computed, and the parameter values are optimized by gradient backpropagation. A simple example: given a target object whose shape is close to a cuboid, a cuboid digital gene template (a "parameterized cuboid template") can be instantiated with random or approximate length, width, and height parameters to produce a cuboid mesh through differentiable rendering. Because the rendering is differentiable, the difference between this mesh and the target object's shape can be written as a continuous function of the length, width, and height parameters of the digital gene. The parameters are adjusted iteratively to reduce the value of this function until the difference between the generated mesh and the target object is minimized, which yields the optimal parameters. "Differentiable rendering" here refers to mapping parameters to geometric shapes through continuously differentiable mathematical operations.
[0050] Figure 6 shows an example of a differentiable rendering approach. The "difference between the shape of the digital gene instance and the shape of the current object" can be calculated using methods including, but not limited to, Chamfer distance and point-to-mesh distance.
[0051] Figure 7 shows a comparison between the digital gene representation of an object (right) and the object itself (left). As can be seen, the digital gene representation effectively preserves the object's geometric information in three-dimensional space while remaining fully parametric. This greatly facilitates tasks such as data annotation and new object generation.
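The parameter optimization described above can be sketched for the cuboid case in plain Python. This is a simplified stand-in, not the system of the invention: the gradient of a symmetric Chamfer distance is derived by hand instead of being obtained by automatic backpropagation, the nearest-neighbour search is brute force, and all function names, the sampling grid, and the learning rate are assumptions.

```python
import itertools


def unit_box_surface(n=4):
    """Canonical surface samples (u, v, t) of a unit box centred at the origin."""
    ticks = [i / n - 0.5 for i in range(n + 1)]
    return [p for p in itertools.product(ticks, repeat=3)
            if max(abs(c) for c in p) == 0.5]


def render(params, canon):
    """Differentiable 'rendering': every point is linear in (length, width, height)."""
    return [tuple(c[k] * params[k] for k in range(3)) for c in canon]


def sqdist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def chamfer_and_grad(params, canon, target):
    """Symmetric Chamfer distance and its gradient w.r.t. the three parameters."""
    pts = render(params, canon)
    loss, grad = 0.0, [0.0, 0.0, 0.0]
    for p, c in zip(pts, canon):  # template -> target term
        q = min(target, key=lambda t: sqdist(p, t))
        loss += sqdist(p, q) / len(pts)
        for k in range(3):
            grad[k] += 2.0 * (p[k] - q[k]) * c[k] / len(pts)
    for q in target:              # target -> template term
        j = min(range(len(pts)), key=lambda i: sqdist(pts[i], q))
        loss += sqdist(pts[j], q) / len(target)
        for k in range(3):
            grad[k] += 2.0 * (pts[j][k] - q[k]) * canon[j][k] / len(target)
    return loss, grad


def fit(target, init=(1.0, 1.0, 1.0), lr=0.5, steps=30):
    """Plain gradient descent on the Chamfer distance; returns params and loss history."""
    canon = unit_box_surface()
    params, history = list(init), []
    for _ in range(steps):
        loss, grad = chamfer_and_grad(params, canon, target)
        history.append(loss)
        params = [p - lr * g for p, g in zip(params, grad)]
    return params, history


# Assumed target: surface samples of a 2.0 x 1.0 x 0.5 cuboid.
target = render((2.0, 1.0, 0.5), unit_box_surface())
params, history = fit(target)
print(history[0], history[-1], params)
```

With a sufficiently small step size the recorded loss does not increase from one iteration to the next, although descent of this kind can still stop at a local minimum; a practical system would more likely use an automatic differentiation framework and a spatial index for the nearest-neighbour queries.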
[0052] Step 2: Transfer the knowledge annotations defined in the digital gene to the object.
[0053] Using the above-mentioned object representation method based on digital genes, knowledge annotations defined on digital genes can be automatically transferred to objects represented by digital gene instances. In this way, the overall knowledge annotation acquisition speed can be increased by more than 60%.
[0054] "Knowledge annotation" comprises two parts: regional annotation (e.g., the region within a container-shaped structure that can be used to hold objects) and pose annotation (e.g., an effective pose for grasping a handle-shaped structure).
[0055] For regional annotation, the correspondence between the digital gene object representation and the object itself can be obtained, thereby propagating the knowledge information defined in the digital gene to the object itself. The "correspondence" is obtained through the following steps: 1) Align the center of the digital gene object representation with the center of the object itself, restoring both to their normal poses. 2) Sample a large number of discrete points on both the digital gene object representation and the object's surface. 3) For any point on the object's surface, select the point closest to that point from the digital gene object representation surface, establishing a correspondence. "Propagation" refers to using the correspondence to transfer the regional knowledge defined in the digital gene, contained in the digital gene object representation, to a specified location on the object itself.
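The correspondence and propagation steps above can be sketched as a nearest-point label transfer. The sketch assumes that both point sets are already in their normal pose and that "center" means the centroid of the sampled points; the function names are hypothetical, and the brute-force search would normally be replaced by a spatial index for large point sets.

```python
def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))


def centered(points):
    """Translate a point set so that its centroid sits at the origin."""
    c = centroid(points)
    return [tuple(p[i] - c[i] for i in range(3)) for p in points]


def transfer_region_labels(gene_points, gene_labels, object_points):
    """Give each object point the label of its nearest gene point."""
    gene = centered(gene_points)   # step 1: align the two centres
    obj = centered(object_points)
    labels = []
    for p in obj:                  # step 3: nearest-point correspondence
        j = min(range(len(gene)),
                key=lambda i: sum((gene[i][k] - p[k]) ** 2 for k in range(3)))
        labels.append(gene_labels[j])  # step 4: propagate the label
    return labels


# Step 2 (sampling) is assumed done; tiny hand-made samples stand in for it.
gene_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
gene_labels = [0, 0, 1, 1]         # 1 marks a hypothetical functional region
object_points = [(10.1, 5.0, 0.0), (11.0, 5.1, 0.0), (11.9, 5.0, 0.0), (13.0, 4.9, 0.0)]
object_labels = transfer_region_labels(gene_points, gene_labels, object_points)
print(object_labels)
```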
[0056] For pose annotation, the following steps can be taken: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose. 2) Directly annotate the pose knowledge defined in the digital gene within the digital gene object representation onto the object itself.
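Once the centres are aligned and both shapes are in their normal pose, transferring a pose annotation reduces to a translation. The sketch below assumes a 7-dimensional pose laid out as position followed by a quaternion in (w, x, y, z) order; that ordering and the function name `transfer_pose` are assumptions.

```python
def transfer_pose(gene_pose, gene_center, object_center):
    """Move a pose defined on the gene instance into the object's frame.

    Only the position changes; the orientation is kept because both shapes
    are assumed to be in the same normal pose.
    """
    x, y, z, *quat = gene_pose
    dx, dy, dz = (o - g for o, g in zip(object_center, gene_center))
    return (x + dx, y + dy, z + dz, *quat)


grasp_on_gene = (0.5, 0.25, 1.0, 1.0, 0.0, 0.0, 0.0)  # identity orientation
grasp_on_object = transfer_pose(grasp_on_gene, (0.0, 0.0, 0.0), (1.0, 2.0, 3.0))
print(grasp_on_object)
```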
[0057] Embodiment 2
This invention also provides an automatic generation and annotation system for multimodal knowledge of 3D objects. This system can be implemented by executing the steps of the automatic generation and annotation method for multimodal knowledge of 3D objects. That is, those skilled in the art can understand the method as a preferred embodiment of the system. The system includes: module M1: parameterizing different structures and joints of the object using digital genes to obtain a digital gene object representation; module M2: transferring the knowledge annotations defined in the digital genes to the object.
[0058] In module M1, the basic geometric structure and joint representation of the object are completed by parameterizing a series of digital genes corresponding to different structures and joints. To achieve this, module M1 may include the following submodules, which ultimately yield the digital gene representation of an object: Module M101: For a given object, display the object in the interactive interface (Figure 5 shows the display result for a 3D object) and establish a three-dimensional coordinate axis with the center of the object as the origin and its normal pose as the reference. The user selects the category to which the object belongs in the system; categories include, but are not limited to, furniture such as chairs and tables, and tools such as scissors and pliers.
[0059] Module M102: Based on the selected category, the system reads the structure / components of the object from a predefined database. According to the structural requirements, the user selects appropriate digital genes from the digital gene repository, then inputs and adjusts the parameters (including pose parameters, i.e., simultaneously adjusting the pose of the digital gene parameter instances) for each selected digital gene. During parameter adjustment, the system simultaneously displays the joints / shapes determined by the digital gene code template and the current parameters for the user's reference. The user adjusts the parameters until a series of selected digital gene instances effectively represent the structure of the current object.
[0060] Module M103 (optional): After the digital genes have been selected and their parameters adjusted, if the user considers that the accuracy of the parameters needs further improvement, the parameters can be fine-tuned through the provided parameter optimization system.
[0061] Module M2, through the aforementioned object representation method based on digital genes, can automatically transfer knowledge annotations defined on digital genes to objects represented by digital gene instances. In this way, the overall knowledge annotation acquisition speed can be increased by more than 60%.
[0062] "Knowledge annotation" comprises two parts: regional annotation (e.g., the region within a container-shaped structure that can be used to hold objects) and pose annotation (e.g., an effective pose for grasping a handle-shaped structure).
[0063] For regional annotation, the correspondence between the digital gene object representation and the object itself can be obtained, thereby propagating the knowledge information defined in the digital gene to the object itself. The "correspondence" is obtained through the following steps: 1) Align the center of the digital gene object representation with the center of the object itself, restoring both to their normal poses. 2) Sample a large number of discrete points on both the digital gene object representation and the object's surface. 3) For any point on the object's surface, select the point closest to that point from the digital gene object representation surface, establishing a correspondence. "Propagation" refers to using the correspondence to transfer the regional knowledge defined in the digital gene, contained in the digital gene object representation, to a specified location on the object itself.
[0064] For pose annotation, the following steps can be taken: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose. 2) Directly annotate the pose knowledge defined in the digital gene within the digital gene object representation onto the object itself.
[0065] In other embodiments, a sample generation method can be formed based on embodiment 1, which generates samples for training and performance testing of neural networks based on knowledge-annotated objects. The trained neural network can be used to identify and annotate 3D objects.
[0066] In other embodiments, a sample generation system can be formed based on embodiment 2, which generates samples for training and performance testing of neural networks based on knowledge-annotated objects. The trained neural network can be used to identify and annotate 3D objects.
[0067] In other embodiments, the digital gene code in Embodiment 1 can be extended to form an extensible method for digital gene code. When writing new digital gene-related code, the digital gene code corresponding to the inherited object can be generated through inheritance based on the digital gene code annotated with existing knowledge.
[0068] In other embodiments, the digital gene code in Embodiment 2 can be extended to form an extensible digital gene code system. When writing new digital gene-related code, the digital gene code corresponding to the inherited object can be generated through inheritance based on the digital gene code annotated with existing knowledge.
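One way to picture the inheritance mechanism is ordinary class inheritance, where a new gene template reuses the knowledge annotations of an existing one. The sketch below is only an illustration under that assumption; the class names `PlateGene` and `TrayGene`, the millimetre units, and the "holding area" label are hypothetical.

```python
class PlateGene:
    """Existing gene template that already carries a knowledge annotation."""
    region_annotations = {"top_surface": "supportable surface"}

    def __init__(self, length, width, thickness):
        self.length, self.width, self.thickness = length, width, thickness

    def size(self):
        """Bounding-box size (x, y, z) of this instance, in millimetres."""
        return (self.length, self.width, self.thickness)


class TrayGene(PlateGene):
    """New gene template: a plate with a raised rim (hypothetical example)."""
    # Inherit the parent's annotations and add one new region.
    region_annotations = {**PlateGene.region_annotations,
                          "inner_volume": "holding area"}

    def __init__(self, length, width, thickness, rim_height):
        super().__init__(length, width, thickness)
        self.rim_height = rim_height

    def size(self):
        length, width, thickness = super().size()
        return (length, width, thickness + self.rim_height)


tray = TrayGene(400, 300, 20, 50)
```

The derived template gets the "supportable surface" annotation without redefining it, and the parent's annotation table stays unchanged.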
[0069] In other embodiments, a computer-readable storage medium storing a computer program may also be provided. When executed by a processor, the computer program implements the steps of the above-described method for automatic generation and annotation of multimodal knowledge of 3D objects, the sample generation method, or the extensible method for digital gene code.
[0070] Example 3 This embodiment builds on Examples 1 and 2. For the understanding of those skilled in the art, the basic process for annotating the 3D model of the chair shown in Figure 7 is as follows: A 3D model of a chair without any knowledge annotations is loaded, and then the parameterization step begins. The classification model identifies the target object model as belonging to the "chair" category. Based on this high-level semantics, the system selects the digital genes necessary to construct the chair from a digital gene repository. For a typical four-legged chair, its structure can be decomposed into a seat, a backrest, and four legs. Accordingly, the operator can select a "plate-like" gene to represent the seat, another "plate-like" gene to represent the backrest, and a "four cuboid" gene (here the four legs of the chair are abstracted as a single gene template) to represent the four legs. These genes selected from the repository have relevant knowledge predefined within them. For example, the top surface of the "plate-like" gene is labeled as a "supportable surface" (a type of regional annotation), while the sides of the "cuboid" gene are labeled as "graspable areas" (another type of regional annotation), and a "recommended grasping pose" (a type of pose annotation) for handling may also be defined.
[0071] After the genes are selected, the three selected digital genes (one for the seat, one for the backrest, and one covering the four legs) are instantiated. The overall shape of the combination of these gene instances is adjusted to roughly align with the original chair target object model in terms of structure, proportion, and spatial layout. This aligned gene combination is the digital gene object representation of the chair, as shown for the example chair in Figure 7. Next, the system initiates knowledge transfer.
[0072] For the transfer of regional knowledge annotations, such as transferring the "supportable surface" annotation from the seat gene to the chair model, the following operations are performed: First, high-density random sampling is performed on the surface of the seat gene instance and the surface of the original chair model, generating two discrete point clouds. Then, for each point in the point cloud of the original chair model surface, the algorithm finds the nearest point in the point cloud of the seat gene instance by calculating the Euclidean distance or another distance metric, thus establishing a dense point-to-point geometric correspondence between the two surfaces. Finally, the system checks the knowledge label of each sampled point on the seat gene instance. If a gene sample point is labeled as "supportable surface," then, through the aforementioned correspondence, all chair model sample points that have that gene sample point as their nearest point, together with their corresponding triangular facets, are also labeled as "supportable surface." In this way, regional knowledge defined on abstract genes is "propagated" or "projected" onto the corresponding surface regions of the concrete model. The "graspable area" knowledge is transferred from the chair leg genes to the actual chair leg surfaces in the same manner.
[0073] For the transfer of pose-based knowledge annotations, such as transferring the "recommended grasping pose" defined in the chair leg gene to the chair model, the process is more direct. Since each chair leg gene instance has been precisely aligned to the actual chair leg position during the parameterization step, knowledge transfer only requires reading the pose (a transformation matrix relative to the chair leg gene coordinate system) and applying it to the global coordinate system to obtain a specific grasping pose on the original chair model. This pose information (usually including position and orientation) is recorded as a knowledge annotation for the chair model.
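The pose transfer described above amounts to composing two rigid transforms. The sketch below assumes 4x4 homogeneous matrices; the helper names and the numeric values (a leg placed at (0.2, 0.2, 0) and rotated 90 degrees about the vertical axis) are hypothetical and chosen only to make the arithmetic easy to follow.

```python
import numpy as np

def make_transform(rotation, translation):
    """Build a 4x4 homogeneous transform from a 3x3 rotation and a translation."""
    transform = np.eye(4)
    transform[:3, :3] = rotation
    transform[:3, 3] = translation
    return transform

def transfer_pose(world_from_gene, gene_from_grasp):
    """Express a grasp pose defined in the gene frame in the global frame."""
    return world_from_gene @ gene_from_grasp

# Pose of the aligned chair-leg gene instance in the global frame (assumed values).
rot_z_90 = np.array([[0.0, -1.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 0.0, 1.0]])
world_from_leg = make_transform(rot_z_90, [0.2, 0.2, 0.0])

# Recommended grasping pose stored in the gene, relative to the leg's own frame.
leg_from_grasp = make_transform(np.eye(3), [0.05, 0.0, 0.25])

world_from_grasp = transfer_pose(world_from_leg, leg_from_grasp)
grasp_position = world_from_grasp[:3, 3]
```

Here `grasp_position` comes out as (0.2, 0.25, 0.25): the gene-frame offset is rotated by the leg's orientation and then shifted by the leg's position.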
[0074] After the above steps, a 3D chair model with multimodal knowledge annotations, including "supportable surfaces," "graspable areas," and "recommended grasping poses," is finally output. Compared to traditional methods that require users to manually select faces or vertices for annotation, this embodiment transforms most of the work into high-level semantic matching (gene selection) and coarse-grained parameter adjustment, while the core annotation step (knowledge assignment) is completed automatically by the system, thereby greatly improving annotation efficiency and consistency.
[0075] To further reduce the workload of manually adjusting parameters and improve the accuracy of fitting between the digital gene object representation and the target object, automatic parameter optimization is achieved through a technique based on differentiable rendering. All parameters involving geometry and pose in the digital gene object representation (e.g., the length, width, and thickness parameters of the chair surface gene, the height and cross-sectional dimensions of the chair leg gene, and the position and rotation parameters of each of the six gene instances) are treated as a trainable parameter vector p.
[0076] The shape of a geometric solid is determined by the positions of all its vertices (e.g., v1-v8). In the definition of a digital gene, the position of a vertex is calculated from a parameter p through a series of continuously differentiable mathematical operations (e.g., addition and multiplication); that is, the vertex position is a function f(p) of the parameter p. This ensures that the entire process of generating a 3D shape from a parameter is differentiable.
[0077] The iterative optimization loop follows these steps: 1. Differentiable rendering and difference calculation: In each iteration, using the current parameters $p$, the system computes in real time, through a differentiable rendering pipeline, a precise 3D mesh of the digital gene object representation. It then calculates the geometric difference between the rendered mesh and the target object's mesh. This difference is typically measured with the chamfer distance. Specifically, the system samples a large number of points on the surfaces of both objects, forming two point clouds. The chamfer distance consists of two parts: the average nearest distance from the rendered mesh point cloud to the target mesh point cloud, and the average nearest distance from the target mesh point cloud to the rendered mesh point cloud. The sum of these two parts constitutes the loss function value $L(p)$ under the current parameters.
[0078] 2. Gradient Calculation and Backpropagation: Because the entire process, from the parameters $p$ through mesh rendering to the chamfer distance calculation, is differentiable, the system can use automatic differentiation (similar to gradient calculation in deep learning frameworks) to calculate the gradient $\nabla_p L(p)$ of the loss function with respect to each parameter in the parameter vector $p$. The gradient vector indicates how each parameter should be adjusted to reduce the geometric difference.
[0079] 3. Parameter Update: A gradient-based optimization algorithm (such as stochastic gradient descent or an improved variant such as the Adam optimizer) updates the parameter vector based on the calculated gradient. The update rule is usually $p \leftarrow p - \eta \nabla_p L(p)$, where $\eta$ is the learning rate, which controls the step size of each update.
[0080] 4. Iteration and Convergence: The system repeats steps 1 to 3 above. In each iteration, the parameters $p$ are fine-tuned so that the rendered digital gene object representation moves closer to the target object model and the loss function value $L(p)$ decreases accordingly. The optimization process terminates when the loss function value falls below a preset minimum threshold, or when the number of iterations reaches a preset upper limit.
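The four steps of the loop can be illustrated with a small, self-contained sketch. Several simplifications are assumed and are not taken from the application: the gene is a single cuboid whose parameter vector p holds only its three side lengths, the point sets are the eight vertices rather than dense surface samples, the chamfer loss uses squared distances (a common smooth variant), plain gradient descent replaces Adam, and central finite differences stand in for automatic differentiation so that the example needs only NumPy.

```python
import numpy as np

# Corner offsets of a unit cube: the eight vertices v1..v8 of a cuboid gene.
CORNERS = np.array([[x, y, z] for x in (-0.5, 0.5)
                    for y in (-0.5, 0.5) for z in (-0.5, 0.5)])

def vertices(p):
    """Vertex positions f(p) for size parameters p = (length, width, height)."""
    return CORNERS * p

def chamfer(a, b):
    """Symmetric chamfer loss: mean squared nearest distance a->b plus b->a."""
    sq = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=2)
    return sq.min(axis=1).mean() + sq.min(axis=0).mean()

def numerical_gradient(loss_fn, p, eps=1e-5):
    """Central finite differences (stand-in for automatic differentiation)."""
    grad = np.zeros_like(p)
    for i in range(p.size):
        step = np.zeros_like(p)
        step[i] = eps
        grad[i] = (loss_fn(p + step) - loss_fn(p - step)) / (2 * eps)
    return grad

def fit(target_points, p0, lr=0.5, max_iters=200, tol=1e-12):
    """Steps 1-4: evaluate the loss, compute the gradient, update, repeat."""
    p = np.asarray(p0, dtype=float)
    loss_fn = lambda q: chamfer(vertices(q), target_points)
    for _ in range(max_iters):
        if loss_fn(p) < tol:          # convergence threshold reached
            break
        p = p - lr * numerical_gradient(loss_fn, p)   # p <- p - lr * grad
    return p, loss_fn(p)

# Target: a seat-like plate measuring 0.40 x 0.40 x 0.05 (assumed values).
target = vertices(np.array([0.40, 0.40, 0.05]))
p_fit, final_loss = fit(target, p0=[0.30, 0.50, 0.10])
```

In a real pipeline the pose parameters of every gene instance would be part of p as well, and an automatic differentiation framework would supply the gradient.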
[0081] After optimization, a digital gene object representation that closely matches and is precisely aligned with the target object is obtained. Based on this high-precision representation, the knowledge transfer step is then performed. Understandably, due to the significant improvement in fitting accuracy, the transferred knowledge annotations (such as the boundaries of functional regions) also become more accurate.
[0082] Example 4 This embodiment illustrates a robot control method enhanced with digital gene knowledge. This method aims to improve the robot's success rate and generalization ability in complex tasks, especially when facing unfamiliar scenarios, by providing the vision-language-action model with structured prior knowledge that precisely corresponds to the target object instance.
[0083] The described vision-language-action model mainly consists of two parts: an input encoding module and an action decoding module. The input encoding module includes a language encoding module, an image encoding module, a digital gene region knowledge encoding module, a digital gene pose knowledge encoding module, and an input information fusion module. Except for the digital gene pose knowledge encoding module (implemented using an MLP), the other modules are implemented based on the Transformer architecture (the image encoding module and the digital gene region knowledge encoding module are both based on the Vision Transformer (ViT)). The outputs of all encoding modules are passed to the input information fusion module as high-dimensional feature sequences, and the final output, the fused "input information representation," is used for subsequent calculations. The action decoding module uses a Transformer-based diffusion policy model and takes the "input information representation" as input. It starts from Gaussian noise and, after multiple denoising iterations, outputs an action sequence, which is then used to guide the actual movement of the robotic arm.
[0084] The hardware architecture of the robot control system includes a robot body, such as a six- or seven-axis collaborative robotic arm, with a gripper mounted at its end as the end effector for grasping or manipulating objects. To perceive the surrounding environment, the system is equipped with sensors. In one embodiment of this application, the sensor may be a depth camera capable of simultaneously acquiring color images and depth information, which may be mounted on the wrist of the robot body or on an external mounting bracket to provide an "eye-in-hand" or global visual observation.
[0085] All data processing, algorithm execution, and decision generation are completed on a single computing device. Specifically, this computing device can be a high-performance industrial computer, server, or workstation with a graphics processing unit, internally equipped with a processor and memory containing computer program instructions for executing the methods described in this application. The computing device receives natural language task instructions from the operator through a user input interface, such as via keyboard input or voice recognition. The computing device processes observation information from sensors and instructions from the user input interface, performs a series of calculations, generates a sequence of action instructions, and sends it to the robot controller. Accordingly, the robot controller is responsible for parsing these instructions and driving the various joints of the robot body and the end effector to complete the specified physical operation task.
[0086] The specific steps of the robot control method provided in this embodiment are described below with reference to the flowchart shown in Figure 8.
[0087] First, the system acquires multimodal input and generates a point cloud of the target object. In a specific operational scenario, for example, when a red, oddly shaped cup that the robot has never seen before is placed on a table, the user can issue a natural language task command through the user input interface: "Pick up the red cup." Simultaneously, the system continuously acquires the robot's current state information, such as the joint angles and angular velocities of the robotic arm, and the opening and closing width of the end effector gripper. Sensors capture visual observation information of the current environment, including a color image of the red cup and its corresponding depth map.
[0088] After receiving these multimodal inputs, the computing device first parses the task instructions. As shown in Figure 8, the user command is fed into a large language model analysis module. This large language model could be, for example, a pre-trained language model based on the Transformer architecture, which analyzes the intent of the command "pick up the red cup" and extracts key descriptions of the target object to generate one or more prompt words, such as "red cup".
[0089] Subsequently, the system uses the prompt word to identify and segment the target object. Specifically, the prompt word can be input together with the currently observed color image into an open-vocabulary object recognition and segmentation model combination. For example, an object recognition model (such as DINO) can first be used to locate the bounding box that may contain the object in the image based on the prompt word "red cup". Then, the bounding box and the prompt word are used as input to a semantic segmentation model (such as SAM) to accurately segment the region occupied by the red cup at the pixel level, thereby generating a binary mask image.
[0090] After obtaining the precise pixel mask of the target object, the system combines the depth map acquired at the same time with the sensor's intrinsic and extrinsic parameters (including focal length, principal point coordinates, distortion coefficients, and the camera's pose in the world coordinate system) to back-project each pixel in the mask area from the two-dimensional image coordinate system to the three-dimensional world coordinate system, thereby generating a cluster of three-dimensional point cloud data representing the geometry of the surface of the red cup, i.e., the target object point cloud.
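The back-projection step can be sketched with a pinhole camera model. The example below is a simplified illustration: it ignores lens distortion (which the paragraph lists among the intrinsic parameters), and the function name, the 4x4 image size, and all numeric values are assumptions chosen for readability.

```python
import numpy as np

def backproject(depth, mask, fx, fy, cx, cy, world_from_camera):
    """Lift masked depth pixels into a 3D point cloud in the world frame."""
    # Pixel coordinates (row v, column u) inside the mask with valid depth.
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    # Pinhole model: pixel -> camera-frame coordinates.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    points_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)
    # Extrinsics: camera frame -> world frame.
    return (points_cam @ world_from_camera.T)[:, :3]

depth = np.full((4, 4), 2.0)                 # every pixel is 2 m away
mask = np.zeros((4, 4), dtype=bool)
mask[1, 3] = True                            # one pixel belongs to the object
world_from_camera = np.eye(4)
world_from_camera[:3, 3] = [0.0, 0.0, 1.0]   # camera shifted 1 m along world z

cloud = backproject(depth, mask, fx=2.0, fy=2.0, cx=1.0, cy=1.0,
                    world_from_camera=world_from_camera)
```

With these values the single masked pixel maps to the world point (2, 0, 3).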
[0091] Next, the system extracts a parameterized structural representation, namely a digital gene instance. To enable the robot to understand the specific structure of an object, rather than simply processing a random point cloud, this embodiment introduces the concept of a "digital gene." A digital gene instance is a structured, parameterized description of an object instance, mainly composed of two parts: a template ID and template parameters. The template ID is a discrete identifier used to characterize the object's category and basic topological structure; for example, "cup_template_01" represents a cup with a single handle. The template parameters are a set of continuous or discrete numerical values used to describe the specific geometric attributes of the instance, such as overall dimensions (e.g., height, rim radius), main shape (e.g., whether it is a standard cylinder), and additional components (e.g., the position, size, and orientation of the handle).
[0092] To extract the digital gene instance from the point cloud of the target object generated in the previous step, the system employs a specially designed deep learning network. As shown in Figure 8, the process includes point cloud feature extraction, digital gene classification, and digital gene parameter estimation. Specifically, the target object point cloud is input into a point cloud encoding network, which can adopt an architecture such as PointTransformer that can effectively process point cloud data. This network captures the local and global geometric relationships between points in the point cloud through a multi-layer self-attention mechanism, and finally outputs a high-dimensional feature vector that summarizes the shape of the entire point cloud.
[0093] The high-dimensional feature vector is fed in parallel into two different network heads: a classification head and a parameter estimation network (also known as a regression head). The classification head can be a fully connected network that receives the high-dimensional feature vector and outputs a probability distribution across all preset digital gene template IDs. By selecting the category with the highest probability, the system determines the template ID that best matches the current target object, such as "cup_template_01". The parameter estimation network can be a multilayer perceptron or a Transformer-based regression network, which also receives the high-dimensional feature vector and outputs a set of template parameters corresponding to the determined template ID. For example, for "cup_template_01", the network would output specific values such as a cup height of 10 cm, a radius of 4 cm, a handle position vector of [x, y, z], and dimensions of [dx, dy, dz]. Through this process, the system generates a unique and accurate digital gene instance for the red cup in the scene, completing the parameterized representation of its entity structure.
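The last stage of this pipeline, turning the classification head's output into a template ID and packaging it with the regressed parameters, can be sketched as follows. The network itself is not modeled here: the logits are a hand-written stand-in for the classification head's output, the two extra template IDs are hypothetical, and the parameter names are assumptions (only the 10 cm height and 4 cm radius come from the text).

```python
import numpy as np
from dataclasses import dataclass

# Preset template IDs; only the first appears in the text, the others are hypothetical.
TEMPLATE_IDS = ["cup_template_01", "bowl_template_01", "bottle_template_01"]

@dataclass
class DigitalGeneInstance:
    template_id: str     # discrete identifier of category and topology
    parameters: dict     # geometric attributes of this particular instance

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    exp = np.exp(logits - logits.max())
    return exp / exp.sum()

def select_template(logits):
    """Pick the template ID with the highest predicted probability."""
    probs = softmax(np.asarray(logits, dtype=float))
    best = int(probs.argmax())
    return TEMPLATE_IDS[best], float(probs[best])

logits = [2.0, 0.5, -1.0]                       # stand-in classification output
template_id, confidence = select_template(logits)
# Stand-in regression output for the selected template.
instance = DigitalGeneInstance(template_id, {"height_cm": 10.0, "radius_cm": 4.0})
```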
[0094] Then, the system generates digital gene representations. After the digital gene instances are obtained, they need to be converted into a format that the vision-language-action model can understand and utilize, i.e., high-dimensional feature vectors. In a preferred embodiment of this application, the system simultaneously generates and uses two core digital gene representations: structural information representation and knowledge information representation.
[0095] The generation of structural information representations aims to provide parametric information about the overall structure of an object to the vision-language-action model. As shown in the "symbolization" step of Figure 8, the system performs unified symbolization processing on discrete template IDs and continuous template parameters. Specifically, the template ID "cup_template_01" is mapped to a predefined or learnable embedding vector. Continuous template parameters, such as a height of 10 cm, are first quantized. For example, if the preset height range is 0 to 20 cm with a step size of 0.5 cm, then 10 cm is quantized to an index value of 20. This index value of 20 is then also mapped to an embedding vector. After all parameters undergo similar processing, the system obtains a sequence of feature vectors representing the structure of the object. To enable the model to better understand the relationships between these parameters, this feature vector sequence can be input into a small Transformer encoder for context encoding, and its output is the final "digital gene structure information" representation.
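The quantization and embedding lookup can be sketched in a few lines. The range, step size, and the 10 cm example come from the paragraph above; the embedding width of 8 and the randomly initialized table are assumptions standing in for a learned embedding.

```python
import numpy as np

def quantize(value, low, high, step):
    """Map a continuous parameter to a bin index; also return the bin count."""
    num_bins = int(round((high - low) / step)) + 1
    index = int(round((value - low) / step))
    # Clamp values that fall outside the preset range.
    return min(max(index, 0), num_bins - 1), num_bins

# Height of 10 cm in a 0-20 cm range with a 0.5 cm step.
index, num_bins = quantize(10.0, low=0.0, high=20.0, step=0.5)

# One embedding row per bin; random values stand in for learned weights.
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(num_bins, 8))
height_token = embedding_table[index]
```

The range has 41 bins (indices 0 to 40), and 10 cm lands on index 20, matching the example in the text.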
[0096] The generation of knowledge information representation aims to explicitly provide the vision-language-action model with functional knowledge that is directly related to physical interaction and predefined in digital gene templates. In this embodiment, knowledge information representation is further divided into region-based knowledge representation and pose-based knowledge representation. The process of generating region-based knowledge representation is as follows: The digital gene template "cup_template_01" predefines functional region information, such as the "handle" region. This definition can be a set of parametric surfaces or point sets based on the template coordinate system. The system first extracts the theoretical definition of the "handle" from the digital gene instance, and then, through algorithms such as coordinate transformation and nearest-point search, maps and transfers this theoretical handle region to the actual point cloud of the current target object, thereby identifying the point set belonging to the handle part in the actual cup point cloud. Next, in order to use a powerful two-dimensional vision model for encoding, the system reprojects this three-dimensional point cloud marked with the handle region back into the two-dimensional image space using the camera parameters, generating a binary mask image of the same size as the original observed image, in which the pixel value of the handle region is 1 and the pixel value of the other regions is 0. Finally, this compact mask image is fed into a vision model (e.g., a Vision Transformer), which encodes it into one or more high-dimensional feature vectors, namely, "digital gene knowledge information (region)" representations. These representations provide the vision-language-action model with explicit information about specific functional areas of an object (such as a handle).
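The reprojection of the labeled point cloud into a binary mask can be sketched as below. This is a simplified illustration that assumes a pinhole camera without distortion and does not handle occlusion; the function name, the 4x4 image size, and the sample points are hypothetical.

```python
import numpy as np

def render_region_mask(points_world, labels, camera_from_world,
                       fx, fy, cx, cy, height, width):
    """Project the labeled 3D points into the image and mark their pixels as 1."""
    labels = np.asarray(labels, dtype=bool)
    homog = np.hstack([points_world, np.ones((len(points_world), 1))])
    cam = (homog @ camera_from_world.T)[:, :3]
    # Keep only points in the functional region that lie in front of the camera.
    keep = labels & (cam[:, 2] > 0)
    u = np.rint(fx * cam[keep, 0] / cam[keep, 2] + cx).astype(int)
    v = np.rint(fy * cam[keep, 1] / cam[keep, 2] + cy).astype(int)
    inside = (u >= 0) & (u < width) & (v >= 0) & (v < height)
    mask = np.zeros((height, width), dtype=np.uint8)
    mask[v[inside], u[inside]] = 1
    return mask

points = np.array([[2.0, 0.0, 2.0],    # handle point
                   [0.0, 0.0, 2.0],    # body point, not in the region
                   [0.0, 2.0, 2.0]])   # handle point
is_handle = np.array([True, False, True])

handle_mask = render_region_mask(points, is_handle, np.eye(4),
                                 fx=2.0, fy=2.0, cx=1.0, cy=1.0,
                                 height=4, width=4)
```

The two handle points land on pixels (row 1, column 3) and (row 3, column 1); the unlabeled body point leaves its pixel at 0.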
[0097] The process of generating pose-based knowledge representations is as follows: The digital gene template "cup_template_01" can also predefine key pose information related to the interaction, such as one or more recommended grasping poses. Each pose is typically represented as a six-degree-of-freedom transformation, including three-dimensional position and three-dimensional orientation (usually represented by quaternions or Euler angles). The system extracts these predefined grasping poses from the digital gene instance; these poses are relative to the object's own coordinate system. Applying the object's current pose transformation yields the representation of these grasping poses in the world coordinate system. Then, this six- or seven-dimensional pose vector (position plus Euler angles, or position plus quaternion) is input into a simple neural network (e.g., a multilayer perceptron) for encoding, outputting a high-dimensional feature vector, i.e., the "digital gene knowledge information (pose)" representation. This representation gives the vision-language-action model direct suggestions for feasible interaction poses.
[0098] Finally, the system performs fusion decision-making and action generation. At this point, the system has prepared all the information to be input to the core vision-language-action model. As shown in Figure 8, the model receives a feature vector sequence that integrates multiple modalities, including but not limited to: visual features obtained by processing the original color image with the main visual encoder; linguistic features obtained by processing the user instruction "pick up the red cup" with the text encoder; robot state features encoded from the robot's current joint angles, gripper width, and other state information; and the digital gene structure information representation, digital gene regional knowledge representation, and digital gene pose knowledge representation generated in the aforementioned steps.
[0099] In a Transformer-based vision-language-action model, these feature vector sequences from different sources are concatenated and then deeply fused through multi-layered cross-attention and self-attention modules. During the fusion process, the model learns to associate the linguistic command "pick up" with the "handle" region and "grasping pose" from the digital gene knowledge, while simultaneously combining visual information to confirm the actual position of the object. Understandably, this fusion guides the model's decision-making process with explicit structural and functional prior knowledge.
[0100] Finally, the fully fused global features are fed into an action decoder. This decoder uses a diffusion policy to generate a sequence of action commands that the robot can execute step by step. Each action command can be a seven-dimensional vector, containing the target pose (6-dimensional) that the robotic arm's end effector needs to reach and the target opening width (1-dimensional) of the end effector gripper. This series of action commands is sent to the robot controller, driving the robot to move precisely to the vicinity of the cup's handle, grasp the handle with the appropriate pose and opening width, and finally pick up the cup, completing the task.
[0101] Thanks to the precise geometric and functional prior knowledge provided by digital genes, the method in this embodiment can guide the robot to interact correctly even if the shape of the target cup has never appeared in the training data, thus demonstrating a superior generalization ability and task success rate compared to traditional vision-language-action (VLA) models.
[0102] By extracting parameterized structures that precisely correspond to object instances from real-time point clouds, accurate geometric and functional prior knowledge is provided to the vision-language-action model. This enables the model to better understand and manipulate previously unseen objects, significantly improving its success rate in physical manipulation in unfamiliar scenes. Furthermore, due to the introduction of structured prior knowledge, the model no longer needs to learn the physical properties and interaction methods of all objects from scratch, effectively reducing its dependence on massive and diverse training data and the cost of model training. In addition, by making the structure and functional regions of objects explicit, the model's decision-making process has better interpretability. When both structural and functional information about objects are simultaneously introduced into the model, the success rate of manipulation is further improved.
[0103] Example 5 This embodiment provides a variant of the knowledge information representation generation method in Embodiment 4. As an optional implementation, this embodiment aims to illustrate that the technical solution of this application is not limited to a specific encoding method and may retain richer information in certain scenarios.
[0104] The overall process of this embodiment is basically the same as that of Embodiment 4, including steps such as acquiring multimodal input, extracting digital gene instances, generating digital gene representations, and fusion decision-making. The core difference lies in the specific method for generating "regional knowledge representations".
[0105] In Example 4, to generate a regional knowledge representation, the system projects a subset of the 3D point cloud corresponding to the identified functional regions (such as handles) back into the 2D image space to form a binary mask image, which is then encoded using a visual Transformer designed for image processing. It should be noted that while this method effectively utilizes a powerful pre-trained 2D vision model, some spatial depth and geometric structure information may be lost during the projection from 3D to 2D. For each type of functional region information, each point in the target object's point cloud carries a binary (0 or 1) label corresponding to that function. This label can be considered as the color of the point cloud (e.g., 0 for black, 1 for white), and thus, the colored point cloud can be reprojected into the image space using the camera's intrinsic and extrinsic parameters to obtain a 2D mask image, which is then further encoded using an image encoding neural network.
[0106] When the knowledge information representation includes the pose-type knowledge representation, since the pose-type knowledge representation is generally a 7-dimensional vector (three-dimensional position plus quaternion), a multilayer perceptron (MLP) is used directly to extract its features.
[0107] As an alternative, this embodiment proposes a method for directly encoding functional regions in three-dimensional space. The specific process is as follows: Similar to Embodiment 4, the system first extracts the definition of the functional region (such as "handle") from the digital gene instance 100 and transfers it to the actual point cloud of the target object to obtain a three-dimensional point cloud subset containing only the handle region points.
[0108] Next, unlike in Example 4, the system does not perform two-dimensional projection. Instead, it directly inputs the three-dimensional point cloud subset into a small, dedicated encoding network for processing point cloud data. This network can be a lightweight PointNet, PointNet++, or a PointTransformer model with fewer layers, designed to directly extract the three-dimensional geometric features from the input point cloud subset and output a single high-dimensional feature vector. This vector is the "regional knowledge representation" in this example.
[0109] The remaining steps, including the generation of structural information representation and pose-based knowledge representation, as well as the final multimodal fusion and action generation, are the same as those described in Example 4. Finally, this regional knowledge representation, directly encoded from a subset of the 3D point cloud, is fed into the vision-language-action model along with all other representations for decision-making.
[0110] The beneficial effect of this embodiment is that by directly encoding the functional area in three-dimensional space, it avoids the information loss that may result from three-dimensional to two-dimensional projection. For functional areas with complex three-dimensional structures (such as a spiral handle or a recessed button), this method can preserve more complete geometric details, thereby potentially generating more accurate and information-rich knowledge representations, which helps robots perform more refined and challenging operational tasks. Therefore, the core of the steps for generating region-based knowledge representations proposed in this application lies in encoding the information of the functional area, rather than being limited to the specific technical path of two-dimensional projection.
[0111] Example 6 This embodiment addresses the dynamic issues in the robot-environment interaction process by proposing an optimization scheme for the method described in Embodiment 4, aiming to improve the method's operational efficiency and response speed in real-time dynamic scenarios. This scheme can be combined with... Figure 9 To understand the real-time fine-tuning mechanism shown.
[0112] In many practical applications, the process of a robot performing a task is continuous and dynamic. For example, as the robot arm extends towards a target object, changes in the observation perspective due to visual occlusion, slight movement of the object, or the robot's own movement require the system to have an updated understanding of the target object's state in each control cycle (typically tens of milliseconds). If the digital gene instance extraction process (including point cloud segmentation, feature extraction, classification, and regression) in Example 4 is completely repeated in every frame, the computational overhead would be enormous, or it would be difficult to meet the real-time requirements of high-frequency closed-loop control of the vision-language-action model.
[0113] To address this issue, this embodiment introduces a mechanism of "initial calculation plus incremental fine-tuning." Please refer to Figure 4. This time-series interaction diagram illustrates the collaborative process between the perception module, the digital gene update module, and the vision-language-action decision-making module.
[0114] First, at the initial moment of the task (t=0), the system executes a complete digital gene extraction process. As shown in step S101, the perception module acquires initial visual observation information. Subsequently, in step S102, the system performs the complete set of calculations described in Example 4, from point cloud generation to digital gene instance (template ID and template parameters) extraction. This step involves a large amount of computation, but it is only executed once at the start of the task to obtain an initial, comprehensive, and structured understanding of the target object.
[0115] Next, at each subsequent time step (t=1, 2, 3, …) during task execution, the system enters an efficient real-time fine-tuning loop. In step S201, the perception module acquires the visual observation information at the current time t. The key step is S202, where the system no longer repeats the complete extraction process but instead uses a pre-trained diffusion model as the digital gene update module. This diffusion model is specifically trained to quickly predict minute changes in digital gene parameters. Its inputs include: template parameter 120 from the digital gene instance at the previous time step (t-1); and the visual observation information at the current time step (t), or, for further speed improvements, simplified visual features extracted by a lightweight encoder.
[0116] The diffusion model, through a single fast forward pass (i.e., the denoising process), can directly predict the template parameters at the current time step (t) from the parameters at the previous time step and the current visual changes. For example, if an object is pushed slightly, the model quickly updates its pose-related parameters. It will be understood that, since the template ID typically remains unchanged throughout a task, only the template parameters need to be fine-tuned.
[0117] In step S203, the system uses the latest digital gene instance obtained after fine-tuning to quickly generate updated digital gene structure information representation and knowledge information representation according to the method in Example 4. Finally, in step S204, these latest representations are sent to the vision-language-action decision module and fused with other real-time information to generate robot action commands at time t.
[0118] The computational load of the entire real-time fine-tuning loop (S201 to S204) is far less than that of a single complete extraction process, which can meet the high-frequency operation requirements of the robot control system. In this way, this embodiment can achieve rapid and continuous tracking of the structured knowledge of the target object in dynamic scenes, ensuring that the vision-language-action model is based on the latest information at each decision point, thereby achieving smoother, more stable, and more responsive closed-loop control.
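The "initial calculation plus incremental fine-tuning" loop of steps S101 to S204 can be sketched as follows. This is a minimal illustration only, not the applicant's implementation: `GeneInstance`, `run_control_loop`, and the three callables `full_extract`, `fine_tune`, and `decide` are hypothetical names standing in for the complete extraction process, the diffusion-based update module, and the vision-language-action decision module respectively.

```python
from dataclasses import dataclass, replace
from typing import Any, Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class GeneInstance:
    """Hypothetical container: a template ID plus its template parameters."""
    template_id: int
    params: Tuple[float, ...]


def run_control_loop(
    observations: Iterable[Any],
    full_extract: Callable[[Any], GeneInstance],
    fine_tune: Callable[[Tuple[float, ...], Any], Tuple[float, ...]],
    decide: Callable[[GeneInstance, Any], Any],
) -> List[Any]:
    """Run one full extraction at t=0, then cheap parameter updates for t>=1."""
    obs_iter = iter(observations)
    first_obs = next(obs_iter)            # S101: initial observation
    instance = full_extract(first_obs)    # S102: expensive, executed once

    actions = []
    for obs in obs_iter:                  # S201: observation at time t
        # S202: update only the parameters; the template ID is kept as-is.
        new_params = tuple(fine_tune(instance.params, obs))
        instance = replace(instance, params=new_params)
        # S203-S204: build representations and produce an action
        # (both folded into the `decide` stand-in here).
        actions.append(decide(instance, obs))
    return actions
```

In this sketch the expensive call happens exactly once regardless of how many time steps follow, which is the property the embodiment relies on to keep the per-cycle cost low.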
[0119] Example 7 This embodiment aims to address the increased computational overhead of vision-language-action models caused by the introduction of additional digital gene representations. As an optional implementation, this embodiment employs model distillation techniques to reduce the computational complexity and resource consumption of the final deployed model while maintaining task performance.
[0120] In Example 4, the length of the input sequence for the vision-language-action model increases due to the addition of digital gene structure information representation, region-based knowledge representation, and pose-based knowledge representation. In Transformer-based architectures, computational complexity is typically proportional to the square of the input sequence length. This means that additional input significantly increases the computational burden and memory footprint of the model, posing a challenge for deployment on computationally limited robotic platforms such as mobile robots or embedded systems.
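The quadratic relationship mentioned above can be made concrete with a small calculation. The sketch below covers only the self-attention term, which scales with the square of the sequence length; the function name and the token counts in the usage note are hypothetical and are not taken from this application.

```python
def attention_cost_ratio(base_tokens: int, extra_tokens: int) -> float:
    """Relative self-attention cost after appending extra tokens.

    Assumes cost proportional to the square of the sequence length, so the
    ratio is ((base + extra) / base) ** 2.
    """
    if base_tokens <= 0 or extra_tokens < 0:
        raise ValueError("base_tokens must be positive and extra_tokens non-negative")
    return ((base_tokens + extra_tokens) / base_tokens) ** 2
```

For instance, under these assumptions, appending 256 tokens of digital gene representations to a 512-token input gives `attention_cost_ratio(512, 256) == 2.25`, that is, a 50% longer sequence costs 2.25 times as much in the attention layers.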
[0121] To address this challenge, this embodiment employs a knowledge distillation strategy, the core idea of which is to train a lightweight "student model" to mimic the behavior of a powerful "teacher model." The specific process is as follows: The first step is to train the teacher model. Following the complete method described in Example 4, a large-scale, high-performance "teacher vision-language-action model" is constructed and trained. This model can have a large number of Transformer layers, a large hidden layer dimension, and a large number of attention heads. During training, it receives all information, including visual, linguistic, and robot state information and all three types of digital gene representations. Through optimization on a large amount of training data, the teacher model is made to achieve the highest possible task success rate and generalization performance.
[0122] The second step is to train the student model. A simpler "student vision-language-action model" with fewer parameters is designed. For example, this student model could have fewer Transformer layers, a smaller hidden layer dimension, or fewer attention heads. The input to the student model is exactly the same as that of the teacher model, including all the digital gene representations.
[0123] When training the student model, its loss function is specially designed and comprises two parts: task loss and distillation loss. The task loss, as in regular training, measures the difference between the actions predicted by the student model and the actions demonstrated by real experts, for example, using cross-entropy loss or mean squared error loss. The distillation loss is the core of knowledge distillation and requires the student model to mimic the output of the teacher model. This mimicry can take various forms; for example, the probability distribution of actions output by the student model can be made as close as possible to the probability distribution of actions output by the teacher model (e.g., using KL divergence as the loss); or the student model can mimic not only the final output but also the hidden states or feature representations of the teacher model in the intermediate layers of the decoder. By minimizing the distillation loss, the student model is guided to learn the decision logic and knowledge representation methods inherent in the teacher model. The total loss function is a weighted sum of the task loss and the distillation loss. By training under this combined loss, the student model can learn to complete tasks while simultaneously absorbing and internalizing the complex decision logic that the teacher model has extracted from rich digital gene information.
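The combined loss described above can be sketched in a few lines. The following is an illustrative example under stated assumptions, not the loss actually used by the applicant: it assumes a discrete action distribution, uses cross-entropy as the task loss and KL(teacher || student) as the distillation loss, and weights the two terms by `alpha` and `1 - alpha`. The weighting scheme, the `temperature` parameter, and all function names are hypothetical.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float], temperature: float = 1.0) -> List[float]:
    """Numerically stable softmax with an optional temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_probs: Sequence[float], expert_index: int) -> float:
    """Task loss: negative log-probability of the expert's action."""
    return -math.log(student_probs[expert_index])


def kl_divergence(teacher_probs: Sequence[float], student_probs: Sequence[float]) -> float:
    """Distillation loss: KL(teacher || student) over discrete actions."""
    return sum(
        p * math.log(p / q)
        for p, q in zip(teacher_probs, student_probs)
        if p > 0.0
    )


def total_loss(
    student_logits: Sequence[float],
    teacher_logits: Sequence[float],
    expert_index: int,
    alpha: float = 0.5,
    temperature: float = 1.0,
) -> float:
    """Weighted sum of the task loss and the distillation loss."""
    task = cross_entropy(softmax(student_logits), expert_index)
    distill = kl_divergence(
        softmax(teacher_logits, temperature),
        softmax(student_logits, temperature),
    )
    return alpha * task + (1.0 - alpha) * distill
```

When the student's logits equal the teacher's, the distillation term vanishes and only the weighted task loss remains; a feature-level variant would add a further term comparing intermediate hidden states, which is omitted here.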
[0124] The third step is to deploy the student model. After training, this lightweight student model is deployed on the actual robot control system, i.e., the computing device, to perform online fusion decision-making and action generation. At runtime, the front end of the system (from perception to digital gene representation generation) is exactly the same as in Example 4, but the decisions are ultimately made by the more computationally efficient student model.
[0125] The beneficial effect of this embodiment is that, through knowledge distillation, the powerful capabilities of the teacher model are transferred to a small-scale student model. The final deployed student model is significantly faster in inference than the teacher model and has a smaller memory footprint. Moreover, because it has learned the decision-making logic of the teacher model, its task success rate typically decreases only slightly and is far higher than that of a small model of the same size trained from scratch. This makes the knowledge augmentation method proposed in this application easier to deploy on various computationally limited robotic platforms, improving the practicality and universality of the solution.
[0126] Those skilled in the art will understand that, besides implementing the system and its various devices, modules, and units provided by this invention in the form of purely computer-readable program code, the same functions can be achieved entirely through logical programming of the method steps, making the system and its various devices, modules, and units of this invention function in the form of logic gates, switches, application-specific integrated circuits, programmable logic controllers, and embedded microcontrollers. Therefore, the system and its various devices, modules, and units provided by this invention can be considered as a hardware component, and the devices, modules, and units included therein for implementing various functions can also be considered as structures within the hardware component; alternatively, the devices, modules, and units for implementing various functions can be considered as both software modules implementing the method and structures within the hardware component.
[0127] Example 8 The 3D object multimodal knowledge automatic generation and annotation method and system described in Examples 1 and 2 can generate an unlimited number of new objects as digital assets based on existing digital gene object representations through randomization methods and geometric detail filling. Furthermore, since the objects generated in this process inherently possess digital gene object representations, they can be directly applied to the knowledge annotation process. Digital assets refer to files stored in a computer system through digital encoding, which can be used by the computer to retrieve the required information during runtime.
[0128] The aforementioned "randomization method" includes: 1) randomly perturbing the parameters of the digital gene instances inside the existing digital gene object representation; and 2) randomly replacing the digital gene instances inside the digital gene object.
[0129] The process of "replacing a digital gene instance within a digital gene object representation" is performed as follows: 1) Randomly select a digital gene instance representing a structure / component of an object from the digital gene object representation as the object to be replaced. 2) Randomly select a digital gene from the digital gene repository that represents a structure similar to but different from the object to be replaced as the replacement object. 3) Based on the macroscopic information such as the geometric scale of the object to be replaced (the digital gene instance), determine the relevant parameters that determine the macroscopic scale of the replacement object, so that the replacement object and the object to be replaced have similar scales in space. 4) Randomly fill in the remaining parameters of the replacement object within a reasonable range to obtain a new digital gene instance. 5) Replace the object to be replaced with the digital gene instance of the replacement object.
[0130] The “reasonable range” refers to the range of parameters that ensure the shape of the instantiated digital gene conforms to physical rules, and can be specified when writing the digital gene code template.
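The two randomization methods and the five-step replacement procedure can be sketched as follows. This is a simplified illustration with several assumptions that are not stated in this application: "similar but different" structures are modeled as templates sharing a `structure` tag, the macroscopic scale is reduced to a single named parameter (`scale_param`), and the "reasonable range" is a per-parameter (min, max) pair stored on the template. All class and function names are hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class GeneTemplate:
    name: str
    structure: str                           # e.g. "handle"; used to judge similarity
    scale_param: str                         # parameter that fixes the overall scale
    ranges: Dict[str, Tuple[float, float]]   # "reasonable range" per parameter


@dataclass
class GeneInstance:
    template: GeneTemplate
    params: Dict[str, float]


def clamp(value: float, lo: float, hi: float) -> float:
    return min(max(value, lo), hi)


def perturb_instance(inst: GeneInstance, rng: random.Random,
                     rel_noise: float = 0.1) -> GeneInstance:
    """Method 1: randomly perturb parameters, staying inside the reasonable range."""
    params = {}
    for name, value in inst.params.items():
        lo, hi = inst.template.ranges[name]
        noisy = value * (1.0 + rng.uniform(-rel_noise, rel_noise))
        params[name] = clamp(noisy, lo, hi)
    return GeneInstance(inst.template, params)


def replace_instance(obj: List[GeneInstance], repository: List[GeneTemplate],
                     rng: random.Random) -> List[GeneInstance]:
    """Method 2: replace one instance with a similar but different digital gene."""
    idx = rng.randrange(len(obj))                        # step 1: pick the target
    old = obj[idx]
    candidates = [t for t in repository
                  if t.structure == old.template.structure
                  and t.name != old.template.name]
    if not candidates:
        return list(obj)                                 # nothing suitable to swap in
    new_t = rng.choice(candidates)                       # step 2: pick the replacement
    lo, hi = new_t.ranges[new_t.scale_param]
    old_scale = old.params[old.template.scale_param]
    params = {new_t.scale_param: clamp(old_scale, lo, hi)}   # step 3: match scale
    for name, (p_lo, p_hi) in new_t.ranges.items():      # step 4: fill the rest
        if name not in params:
            params[name] = rng.uniform(p_lo, p_hi)
    new_obj = list(obj)
    new_obj[idx] = GeneInstance(new_t, params)           # step 5: substitute
    return new_obj
```

Both functions return new objects and leave their inputs unchanged, so the same source representation can be reused to generate many variants.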
[0131] "Geometric detail filling" refers to adding subtle deformations to the surface of a digital gene object representation to make it more consistent with the surface features of a real object. These "subtle deformations" are obtained through the correspondence between the digital gene object representation and the object itself. Let x be a point on the surface of the digital gene object representation, and y be its corresponding point on the object itself; then y - x represents the subtle deformation of point x on the surface of the digital gene object representation. For a new digital gene object representation obtained by applying randomization to an existing representation, let x' denote the point on the new representation that corresponds to x; the correspondence between x and x' can be obtained in the following ways: 1) For "randomly perturbing the parameters of digital gene instances within existing representations," the correspondence can be obtained by tracking parameter changes. 2) For "randomly replacing digital gene instances within a digital gene object," the correspondence can be obtained from the azimuth angles, in spherical coordinates, of points on the surfaces of the replacement instance and the replaced instance; points with the same azimuth angle are considered to correspond. "Parameter change tracking" refers to writing the spatial coordinates of a specified point on a given instance as a function of the parameters, without changing the digital gene, so that the corresponding point can still be obtained after the parameters change. A simplified example of "writing the spatial coordinates of a specified point on a given instance as a function of parameters" is as follows: Consider a cuboid centered at the origin, with its shape determined by its length, width, and height (L, W, H, corresponding to the X, Y, and Z axes respectively). When L = W = H = 1, the point x at (0.5, 0.2, 0.3) on the surface of the corresponding shape (a cube) can be written as (0.5L, 0.2W, 0.3H). When the length, width, and height change to L = 2, W = 1.2, and H = 1.6, the corresponding point of x becomes (1, 0.24, 0.48).
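The cuboid example can be reproduced with a short sketch of parameter change tracking. The function names are hypothetical; `transfer_deformation` additionally assumes that the subtle deformation y - x is carried over unchanged to the corresponding point, which this application does not state explicitly.

```python
from typing import Tuple

Vec3 = Tuple[float, float, float]


def as_fractions(point: Vec3, dims: Vec3) -> Vec3:
    """Write a point as multiples of (L, W, H), e.g. (0.5L, 0.2W, 0.3H)."""
    return tuple(p / d for p, d in zip(point, dims))


def track_point(point: Vec3, old_dims: Vec3, new_dims: Vec3) -> Vec3:
    """Find the corresponding point after the cuboid parameters change."""
    fractions = as_fractions(point, old_dims)
    return tuple(f * d for f, d in zip(fractions, new_dims))


def transfer_deformation(x: Vec3, y: Vec3, x_new: Vec3) -> Vec3:
    """Assumption: apply the deformation y - x at the corresponding point."""
    return tuple(xn + (yi - xi) for xn, yi, xi in zip(x_new, y, x))


# The example from the text: L = W = H = 1 changes to L = 2, W = 1.2, H = 1.6.
moved = track_point((0.5, 0.2, 0.3), (1.0, 1.0, 1.0), (2.0, 1.2, 1.6))
```

Up to floating-point rounding, `moved` is (1, 0.24, 0.48), matching the values given in the example.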
[0132] Example 9 Building upon Example 1, this example further provides a vision-language-action robot control method based on object digital genes. First, the parameter information of the digital genes of the object in the image is acquired. Second, the digital genes are encoded using a neural network. Finally, multimodal features are combined with the object's digital gene encoding, and the robot's end-effector trajectory is output through a transformer network based on a diffusion model. Traditional large-scale robot models employ a vision-language-action scheme; this application adopts a knowledge-vision-language-action scheme.
[0133] In other embodiments, a computer-readable storage medium storing a computer program may also be provided, which, when executed by a processor, implements the steps of the above-described vision-language-action robot control method based on object digital genes.
[0134] Building upon Example 2, this example further provides a vision-language-action robot control system based on object digital genes. First, it acquires the parameter information of the digital genes of objects in an image. Second, it encodes the digital genes using a neural network. Finally, it combines multimodal features with the object's digital gene encoding, and outputs the robot's end-effector trajectory through a transformer network based on a diffusion model. Traditional large-scale robot models employ a vision-language-action scheme; this application adopts a knowledge-vision-language-action scheme.
[0135] Specific embodiments of the present invention have been described above. It should be understood that the present invention is not limited to the specific embodiments described above, and those skilled in the art can make various changes or modifications within the scope of the claims, which do not affect the essence of the present invention. Unless otherwise specified, the embodiments and features described in this application can be arbitrarily combined with each other.
Claims
1. A method for automatic generation and annotation of multimodal knowledge, characterized in that it includes: Step 1: Parameterize the different structures of the object using digital genes to obtain the digital gene object representation; Step 2: Transfer the knowledge annotations defined in the digital gene to the object; The digital gene includes: a parameterized template with knowledge annotation defined by mathematical rules, written in computer-executable code, which describes a pre-defined connection structure in the real world through spatial pose constraints and describes the general characteristics of the pre-defined structure in the real world through a combination of basic geometric shapes.
2. The method for automatic generation and annotation of multimodal knowledge according to claim 1, characterized in that, Step 1 includes: Step 101: Display the object in the interactive interface, establish a three-dimensional coordinate axis with the center of the object as the origin and the normal pose as the reference, and determine the category to which the object belongs. Step 102: Based on the determined category, read the structural composition of the object and prompt the user to select the corresponding digital gene from the digital gene repository based on the structural composition. Then, input and adjust the parameters corresponding to each selected digital gene.
3. The method for automatic generation and annotation of multimodal knowledge according to claim 1, characterized in that, The knowledge annotation includes regional annotation; Transferring the regional annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Sample a large number of discrete points on the surfaces of both the digital gene object representation and the object itself; 3) For any point on the surface of the object, select the closest point from the points sampled on the surface of the digital gene object representation, and establish a correspondence; 4) Using the correspondence, the regional knowledge contained in the digital gene object representation and defined in the digital gene is transferred to the corresponding position of the object.
4. The method for automatic generation and annotation of multimodal knowledge according to claim 1, characterized in that, The knowledge annotation includes pose annotation; Transferring the pose annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Directly label the pose knowledge defined in the digital gene, which is contained in the digital gene object representation, onto the object.
5. A multimodal knowledge automatic generation and annotation system, characterized in that it includes: Module M1: Parameterizes different structures of an object using digital genes to obtain a digital gene object representation; Module M2: Transfers the knowledge annotations defined in the digital gene to the object; The digital gene includes: a parameterized template with knowledge annotation defined by mathematical rules, written in computer-executable code, which describes a pre-defined connection structure in the real world through spatial pose constraints and describes the general characteristics of the pre-defined structure in the real world through a combination of basic geometric shapes.
6. The multimodal knowledge automatic generation and annotation system according to claim 5, characterized in that, The module M1 includes: The object is displayed in the interactive interface, and a three-dimensional coordinate axis is established with the center of the object as the origin and the normal pose as the reference, and the category to which the object belongs is determined. Based on the determined category, the system reads the structural composition of the object and prompts the user to select the corresponding digital gene from the digital gene repository. The user then inputs and adjusts the parameters corresponding to each selected digital gene.
7. The multimodal knowledge automatic generation and annotation system according to claim 5, characterized in that, The knowledge annotation includes regional annotation; Transferring the regional annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Sample a large number of discrete points on the surfaces of both the digital gene object representation and the object itself; 3) For any point on the surface of the object, select the closest point from the points sampled on the surface of the digital gene object representation, and establish a correspondence; 4) Using the correspondence, the regional knowledge contained in the digital gene object representation and defined in the digital gene is transferred to the corresponding position of the object.
8. The multimodal knowledge automatic generation and annotation system according to claim 5, characterized in that, The knowledge annotation includes pose annotation; Transferring the pose annotations to the object includes: 1) Align the center of the digital gene object representation with the center of the object itself, and restore both to their normal pose; 2) Directly label the pose knowledge defined in the digital gene, which is contained in the digital gene object representation, onto the object.
9. A computer-readable storage medium storing a computer program, characterized in that, When the computer program is executed by a processor, it implements the steps of the multimodal knowledge automatic generation and annotation method according to any one of claims 1-4.
10. A robot control method, characterized in that it includes the following steps: Point cloud acquisition step: acquire visual observation information of the current environment, natural language task instructions and robot status information, and identify and segment the target object based on the visual observation information and task instructions to generate three-dimensional point cloud data of the target object; Parameterization step: Based on the point cloud data of the target object, determine the preset digital gene template ID that matches the target object and the corresponding template parameters, which together constitute the parameterized structural representation of the target object; Digital gene representation step: Based on the parameterized structural representation, generate a digital gene representation including at least one of structural information representation and knowledge information representation using the method described in any one of claims 1-4; Fusion step: The digital gene representation is fused with at least one feature selected from the following through a VLA model to generate fused features: visual features generated based on the visual observation information, language features generated based on the task instructions, and robot state features; Decoding step: Input the fused features into the action decoder to generate a sequence of action instructions that the robot can execute.
11. The robot control method according to claim 10, characterized in that, The step of determining the parameterized structure representation includes: The point cloud data is processed by a point cloud coding network to extract high-dimensional features; Based on the high-dimensional features, the digital gene template ID is determined through classification; Based on the high-dimensional features, the template parameters are obtained through a parameter estimation network.
12. The robot control method according to claim 10, characterized in that, The steps for generating the 3D point cloud data of the target object include: The task instructions are analyzed using a large language model to generate prompt words; The three-dimensional point cloud data is generated using object recognition and semantic segmentation models based on the prompts and visual observation information.
13. The robot control method according to claim 10, characterized in that, The knowledge information representation includes: Regional knowledge representation generated by encoding predefined functional area information; A pose-based knowledge representation generated by encoding predefined interactive pose information.
14. The robot control method according to claim 13, characterized in that, When the knowledge information representation includes the regional knowledge representation, the steps for generating the regional knowledge representation include: The functional region information bound to the parameterized structure representation is matched and transferred to the relevant points constituting the target object point cloud according to the nearest point principle. This results in: for each type of functional region information, each point in the target object point cloud has a binary label corresponding to that function. This label can be regarded as the color of the point cloud. The colored point cloud is then reprojected into the image space through the camera's intrinsic and extrinsic parameters to obtain a two-dimensional mask image. The image is further encoded through an image coding neural network. When the knowledge information representation includes the pose-type knowledge representation, a multilayer perceptron is used for feature extraction.
15. The robot control method according to claim 10, characterized in that, The digital gene representation step adopts a diffusion model, which fine-tunes the digital gene representation at the current moment based on the digital gene representation at the previous moment and the visual observation information at the current moment.
16. The robot control method according to claim 10, characterized in that, The VLA model is a model obtained by model distillation.
17. A robot control system, characterized in that it includes: Point cloud acquisition module: Acquires visual observation information of the current environment, natural language task instructions and robot status information, identifies and segments the target object based on the visual observation information and task instructions, and generates three-dimensional point cloud data of the target object; Parameterization module: Based on the point cloud data of the target object, determines the preset digital gene template ID that matches the target object and the corresponding template parameters, which together constitute the parameterized structural representation of the target object; Digital gene representation module: Based on the parameterized structural representation, generates a digital gene representation including at least one of structural information representation and knowledge information representation using the method described in any one of claims 1; Fusion module: Fuses the digital gene representation with at least one feature selected from the following to generate fused features: visual features generated based on the visual observation information, language features generated based on the task instructions, and robot state features; Decoding module: Inputs the fused features into the action decoder to generate a sequence of action instructions that the robot can execute.