Method for generating enhanced image, method for training policy network, system and device

By using reference frames from expert trajectories to generate pixel-level masks and diverse background images for robot operation tasks, the method helps robots distinguish target objects in complex backgrounds, improving operation accuracy and data coverage while reducing acquisition costs.

CN122243750A · Pending · Publication Date: 2026-06-19 · BEIJING HUMANOID ROBOTICS INNOVATION CENTER CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
BEIJING HUMANOID ROBOTICS INNOVATION CENTER CO LTD
Filing Date
2026-02-04
Publication Date
2026-06-19

Abstract

This disclosure provides a method for generating enhanced images, a method for training a policy network, and a system. The method for generating enhanced images includes: acquiring reference frames for a robot operation task; determining anchor frames from the target trajectory and generating candidate bounding boxes for the anchor frames; determining the annotation information of the candidate bounding boxes based on the similarity between the candidate bounding box regions and the labeled bounding box regions; converting the candidate bounding boxes into pixel-level masks and propagating the masks to other frames on the target trajectory; and generating an enhanced image with annotation information based on the background image and the mask. This method can expand the amount of training data in the robot training set and efficiently generate complete and high-quality enhanced images. It solves the problems of high data acquisition costs and difficulty in covering long-tail scenes in the prior art. By introducing a variety of different background images, the semantic diversity of the training images is improved.

Description

Technical Field

[0001] This disclosure relates to the field of robot operation, and more particularly to a method for generating enhanced images, a method for training policy networks, a system, and an apparatus.

Background Technology

[0002] With the rapid development of robot manipulation technology, enabling robots to adapt to unstructured environments with complex backgrounds, changing lighting, and interference has become a core challenge in realizing general-purpose intelligent robots. Currently, the industry mainly adopts two technical approaches. One approach is to pre-train on large-scale collections of real data, but this incurs extremely high equipment and labor costs and can hardly cover the effectively unbounded long tail of scenarios. The other approach is to use weak data augmentation techniques such as random cropping and color jittering. Although these methods can improve robustness, they only change low-level pixel information, lack the necessary semantic diversity, and can hardly bridge the large gap between the training environment and the actual deployment environment.

[0003] In addition, existing augmentation methods generally suffer from the following problems: when visual encoders are trained on datasets with such augmentation, they often have difficulty accurately distinguishing target objects from distractors in semantically complex backgrounds, leading to divergent model attention and a significant decrease in operational accuracy in environments with strong interference.

[0004] Therefore, how to reduce the cost of large-scale data collection while introducing diverse backgrounds without causing semantic damage, so as to effectively guide the visual encoder to focus on the task-critical space, is a technical problem that urgently needs to be solved to improve the generalization ability of robot operations.

Summary of the Invention

[0005] This disclosure provides a method for generating enhanced images, a method for training policy networks, a system, and an apparatus for providing high-quality training images for robot training.

[0006] In view of the above problems, in a first aspect, embodiments of this disclosure provide a method for generating enhanced images, including:
obtaining reference frames from the expert trajectory corresponding to the robot operation task, wherein the task-related regions in the reference frames are pre-labeled with bounding boxes;
determining anchor frames from the target trajectory, and generating candidate bounding boxes for the anchor frames using a preset detector;
determining the annotation information of the candidate bounding boxes based on the similarity between the candidate bounding box regions and the annotated bounding box regions in the reference frames;
converting the candidate bounding boxes into pixel-level masks, and propagating the masks temporally to other frames of the target trajectory to complete the annotation of the corresponding bounding boxes in the other frames; and
constructing a background image, and generating an enhanced image with annotation information based on the background image and the mask in each frame of the target trajectory.

[0007] In conjunction with the first aspect, in one possible implementation, determining the annotation information of the candidate bounding box based on the similarity between the candidate bounding box region and the annotated bounding box region in the reference frame includes:
extracting the first feature embedding of the task-related region using a preset visual model;
extracting the second feature embedding of the candidate bounding box region;
determining the similarity between the second feature embedding and the first feature embedding; and
for each second feature embedding, determining the annotation information corresponding to the first feature embedding whose similarity is greater than a preset threshold as the annotation information of the candidate bounding box corresponding to that second feature embedding.

[0008] In conjunction with the first aspect, in one possible implementation, converting the candidate bounding box into a pixel-level mask includes: invoking a preset large model and using the candidate bounding boxes of the anchor frame as prompt information to generate pixel-level masks for the candidate bounding boxes, so that the masks cover the task-related regions within the anchor frame.

[0009] In conjunction with the first aspect, in one possible implementation, constructing a background image and generating an enhanced image with annotation information based on the background image and the masks in each frame of the target trajectory includes:
building a background description template library containing various materials;
invoking a preset large language model to generate prompt words based on the templates in the template library;
invoking a preset text-to-image model to generate a complete background image based on the prompt words; and
superimposing the foreground region corresponding to the mask in the frame onto the background image to generate an enhanced image.

[0010] A second aspect of this disclosure provides a method for training a policy network, comprising:
acquiring training images from an enhanced image set and extracting image features, wherein the enhanced image set is constructed based on the enhanced image generation method described in any implementation of the first aspect; and
training the policy network using the image features as input and at least the annotation information corresponding to the image features as ground truth, wherein the policy network is used to receive images and/or states observed by the robot and output the robot's control actions.

[0011] In conjunction with the second aspect, in one possible implementation, acquiring training images from the enhanced image set and extracting image features includes:
sampling a target enhanced image from the enhanced image set;
based on target annotation information, determining the mask of the corresponding category on the target enhanced image, and generating a mask image that includes the corresponding task-related region;
extracting, based on visual encoding, the full-image features of the target enhanced image and the initial object features of the mask image, respectively; and
based on the full-image features and the initial object features, traversing the mask images using an attention enhancement algorithm to enhance the initial object features, thereby obtaining the enhanced features of the mask images and an enhanced image feature set.
Using the image features as input to the policy network includes: using the enhanced image feature set as the input to the policy network.

[0012] In conjunction with the second aspect, in one possible implementation, after obtaining the enhanced image feature set, the method further includes: based on the annotation information of the mask images, grouping the enhanced features of mask images with the same semantic features into positive sample pairs, and grouping the enhanced features of mask images with different semantic features into negative sample pairs. Using the image features as input to the policy network includes: using the positive sample pairs and the negative sample pairs as inputs to the policy network.

[0013] In conjunction with the second aspect, in one possible implementation, the loss is determined during the training of the policy network in the following manner: calculating the image contrastive loss based on the recognition results of the training images and the semantics of the corresponding training images; and determining the loss of the policy network based on the contrastive loss and the behavior cloning loss.

[0014] A third aspect of this disclosure provides an enhanced image generation system, comprising:
an acquisition module, used to acquire reference frames in the expert trajectory corresponding to the robot operation task, wherein bounding boxes are pre-labeled for the task-related regions in the reference frames;
a boundary generation module, used to determine anchor frames from the target trajectory and generate candidate bounding boxes for the anchor frames using a preset detector;
a determination module, used to determine the annotation information of the candidate bounding box based on the similarity between the candidate bounding box region and the annotated bounding box region in the reference frame;
a propagation module, used to convert the candidate bounding box into a pixel-level mask and propagate the mask temporally to other frames of the target trajectory to complete the annotation of the corresponding bounding box in the other frames; and
an image generation module, used to construct a background image and generate an enhanced image with annotation information based on the background image and the mask in each frame of the target trajectory.

[0015] A fourth aspect of this disclosure provides a training system for a policy network, comprising:
an extraction module, used to acquire training images from the enhanced image set and extract image features, wherein the enhanced image set is constructed based on the enhanced image generation method described in any implementation of the first aspect; and
a training module, used to train the policy network by using the image features as input and at least the annotation information corresponding to the image features as ground truth, wherein the policy network is used to receive images and/or states observed by the robot and output the robot's control actions.

[0016] A fifth aspect of this disclosure provides an electronic device, including a processor, a memory, and a bus. The memory stores machine-readable instructions that can be executed by the processor. When the electronic device is running, the processor and the memory communicate via the bus. When the machine-readable instructions are executed by the processor, they perform the steps of the enhanced image generation method described in any implementation of the first aspect, and/or the steps of the policy network training method described in the second aspect.

[0017] The beneficial effects of the embodiments disclosed herein include: This disclosure provides a method for generating enhanced images, a method for training a policy network, a system, and an apparatus. The method for generating enhanced images includes: acquiring reference frames in an expert trajectory corresponding to a robot operation task; wherein, bounding boxes are pre-annotated for task-related regions in the reference frames; determining anchor frames from the target trajectory and generating candidate bounding boxes for the anchor frames using a preset detector; determining annotation information for the candidate bounding boxes based on the similarity between the candidate bounding box regions and the annotated bounding box regions in the reference frames; converting the candidate bounding boxes into pixel-level masks and propagating the masks temporally to other frames of the target trajectory to complete the annotation of corresponding bounding boxes in the other frames; constructing a background image and generating an enhanced image with annotation information based on the background image and the masks in each frame of the target trajectory. The enhanced image generation method provided in this disclosure transfers task-related regions and their annotations from the expert trajectory reference frame to the anchor frame based on the similarity between the bounding box region in the expert trajectory reference frame and the candidate bounding box region in the target trajectory anchor frame. Furthermore, it transfers the task-related regions and their annotations to other frames in the target trajectory based on the temporal relationship between the anchor frame and those frames. Finally, it generates an enhanced image with annotations based on the mask and background image corresponding to the task-related regions in the target trajectory frame. 
This method enables faster and more accurate acquisition and annotation of the task-related regions in each frame of the target trajectory. By constructing a background image and combining it with this information, enhanced images are generated, which rapidly expands the number of images and efficiently produces complete, high-quality enhanced images. The enhanced images are then used as training data for the robot policy model, solving the problems of high data acquisition costs and difficulty in covering long-tail scenarios in existing technologies. Introducing various background images improves the semantic diversity of the training images. Additionally, the division and annotation of bounding box regions can provide visual guidance toward task-related regions, enabling the policy model to accurately distinguish target objects from distractors in the background and increasing the robot's operational accuracy.

Description of the Drawings

[0018] Figure 1 is a schematic flowchart of the method for generating enhanced images provided in an embodiment of this disclosure;
Figure 2 is a flowchart of the method for training a policy network provided in an embodiment of this disclosure;
Figure 3 is a schematic diagram of the structure of the enhanced image generation system and the policy network training system provided in an embodiment of this disclosure;
Figure 4 shows image examples of reference frames provided in an embodiment of this disclosure;
Figure 5 shows an example of a pixel-level mask provided in an embodiment of this disclosure;
Figure 6 shows examples of enhanced images provided in an embodiment of this disclosure;
Figure 7 shows examples of mask images provided in an embodiment of this disclosure;
Figure 8 is a schematic diagram of the structure of the enhanced image generation system provided in an embodiment of this disclosure;
Figure 9 is a schematic diagram of the structure of the training system for a policy network provided in an embodiment of this disclosure.

Detailed Implementation

[0019] This disclosure provides a method for generating enhanced images, a method for training a policy network, a system, and a device. Preferred embodiments of this disclosure are described below with reference to the accompanying drawings. It should be understood that the preferred embodiments described herein are for illustrative and explanatory purposes only and are not intended to limit this disclosure. Furthermore, the embodiments and features described in this application can be combined with each other unless otherwise specified.

[0020] This disclosure provides a method for generating enhanced images which, as shown in Figure 1, can be implemented as follows:
S101. Obtain a reference frame from the expert trajectory corresponding to the robot operation task, wherein the task-related regions in the reference frame have been pre-labeled with bounding boxes;
S102. Determine anchor frames from the target trajectory and generate candidate bounding boxes for the anchor frames using a preset detector;
S103. Determine the annotation information of the candidate bounding box based on the similarity between the candidate bounding box region and the annotated bounding box region in the reference frame;
S104. Convert the candidate bounding box into a pixel-level mask, and propagate the mask temporally to other frames of the target trajectory to complete the annotation of the corresponding bounding box in the other frames;
S105. Construct a background image, and based on the background image and the mask in each frame of the target trajectory, generate an enhanced image with annotation information.

[0021] In this embodiment of the disclosure, an expert trajectory refers to a demonstration, in the field of robot operation and training, that guides a robot to complete a specific task. The demonstration may include a continuous sequence representing the robot's states and actions during the execution of the specific task. The expert trajectory may be recorded while the robot is controlled by a human operator; it may be recorded through kinesthetic teaching, in which the robot is physically guided by external force; or it may be obtained by training in a simulator based on a control algorithm.

[0022] Specifically, an expert trajectory can be a sequence that evolves over time. Its state can be represented by images captured by the robot's camera, the angles of the robot's joints, the positions of the robot's actuators, and the pose of the target object. Its actions can be represented by current commands from the robot's motors, vectors of the robot's arm movement, or opening and closing controls performed by the robot.
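
As an illustration of the state and action sequence described above, the following is a minimal Python sketch of one way a trajectory could be held in memory. The class names `Step` and `Trajectory`, the field names, and the array shapes are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass, field
from typing import List

import numpy as np


@dataclass
class Step:
    """One time step of a trajectory (hypothetical layout)."""
    image: np.ndarray         # camera image, H x W x 3
    joint_angles: np.ndarray  # angles of the robot's joints
    action: np.ndarray        # e.g. arm movement vector plus open/close control


@dataclass
class Trajectory:
    """A time-ordered sequence of steps for one execution of a task."""
    steps: List[Step] = field(default_factory=list)

    def frames(self) -> List[np.ndarray]:
        """Return the image sequence of the trajectory in time order."""
        return [s.image for s in self.steps]

    def frame_at(self, index: int = 0) -> np.ndarray:
        """Return the image at a preset frame position (first frame by default)."""
        return self.steps[index].image
```

A preset frame position such as the first frame can then be read with `frame_at(0)`.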

[0023] For the purposes of this disclosure, the aforementioned reference frame may refer to the frame at a predetermined position within the continuous image sequence, included in the expert trajectory, that the robot recorded while executing the task. In one possible implementation, this frame position may be set as the first frame of the image sequence.

[0024] Taking the first frame as an example, the first frame of the continuous image sequence is extracted, and the task-related regions in it are boxed and labeled. The task-related regions refer to the parts of the image that are strongly related to the robot's operation task (e.g., the robotic arm and the target object of the operation task appearing in the frame). In one possible implementation, this process can employ manual bounding box annotation. By drawing bounding boxes around the task-related regions in the image and annotating the object within each bounding box with its corresponding semantics, a reference frame with bounding box annotations can be obtained.

[0025] It should be noted that an expert trajectory can be a set composed of multiple trajectories. Multiple trajectories belonging to the same expert trajectory can all be trajectories for completing the same specific task, and they can differ from one another in environmental interference and operational strategy. A preset frame position in the image sequence of each trajectory (target trajectory) can be selected as an anchor frame. A preset detector is applied to each anchor frame to identify the regions related to the robot task and to box them, obtaining a candidate bounding box for each region. The preset detector can be an open-set detector (e.g., Grounding DINO), which can select regions in the target image based on input text prompts. For the purposes of this disclosure, the text prompts can be descriptive text describing the targets to be found in the anchor frame (e.g., an input description of the robotic arm and the manipulated object). The detector can convert this text into vectors, search for regions in the image that are semantically close to the text description by comparing the image content with the text vectors, and select all regions in the image that may be related to the description.

[0026] Candidate bounding boxes selected by open-set detectors may suffer from misselection and a lack of task awareness (i.e., selecting similar objects that are not involved in the task). To address these issues, the image regions within the bounding boxes of the reference frame can be passed through a specific deep learning model (e.g., a CLIP-based visual encoder, DINOv2, or ResNet), which extracts image information (e.g., color, texture, and shape) and compresses it into a vector. Similarly, the contents within each candidate bounding box can be compressed into a vector. By comparing the similarity between the two vectors (e.g., cosine similarity), a correspondence can be established between task-related regions in the reference frame and regions in the anchor frame. This enables cross-trajectory alignment of task-related regions, transferring the annotation information from the reference frame to the anchor frame. It avoids the shortcomings of the open-set detector's bounding boxes and prevents the cost surge and efficiency reduction that manual bounding box annotation of all trajectories would cause.

[0027] Furthermore, a specific large model can be used to process each anchor frame. This large model can be a segmentation and tracking model (e.g., SAM-2, Cutie, or DEVA). Such models can draw on prior knowledge acquired through deep learning and use the candidate bounding boxes in the aligned anchor frames as prompts. Within each candidate bounding box, all pixels are traversed to distinguish the main body of the object from other irrelevant pixels. This transforms a candidate bounding box, which has only sparse vertex coordinates, into a dense mask region composed of the pixels that represent the main body of the object. The mask region can also carry the annotation information propagated from the reference frame, so that each anchor frame is semantically segmented at the pixel level. This approach prevents irrelevant background pixels within the candidate bounding boxes from affecting subsequent robot training, and improves the quality of robot training data by highlighting object details through the mask.

[0028] Furthermore, this type of segmentation and tracking model also possesses feature memory capabilities. The large model can package the mask information and corresponding visual features of a particular frame into memory data within consecutive images. When processing adjacent frames, the large model can predict the possible location of the object in the adjacent frame based on this memory data, generate a new mask at that location, and store it as new memory data. It should be noted that this process can propagate the mask of the anchor frame sequentially to the next frame or in reverse sequence to the previous frame, achieving bidirectional propagation of pixel-level masks and establishing a spatiotemporally consistent semantic segmentation result throughout the entire trajectory. Here, the propagation direction is related to the position of the anchor frame in the target trajectory.
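
The bidirectional propagation order described above can be sketched as follows. This is a minimal illustration only: `step` is a hypothetical callable standing in for the segmentation and tracking model's memory-based prediction from one frame to an adjacent frame, and no real model API is used or implied.

```python
from typing import Callable, Dict, TypeVar

Mask = TypeVar("Mask")


def propagate_bidirectional(
    num_frames: int,
    anchor_index: int,
    anchor_mask: Mask,
    step: Callable[[Mask, int, int], Mask],
) -> Dict[int, Mask]:
    """Propagate a mask from the anchor frame to every frame of a trajectory.

    step(prev_mask, src_index, dst_index) is a hypothetical stand-in for the
    tracking model: it predicts the mask of frame dst_index from the mask
    already known for the adjacent frame src_index.
    """
    if not 0 <= anchor_index < num_frames:
        raise ValueError("anchor_index must lie inside the trajectory")

    masks: Dict[int, Mask] = {anchor_index: anchor_mask}

    # Forward pass: anchor -> following frames, one adjacent frame at a time.
    for t in range(anchor_index + 1, num_frames):
        masks[t] = step(masks[t - 1], t - 1, t)

    # Backward pass: anchor -> preceding frames, in reverse order.
    for t in range(anchor_index - 1, -1, -1):
        masks[t] = step(masks[t + 1], t + 1, t)

    return masks
```

If the anchor is the first frame, only the forward loop runs; if it is the last frame, only the backward loop runs, which reflects how the propagation direction depends on the anchor position.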

[0029] Furthermore, different frames can be selected from the trajectory, and the image content at specific mask positions can be extracted from them using the annotation information. The extracted content is then composited with a pre-prepared background image to obtain an enhanced image. This method can expand the amount of training data in the robot's training set and efficiently generate complete, high-quality training images (i.e., enhanced images), thereby improving the generalization ability of the target robot through training. It solves the problems of high data acquisition costs and difficulty in covering long-tail scenes in existing technologies. The background image here can be generated using a text-to-image model, which allows various different background images to be introduced and thus improves the semantic diversity of the training images. Moreover, the enhanced image already includes the previously determined annotation information, which avoids the misjudgments that would arise from relying on an object detection model to re-separate objects in the training images during training.

[0030] In another embodiment of this disclosure, step S103 above, "determining the annotation information of the candidate bounding box based on the similarity between the candidate bounding box region and the annotated bounding box region in the reference frame," can be implemented as follows: Step 1: Extract the first feature embedding of the task-related region using a preset visual model; Step 2: Extract the second feature embedding of the candidate bounding box region; Step 3: Determine the similarity between the second feature embedding and the first feature embedding; Step 4: For each second feature embedding, the annotation information corresponding to the first feature embedding with a similarity greater than a preset threshold is determined as the annotation information of the candidate bounding box corresponding to the second feature embedding.

[0031] In this embodiment, the preset visual model can be a visual foundation model (e.g., DINOv2 or CLIP), which is pre-trained on large-scale data and can capture global dependencies between image pixels through its architecture. Based on the task-related regions boxed in advance in the reference frame, the images of the task-related regions can be pre-processed and then input into the visual foundation model. In this pre-processing, the images of the task-related regions are subjected to size normalization and tensor normalization to ensure that they meet the input requirements of the visual foundation model.
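
The size normalization and tensor normalization mentioned above can be sketched with NumPy as follows. The 224 x 224 target size, the nearest-neighbour resize, and the ImageNet channel statistics are assumptions made for this illustration; the disclosure does not specify them, and `preprocess_region` is a hypothetical helper name.

```python
import numpy as np

# Assumed channel statistics (the commonly used ImageNet values).
MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)


def preprocess_region(image: np.ndarray, box, size: int = 224) -> np.ndarray:
    """Crop a bounding box region and normalize it for a visual model.

    image: H x W x 3 uint8 array; box: (x0, y0, x1, y1) in pixel coordinates.
    Returns a float32 array of shape 3 x size x size.
    """
    x0, y0, x1, y1 = box
    crop = image[y0:y1, x0:x1]
    h, w = crop.shape[:2]
    if h == 0 or w == 0:
        raise ValueError("empty bounding box region")

    # Size normalization: nearest-neighbour resize to size x size.
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    resized = crop[rows][:, cols]

    # Tensor normalization: scale to [0, 1], then standardize per channel.
    tensor = resized.astype(np.float32) / 255.0
    tensor = (tensor - MEAN) / STD

    # Channels-first layout, as most vision backbones expect.
    return tensor.transpose(2, 0, 1)
```

A production pipeline would more likely use bilinear or bicubic resizing; nearest-neighbour is used here only to keep the sketch dependency-free.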

[0032] Taking DINOv2 as an example, after the task-related regions are fed into the visual foundation model, each can be further split into fixed-size pixel blocks that are mapped to a preliminary token sequence. The model's internal architecture can calculate the spatial correlation between these pixel blocks to capture the structural and texture information of the objects within each image. Finally, the last layer of the model outputs the reference feature embeddings $\{f_i^{\mathrm{ref}}\}$ for these task-related regions, where $f_i^{\mathrm{ref}} \in \mathbb{R}^d$ is the first feature embedding of the $i$-th task-related region in the reference frame, and $\mathbb{R}^d$ is a real vector space.
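
The step of splitting a region into fixed-size pixel blocks can be sketched as follows. A patch size of 14 is assumed here, and the learned linear projection that turns each flattened block into a token is omitted; `patchify` is a hypothetical helper, not part of any model's API.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 14) -> np.ndarray:
    """Split a C x H x W image into non-overlapping patch x patch blocks.

    Returns an array of shape (num_patches, C * patch * patch), one flattened
    block per row, ordered row by row over the image grid.
    """
    c, h, w = image.shape
    if h % patch or w % patch:
        raise ValueError("image height and width must be multiples of patch")
    gh, gw = h // patch, w // patch

    # (C, gh, patch, gw, patch) -> (gh, gw, C, patch, patch)
    blocks = image.reshape(c, gh, patch, gw, patch).transpose(1, 3, 0, 2, 4)
    return blocks.reshape(gh * gw, c * patch * patch)
```

For a 3 x 224 x 224 input this yields 256 blocks of length 588, which a vision transformer would then project to token embeddings.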

[0033] Similarly, the same method can be applied to the images within the candidate bounding box regions in the anchor frame to extract the second feature embedding $f_j^{\mathrm{anc}}$ for each candidate bounding box region, where $f_j^{\mathrm{anc}}$ denotes the second feature embedding of the $j$-th candidate bounding box region in the anchor frame.

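[0034] Then, the annotation information of each task-related region in the reference frame can be transferred to the corresponding position in the anchor frame through the following formula (1).

[0035] $\hat{y}_j = y^{\mathrm{ref}}_{i^*}, \quad i^* = \arg\max_i \, \mathrm{sim}\left(f_j^{\mathrm{anc}}, f_i^{\mathrm{ref}}\right)$ (1); where $\hat{y}_j$ is the annotation information of the $j$-th candidate bounding box region, $y^{\mathrm{ref}}_i$ is the annotation information of the $i$-th task-related region in the reference frame, $f_i^{\mathrm{ref}}$ is the first feature embedding of that region, $f_j^{\mathrm{anc}}$ is the second feature embedding of the $j$-th candidate bounding box region in the anchor frame, and $\mathrm{sim}(\cdot,\cdot)$ is the similarity between two embeddings. Formula (1) expresses that, for the $j$-th second feature embedding in the anchor frame, a first feature embedding in the reference frame whose similarity exceeds a preset threshold is found, and the annotation information of that first feature embedding is passed to the second feature embedding.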

[0036] In one possible implementation, the aforementioned similarity exceeding a preset threshold can mean that the first feature embedding is the one closest to the given second feature embedding. Since feature embeddings exist in vector form, the above process amounts to finding, among the first feature embeddings $\{f_i^{\mathrm{ref}}\}$ of the reference frame, the vector that is closest to the second feature embedding $f_j^{\mathrm{anc}}$. The above method enables the alignment of annotation information between regions across trajectories without the need for training. Only bounding box annotation of a single reference frame is required to transfer the annotation information to other trajectories, significantly reducing manual annotation costs and improving efficiency.
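
The training-free label transfer described above can be sketched as a nearest-neighbour lookup over embeddings. Cosine similarity is one of the similarity measures the description mentions; the threshold value of 0.5 and the function name `transfer_labels` are assumptions for this illustration.

```python
from typing import List, Optional, Sequence

import numpy as np


def transfer_labels(
    ref_embeddings: np.ndarray,
    ref_labels: Sequence[str],
    cand_embeddings: np.ndarray,
    threshold: float = 0.5,
) -> List[Optional[str]]:
    """Assign each candidate region the label of its most similar reference region.

    ref_embeddings: (num_ref, d) first feature embeddings from the reference frame.
    cand_embeddings: (num_cand, d) second feature embeddings from the anchor frame.
    A candidate whose best similarity does not exceed threshold gets None.
    """
    # L2-normalize so that the dot product equals cosine similarity.
    ref = ref_embeddings / np.linalg.norm(ref_embeddings, axis=1, keepdims=True)
    cand = cand_embeddings / np.linalg.norm(cand_embeddings, axis=1, keepdims=True)

    sim = cand @ ref.T          # shape (num_cand, num_ref)
    best = sim.argmax(axis=1)   # closest reference region per candidate

    labels: List[Optional[str]] = []
    for j, i in enumerate(best):
        labels.append(ref_labels[i] if sim[j, i] > threshold else None)
    return labels
```

Candidates that return `None` correspond to boxes the detector selected but that match no annotated task-related region, such as similar objects not involved in the task.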

[0037] In another embodiment of this disclosure, step S104 above, "converting the candidate bounding box into a pixel-level mask," can be implemented as follows: A preset large model is invoked, and the candidate bounding boxes of the anchor frame are used as prompt information to generate pixel-level masks for the candidate bounding boxes to cover the task-related areas within the anchor frame.

[0038] In this embodiment, the pre-defined large model can be a segmentation and tracking model (e.g., SAM-2, Cutie, and DEVA). These models can utilize pre-stored deep learning-based prior knowledge and can use candidate bounding boxes and corresponding annotation information as prompts. Within each candidate bounding box in the anchor frame, all pixels are traversed to distinguish the main body of the object within the bounding box (i.e., the task-related region within the anchor frame) from other useless pixels. The main body can be determined based on the aforementioned annotation information. Through this method, pixels in the task-related region can be distinguished from irrelevant pixels at the edges within the candidate bounding box, transforming the candidate bounding box, represented by sparse vertex coordinates, into a dense mask region composed only of pixels in the task-related region, thereby achieving semantic segmentation of the anchor frame image at the pixel-level.

[0039] In another embodiment of this disclosure, step S105, "constructing a background image and generating an enhanced image with annotation information based on the background image and the masks in each frame of the target trajectory," can be implemented as follows: Step 1: Build a background description template library containing various materials; Step 2: Call the preset large language model and generate prompt words based on the templates in the template library; Step 3: Use a preset text-to-image model to generate a complete background image based on the prompt words; Step 4: Overlay the foreground region in the frame corresponding to the mask onto the background image to generate an enhanced image.

[0040] In this embodiment, the background description template library can serve as a structured constraint on the large language model, ensuring that the descriptions it generates do not deviate from physical facts. The background description template library in this disclosure may include descriptions of various materials of the background image, such as wood, stone, composite materials, and metal. Optionally, the background description template library may also include descriptions of material textures, such as rough, polished, brushed, cracked, and worn.

[0041] Furthermore, prompts that guide the text-to-image model can be generated using large language models (e.g., ChatGPT or Gemini). The vast corpora on which these models are trained allow simple vocabulary to be expanded into complex descriptions. The generated prompts are then input to a preset text-to-image model (e.g., Stable Diffusion or Midjourney), which generates a complete background image.

[0042] The aforementioned large language model can refer to a computational model pre-trained on massive amounts of text data based on deep learning technology. A text-to-image model can refer to a cross-modal generative model that includes the mapping relationship between natural language and visual representation, capable of synthesizing corresponding images based on the input text description (Prompt).

[0043] In the above methods, the controllability of the generated background images can be guaranteed by the background description template library, that is, the generated background is restricted to the content provided by the background description template library; the randomness of the results can be guaranteed by the large language model and the text-to-image model, that is, each background image can have unique details, thereby solving the long tail problem of training data.
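The division of labor described above (a template library for controllability, generative models for variety) can be illustrated with a minimal sketch. The material and texture words come from the description; the template sentences, the function names, and the use of random sampling in place of a large language model are assumptions made only for illustration. No real language model or text-to-image API is called.

```python
import random

# Hypothetical template library. Materials and textures are taken from the
# description; the template sentences themselves are illustrative only.
MATERIALS = ["wood", "stone", "composite material", "metal"]
TEXTURES = ["rough", "polished", "brushed", "cracked", "worn"]
TEMPLATES = [
    "a {texture} {material} tabletop, top-down view, empty surface",
    "an empty workbench made of {texture} {material}, soft indoor lighting",
]


def build_seed_prompt(rng: random.Random) -> str:
    """Fill one template with a material and a texture drawn from the library."""
    template = rng.choice(TEMPLATES)
    return template.format(
        material=rng.choice(MATERIALS), texture=rng.choice(TEXTURES)
    )


def is_within_library(prompt: str) -> bool:
    """Check that a prompt names at least one material from the library."""
    return any(material in prompt for material in MATERIALS)


rng = random.Random(0)
seed_prompts = [build_seed_prompt(rng) for _ in range(5)]
```

In a full pipeline, each seed prompt would be passed to the large language model for expansion and then to the text-to-image model; a check such as `is_within_library` is one possible way to reject expanded prompts that drift away from the library.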

[0044] For each frame $I_t$ in the trajectory, the foreground region covered by the mask $M_t$ is extracted from the frame. In this disclosure, the foreground region can refer to the part of the image directly related to the robot's operation task. After preprocessing operations such as resolution matching and scale adjustment between the foreground region and the background image $B$, the foreground region can be superimposed onto the background image to obtain the enhanced image $\tilde{I}_t$. This overlay can be achieved by linear interpolation, which adjusts the color values of pixels at the boundary between the foreground and the background so that the transition between them is smooth. In other words, the enhanced image can be written as $\tilde{I}_t = M_t \odot I_t + (1 - M_t) \odot B$, where $\odot$ denotes element-wise multiplication. Furthermore, by replacing the background of each frame in the trajectory with a generated background image, a large number of enhanced images are obtained, which together form the enhanced image set $\mathcal{D}_{aug}$.
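The overlay by linear interpolation can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `composite`, the toy array sizes, and the convention that the mask holds values in [0, 1] (1 for foreground) are not taken from the disclosure.

```python
import numpy as np


def composite(foreground: np.ndarray, background: np.ndarray,
              mask: np.ndarray) -> np.ndarray:
    """Blend foreground over background: mask * fg + (1 - mask) * bg."""
    if foreground.shape != background.shape:
        raise ValueError("resize the background to the frame resolution first")
    alpha = mask.astype(np.float32)[..., None]  # (H, W, 1), values in [0, 1]
    blended = (alpha * foreground.astype(np.float32)
               + (1.0 - alpha) * background.astype(np.float32))
    return np.clip(np.rint(blended), 0, 255).astype(np.uint8)


frame = np.full((4, 4, 3), 200, dtype=np.uint8)          # toy original frame
new_background = np.full((4, 4, 3), 50, dtype=np.uint8)  # toy generated background
mask = np.zeros((4, 4), dtype=np.float32)
mask[1:3, 1:3] = 1.0                                     # task-related region
augmented = composite(frame, new_background, mask)
```

A mask with fractional values near the object boundary (for example, a slightly blurred binary mask) gives the smooth foreground-to-background transition mentioned above, because each boundary pixel becomes a weighted mix of the two images.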

[0045] This disclosure also provides a method for training a policy network which, as shown in Figure 2, can be implemented as follows: S201. Acquire training images from the enhanced image set and extract image features; the enhanced image set is constructed based on an enhanced image generation method described in any of the above embodiments; S202. Using the image features as input to the policy network, and at least using the annotation information corresponding to the image features as ground truth, the policy network is trained; the policy network is used to receive images and/or states observed by the robot and output the robot's control actions.

[0046] In this embodiment, the policy network can be a mathematical function that maps environmental observations to control commands. For the robot's policy network, the inputs can include: visual information (e.g., images, depth maps, or point clouds from the robot's cameras), body state (e.g., the angles, velocities, and positions of the robot's various mechanisms), and sensor state (e.g., friction and pressure captured by the robot's sensors). The internal architecture of the policy network can include a visual encoder for processing visual information, a temporal memory module for processing long-sequence dependencies, and a policy head that maps specific feature vectors to robot control commands. Finally, the policy network can output corresponding robot control commands based on the input information to control the robot's various mechanisms to perform control actions in the operational task.

[0047] In this disclosure, images can be acquired from an augmented image set as training images for training the robot. Then, corresponding image features can be extracted from these training images. These image features can include overall features of the entire image or features of task-related regions. These features can be input into a policy network, which makes predictions based on these input features, compares the ground truth value with the prediction result, and calculates the loss between the two. The network parameters are then modified through backpropagation so that subsequent predictions can more closely approximate the ground truth value.

[0048] In another embodiment provided in this disclosure, the step S201 above, "acquiring training images from the enhanced image set and extracting image features", can be implemented as follows: Step 1: Sample the target augmented image from the augmented image set; Step 2: Determine the mask of the corresponding category on the target enhancement image based on the target annotation information, and generate a mask image including the corresponding task-related region; Step 3: Based on visual coding, extract the full-image features of the target enhancement image and the initial object features of the mask image, respectively; Step 4: Based on the full image features and the initial object features, the mask image is traversed using an attention enhancement algorithm to enhance the initial object features and obtain the enhanced features of the mask image, thus obtaining an enhanced image feature set; Then, step S202 above, "using the image features as input to the policy network," can be implemented as follows: The enhanced image feature set is used as the input to the policy network.

[0049] In this embodiment of the disclosure, during the training process, target augmented images can first be sampled from the augmented image set. In one possible implementation, augmented images in the augmented image set can be sampled randomly. Target annotation information is determined according to the required detection category (e.g., the target object of the operation task). The masks of the task-related regions that carry the target annotation information are determined in the target augmented image, and a mask image is generated. It should be noted that the aforementioned augmented image set can include generated augmented images or the original images of each frame in each trajectory obtained from expert trajectories. Based on the annotation information corresponding to each mask in the augmented image $\tilde{I}$, a mask image containing only the image content at the positions covered by the masks of the target category (that is, an image that includes only part of the task-related regions) can be extracted from the augmented image, for example as $I_c = M_c \odot \tilde{I}$, where $M_c$ denotes the mask whose category in the annotation information is $c$.
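The extraction of a mask image for one target category can be sketched as below. The function name `extract_mask_image`, the label strings, and the toy image are hypothetical; the sketch only assumes that each mask is a boolean array with one category label attached.

```python
import numpy as np


def extract_mask_image(image, masks, labels, target_label):
    """Keep only pixels whose mask carries target_label; zero out the rest."""
    keep = np.zeros(image.shape[:2], dtype=bool)
    for mask, label in zip(masks, labels):
        if label == target_label:
            keep |= mask.astype(bool)
    return image * keep[..., None].astype(image.dtype)


image = np.full((4, 6, 3), 9, dtype=np.uint8)   # toy augmented image
microwave = np.zeros((4, 6), dtype=bool)
microwave[0:2, 0:3] = True
plate = np.zeros((4, 6), dtype=bool)
plate[2:4, 3:6] = True

mask_image = extract_mask_image(
    image, [microwave, plate], ["microwave", "plate"], "microwave"
)
zero_fraction = float((mask_image == 0).mean())  # share of black values
```

Even in this toy case most of the mask image is zero, which is the sparsity that the attention enhancement step is meant to compensate for.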

[0050] Since the mask image consists entirely of black, zero-value regions except for the extracted task-relevant areas, these numerous zero-value regions lead to feature sparsity during feature extraction. Therefore, an attention enhancement algorithm is needed to enhance the features of the mask image. Specifically, feature extraction is generally performed through convolution operations. Because the values of black regions are zero, these regions contribute no features during convolution. Consequently, most dimensions of the mask image features are zero and only a few dimensions carry information, resulting in a sparse feature distribution. Furthermore, the representation of task-relevant region features depends not only on their own pixel values but also on the context of the entire image. Numerous zero-value regions disrupt this context, so the features of task-relevant regions are not reflected effectively. Moreover, the lack of information in the zero-value regions dilutes the effective features. Therefore, it is necessary to enhance the initial object features of the mask image by incorporating features from the entire image.

[0051] In this disclosure, the full-image features of the complete enhanced image, represented in vector form, can be extracted first. This process can employ a visual encoder $E$ with shared weights. The visual encoder processes the full enhanced image $\tilde{I}$ and the mask image $I_c$ separately. Shared weights here means that the visual encoder uses the same parameters when processing both the full image and the mask image. After processing by this visual encoder, the full-image features $F_g = E(\tilde{I})$ and the initial object features $F_o = E(I_c)$ are obtained.

[0052] Based on the aforementioned full-image features and initial object features, the features of the mask image can be enhanced using an attention enhancement algorithm. The core of this algorithm is to use a learnable self-attention module $\mathrm{SA}(\cdot)$ to compute a spatial weight map $W$, which is then used to obtain the enhanced features $\hat{F}_o$ of the mask image, for example in the form $W = \sigma(\mathrm{SA}(F_g))$, $\hat{F}_o = W \odot F_o$, where $F_g$ denotes the full-image features and $F_o$ the initial object features.

[0053] In the above formula, the self-attention module $\mathrm{SA}(\cdot)$ computes the internal relationships within the full-image features $F_g$ and captures the long-distance dependencies between elements of the image. On this basis, the autocorrelation mechanism of $\mathrm{SA}(\cdot)$ strengthens features that are consistent across the entire image and suppresses isolated noise. $\sigma$ is the activation function, used to compress all values into the interval (0,1). Furthermore, the weight map $W$ is applied to the local object features $F_o$, so that the object features are enhanced more precisely, yielding $\hat{F}_o$.
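A minimal NumPy sketch of one way such a spatial weight map could be computed is given below. Everything specific here is an assumption for illustration: the single-head attention layout, the projection matrices `w_q`, `w_k`, `w_v`, `w_out`, the sigmoid as the activation function, and the plain element-wise product used to apply the weights. The disclosure only states that a learnable self-attention module over the full-image features and an activation with range (0,1) are used.

```python
import numpy as np


def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def spatial_weight_map(f_global, w_q, w_k, w_v, w_out):
    """Self-attention over the N spatial tokens of f_global (N, C), then a sigmoid."""
    q, k, v = f_global @ w_q, f_global @ w_k, f_global @ w_v
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # (N, N) token-to-token weights
    context = attn @ v                               # (N, C) globally mixed features
    logits = context @ w_out                         # (N, 1) one score per location
    return 1.0 / (1.0 + np.exp(-logits))             # values in (0, 1)


def enhance_object_features(f_global, f_object, params):
    """Apply the weight map computed from full-image features to object features."""
    weights = spatial_weight_map(f_global, *params)  # (N, 1)
    return weights * f_object                        # broadcast over channels


rng = np.random.default_rng(0)
n_tokens, channels = 16, 8
f_global = rng.normal(size=(n_tokens, channels))   # stand-in full-image features
f_object = rng.normal(size=(n_tokens, channels))   # stand-in initial object features
params = (
    0.1 * rng.normal(size=(channels, channels)),   # w_q (hypothetical)
    0.1 * rng.normal(size=(channels, channels)),   # w_k (hypothetical)
    0.1 * rng.normal(size=(channels, channels)),   # w_v (hypothetical)
    0.1 * rng.normal(size=(channels, 1)),          # w_out (hypothetical)
)
weights = spatial_weight_map(f_global, *params)
enhanced = enhance_object_features(f_global, f_object, params)
```

In training, the projection matrices would be learned together with the policy network; here they are random so that the sketch runs on its own.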

[0054] It should be noted that the above process can process each image in the augmented image set, obtaining augmentation features for mask images of different target categories with different labeling information in each augmented image. These augmentation features can then be aggregated into an augmented image feature set as input to the policy network. In subsequent processes, each mask image in this augmented image feature set can be used as a training image for the subsequent training process.

[0055] In another embodiment provided in this disclosure, after "obtaining the enhanced image feature set" in step 4 above, the following step is also included: Based on the annotation information of the mask images, the enhanced features of mask images with the same semantic features are clustered into positive sample pairs, and the enhanced features of mask images with different semantic features are clustered into negative sample pairs. The step S202 above, "using the image features as input to the policy network," includes: The positive sample pairs and the negative sample pairs are used as inputs to the policy network.

[0056] In this embodiment of the disclosure, for the obtained enhanced image feature set, a certain number of samples (i.e., mask images and corresponding enhanced features) can be sampled from it as training batches for training the robot policy network. In this training batch, positive and negative sample pairs can be constructed based on the corresponding annotation information of each mask image. A positive sample pair can consist of two different mask images with the same semantic annotation information. A negative sample pair consists of a pair of mask images with different semantic annotation information. In one possible implementation, this construction process can be achieved using matrix masks. By batch calculating the pairwise similarity between all samples, samples with consistency are constructed into a matrix, from which positive and negative samples can be extracted.
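The matrix-mask construction of positive and negative pairs can be sketched as follows, assuming each sample in the batch carries a single category label. The function name `pair_masks` and the label strings are hypothetical; label equality stands in for the consistency check mentioned above.

```python
import numpy as np


def pair_masks(labels):
    """Return boolean (N, N) matrices marking positive and negative pairs."""
    labels = np.asarray(labels)
    same = labels[:, None] == labels[None, :]      # pairwise label agreement
    not_self = ~np.eye(len(labels), dtype=bool)    # a sample is not paired with itself
    positive = same & not_self
    negative = ~same
    return positive, negative


labels = ["microwave", "plate", "microwave", "gripper"]
positive, negative = pair_masks(labels)
```

Entry `(i, j)` of `positive` is true when samples `i` and `j` are two different mask images with the same semantic label, and entry `(i, j)` of `negative` is true when their labels differ.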

[0057] In another embodiment provided in this disclosure, the loss is determined in the following manner during the training of the policy network: Step 1: Calculate the image contrast loss based on the recognition results of the training images and the semantics of the corresponding training images; Step 2: Determine the loss of the policy network based on the contrast loss and the behavior cloning loss.

[0058] In this embodiment of the disclosure, during the training of the robot's policy network using positive and negative sample inputs, the policy network can be trained to recognize training images with similar semantics using positive sample pairs to identify common features of similar objects; and trained to recognize training images with different semantics using negative sample pairs to identify differences between different objects. This training process can be achieved by calculating the image contrast loss function between different samples, thereby adjusting the parameters of the policy network.

[0059] Specifically, based on the behavior cloning loss of the existing robot policy network, the total loss function can be obtained by weighted summing of the behavior cloning loss and the contrast loss. This approach ensures accurate imitation of expert actions through behavior cloning, guarantees robust and discriminative features through region contrast, and balances replication accuracy and generalization ability through multi-objective joint optimization. The cloning loss can be determined based on the comparison results between the robot's action trajectory and the corresponding expert trajectory, which will not be elaborated upon here.
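As a sketch of the weighted sum described above, the total loss could be written as follows. The symbol names and the use of a single weight $\lambda$ on the contrast term are assumptions; the disclosure only states that the two losses are combined by weighted summation.

```latex
\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{BC}} + \lambda \, \mathcal{L}_{\text{con}}
```

Here $\mathcal{L}_{\text{BC}}$ stands for the behavior cloning loss, $\mathcal{L}_{\text{con}}$ for the image contrast loss, and $\lambda > 0$ for a hypothetical weighting coefficient that trades imitation accuracy against feature discriminability.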

[0060] The image contrast loss can be obtained using the following formula: $\mathcal{L}_{con} = \sum_{i \in \mathcal{B}} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}$.

[0061] Here, $\mathcal{L}_{con}$ is the image contrast loss; $P(i)$ is the set of indices of the positive samples of sample $i$ in the training batch $\mathcal{B}$; $A(i)$ is the set of all samples in the training batch other than $i$; $\tau$ is a temperature parameter used to control the degree of attention the model pays to the samples; $z_i$ is the enhanced feature of the $i$-th sample; $z_p$ is the enhanced feature of a sample that forms a positive sample pair with the $i$-th sample; $z_a$ is the enhanced feature of a sample other than the $i$-th sample; and $z_i \cdot z_p$ and $z_i \cdot z_a$ are the corresponding similarities. In essence, the formula calculates the similarity between each sample and the other samples in a training batch, and maximizes the proportion that the similarity to its positive samples takes in the total similarity between that sample and all other samples.
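The loss described above matches the general shape of a supervised contrastive loss, which can be sketched in NumPy as follows. This is an illustrative version under assumptions: features are L2-normalized before the dot product, the default temperature of 0.1 is arbitrary, and the result is averaged over anchors that have at least one positive rather than summed.

```python
import numpy as np


def supervised_contrastive_loss(features, labels, temperature=0.1):
    """Supervised contrastive loss over a batch of (N, D) features."""
    z = features / np.linalg.norm(features, axis=1, keepdims=True)
    labels = np.asarray(labels)
    n = len(labels)
    not_self = ~np.eye(n, dtype=bool)
    positive = (labels[:, None] == labels[None, :]) & not_self

    logits = z @ z.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    exp_logits = np.exp(logits) * not_self               # exclude the anchor itself
    log_prob = logits - np.log(exp_logits.sum(axis=1, keepdims=True))

    n_pos = positive.sum(axis=1)
    has_pos = n_pos > 0                                   # skip anchors with no positive
    per_anchor = -(log_prob * positive).sum(axis=1)[has_pos] / n_pos[has_pos]
    return float(per_anchor.mean())


features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
loss_aligned = supervised_contrastive_loss(features, [0, 0, 1, 1])
loss_mixed = supervised_contrastive_loss(features, [0, 1, 0, 1])
```

When the labels agree with the geometry of the features (`loss_aligned`), the loss is small; when samples with the same label lie far apart (`loss_mixed`), the loss is large, which is the signal that pulls same-category features together during training.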

[0062] Adding contrastive loss to the total loss function can force the robot's visual encoder to bring similar object features closer together and push away background and dissimilar object features in the feature space, thereby giving the robot's policy network the ability to lock onto key targets under complex interference.

[0063] The following is an illustrative example. Figure 3 is a schematic diagram illustrating the structure of a combined enhanced image generation system and policy network training system according to an embodiment of this disclosure. The system shown in Figure 3 provides a robot operation generalization method combining region contrast representation and data augmentation. Its execution flow mainly includes the following steps: single-sample reference frame annotation, few-sample region matching based on anchor frames, temporal propagation of semantic masks, generative background construction, image synthesis, object-level feature extraction, spatial self-attention enhancement, and region contrast loss calculation. The specific implementation scheme can be divided into the following two main parts: Part 1: Task-Relevant Region Extraction and Data Augmentation. This part aims to extract high-quality pixel-level masks from expert trajectories and generate diverse training samples. The specific steps are as follows: Step 101: Single-sample reference frame annotation.

[0064] For a given robot operation task, the first frame of the expert trajectory is selected as the reference frame $I_{ref}$.

[0065] The image of the reference frame can be as shown in Figure 4. The task-related area in this frame may include the robotic arms on the left and right sides, the microwave oven, and the plate and the food on the plate.

[0066] In this frame $I_{ref}$, bounding boxes for a set of task-related regions (such as the manipulated object or the robotic arm gripper) are annotated manually in a single pass.

[0067] Reference feature embeddings for these regions are extracted using a visual foundation model (e.g., DINOv2).

[0068] Step 102: Few-sample region matching based on anchor frames.

[0069] The first frame of each target trajectory is used as the anchor frame. Candidate bounding boxes are generated in the anchor frame using an open-set detector (e.g., GroundingDINO).

[0070] The feature embedding $f_j$ of each candidate box is extracted, and its cosine similarity $s_{jk} = \frac{f_j \cdot r_k}{\|f_j\| \, \|r_k\|}$ with each reference feature embedding $r_k$ is calculated.

[0071] The category of each candidate box is determined from these similarities, for example as $c_j = \arg\max_k s_{jk}$, that is, the category of the most similar reference region. This enables cross-trajectory region alignment without training.
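The region matching in step 102 can be sketched as a nearest-reference lookup under cosine similarity. The function name `assign_labels`, the toy three-dimensional embeddings, and the threshold value of 0.5 are assumptions; the threshold reflects the embodiment in which a label is accepted only when the similarity exceeds a preset value. Real embeddings would come from the visual model, which is not called here.

```python
import numpy as np


def assign_labels(candidate_emb, reference_emb, reference_labels, threshold=0.5):
    """Label each candidate box with its most similar reference region, if any."""
    c = candidate_emb / np.linalg.norm(candidate_emb, axis=1, keepdims=True)
    r = reference_emb / np.linalg.norm(reference_emb, axis=1, keepdims=True)
    sim = c @ r.T                                  # (num_candidates, num_references)
    best = sim.argmax(axis=1)                      # most similar reference per candidate
    best_sim = sim[np.arange(len(c)), best]
    return [reference_labels[k] if s > threshold else None
            for k, s in zip(best, best_sim)]


reference_emb = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
reference_labels = ["gripper", "microwave"]
candidate_emb = np.array([[0.9, 0.1, 0.0], [0.1, 0.8, 0.1], [0.0, 0.0, 1.0]])

assigned = assign_labels(candidate_emb, reference_emb, reference_labels)
```

Candidates that resemble no reference region receive `None` and can be discarded before the boxes are used as prompts for mask generation.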

[0072] Step 103: Temporal propagation of semantic masks.

[0073] Using a segmentation and tracking model (such as SAM-2), and with the anchor frame bounding boxes determined in step 102 as prompts, the sparse bounding boxes are converted into dense pixel-level masks.

[0074] For the frame shown in Figure 4, the pixel-level mask obtained by this conversion can be as shown in Figure 5, in which the task-related area is covered by the mask.

[0075] The mask is propagated bidirectionally in time to the other frames of the trajectory to ensure that the entire dataset obtains spatiotemporally consistent semantic segmentation results.

[0076] Step 104: Generative background construction.

[0077] Build a background description template library containing various materials (such as wood, stone, composite materials, etc.).

[0078] Generate rich background description prompts based on templates using large language models (such as ChatGPT).

[0079] Generate a complete background image from the prompt words using a text-to-image model (such as Stable Diffusion v3), rather than merely filling in the missing areas.

[0080] Step 105: Image synthesis.

[0081] Using the mask $M$ obtained in step 103, the foreground region of the original image $I$ is superimposed by linear interpolation onto the generated background image $B$ to produce the enhanced image $\tilde{I}$: $\tilde{I} = M \odot I + (1 - M) \odot B$, where $\odot$ denotes element-wise multiplication. This step generates the final augmented dataset $\mathcal{D}_{aug}$.

[0082] Based on Figure 4 and Figure 5 above, the resulting enhanced image can be as shown in Figure 6, where the background has been composited to resemble a quartz countertop.

[0083] Part Two: Region Contrast Policy Learning. This part introduces a contrastive learning objective during policy training to optimize feature representation. The specific steps are as follows: Step 201: Object-level feature extraction.

[0084] Sample an image $\tilde{I}$ and its corresponding masks from the augmented dataset $\mathcal{D}_{aug}$.

[0085] Generate a mask image $I_c$ containing only objects of a specific category $c$.

[0086] The mask image can be as shown in Figure 7, which includes only the microwave oven portion.

[0087] Use a visual encoder $E$ with shared weights to process the full image $\tilde{I}$ and the mask image $I_c$ separately, obtaining the full-image features $F_g$ and the initial object features $F_o$.

[0088] Step 202: Enhance spatial self-attention.

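[0089] To alleviate the feature sparsity problem caused by the large number of zero-value (black) regions in the mask image $I_c$, the full-image features $F_g$ are used to enhance the object features.

[0090] A spatial weight map $W$ is calculated and applied to the object features, for example as $W = \sigma(\mathrm{SA}(F_g))$; $\hat{F}_o = W \odot F_o$, where $\mathrm{SA}(\cdot)$ is the learnable self-attention module, $\sigma$ is the activation function that maps values into (0,1), and $F_o$ denotes the initial object features.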

[0091] Step 203: Calculate the regional comparison loss.

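[0092] Within a training batch $\mathcal{B}$, positive sample pairs (features belonging to the same semantic category) and negative sample pairs (features from different categories) are constructed.

[0093] The supervised contrastive loss $\mathcal{L}_{con}$ is calculated as an auxiliary objective function: $\mathcal{L}_{con} = \sum_{i \in \mathcal{B}} \frac{-1}{|P(i)|} \sum_{p \in P(i)} \log \frac{\exp(z_i \cdot z_p / \tau)}{\sum_{a \in A(i)} \exp(z_i \cdot z_a / \tau)}$; where $P(i)$ is the set of positive sample indices for sample $i$, $A(i)$ is the set of all samples other than $i$, $z$ denotes the enhanced features, and $\tau$ is the temperature parameter. The final total loss function is a weighted sum of the behavior cloning loss and the contrastive loss.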

[0094] Compared with the prior art, the present invention has the following significant advantages: Extremely low manual annotation cost: Compared to fully supervised learning or methods that require retraining the detector, this invention can complete the automatic annotation and enhancement of the entire dataset with only the bounding box annotation of a single reference frame.

[0095] Robust semantic extraction: By introducing one-shot region matching and temporal propagation mechanisms, the problem of detection failure in complex reflection and occlusion scenarios of existing technologies (such as Grounding DINO direct inference) is effectively solved, ensuring the semantic integrity of training data.

[0096] Superior generalization ability: Application scenario tests on various robot platforms such as Tien Kung 2.0, AgileX and UR-5e show that the success rate of the present invention is significantly better than that of the prior art when dealing with unfamiliar backgrounds, lighting changes and interference.

[0097] Enhanced feature interpretability: Through region contrast strategy learning, Grad-CAM visualization results show that the model can focus on the operation object more accurately, significantly suppressing the interference of background noise and solving the problem of the model "misreading the position" in traditional enhancement methods.

[0098] Currently, there is no single alternative that can simultaneously solve the problems of low-cost annotation, robust semantic extraction, and strong generalization ability.

[0099] Simulator-based data generation (Sim-to-Real) methods, while capable of generating unlimited data, suffer from a severe "Sim-to-Real Gap" and have extremely high costs for high-fidelity asset modeling, making them unable to fully replace the visual diversity of the real world.

[0100] End-to-end fine-tuning of large vision-language-action (VLA) models: although such models have a certain degree of generalization, fine-tuning can easily lead to catastrophic forgetting or overfitting in specific downstream tasks with extremely limited data, and it consumes huge computational resources.

[0101] Therefore, the lightweight framework proposed in this invention, which combines "data augmentation + contrastive learning" in one approach, is the optimal solution to overcome the above-mentioned shortcomings.

[0102] This disclosure also provides an enhanced image generation system which, as shown in Figure 8, includes: an acquisition module 801, used to acquire reference frames in the expert trajectory corresponding to the robot operation task, wherein bounding boxes are pre-labeled for the task-related regions in the reference frames; a boundary generation module 802, used to determine anchor frames from the target trajectory and generate candidate bounding boxes for the anchor frames using a preset detector; a determining module 803, used to determine the annotation information of the candidate bounding box based on the similarity between the candidate bounding box region and the annotated bounding box region in the reference frame; a propagation module 804, used to convert the candidate bounding box into a pixel-level mask and propagate the mask in time to other frames of the target trajectory to complete the annotation of the corresponding bounding box in the other frames; and an image generation module 805, used to construct a background image and generate an enhanced image with annotation information based on the background image and the mask in each frame of the target trajectory.

[0103] In another embodiment provided in this disclosure, the determining module 803 is used to extract a first feature embedding of the task-related region using a preset visual model; extract a second feature embedding of the candidate bounding box region; determine the similarity between the second feature embedding and the first feature embedding; and for each second feature embedding, determine the annotation information corresponding to the first feature embedding whose similarity to the second feature embedding is greater than a preset threshold as the annotation information of the candidate bounding box corresponding to the second feature embedding.

[0104] In another embodiment provided in this disclosure, the propagation module 804 is used to call a preset large model, use the candidate bounding boxes of the anchor frame as prompt information, and generate a pixel-level mask for the candidate bounding boxes to cover the task-related areas within the anchor frame.

[0105] In another embodiment provided in this disclosure, the image generation module 805 is used to construct a background description template library containing multiple materials; call a preset large language model to generate prompt words based on the templates in the template library; call a preset text-to-image model to generate a complete background image according to the prompt words; and overlay the foreground region in the frame corresponding to the mask onto the background image to generate an enhanced image.

[0106] This disclosure also provides a training system for a policy network which, as shown in Figure 9, includes: an extraction module 901, used to acquire training images from the enhanced image set and extract image features, the enhanced image set being constructed based on an enhanced image generation method described in any of the above embodiments; and a training module 902, used to train the policy network by using the image features as input and at least the annotation information corresponding to the image features as ground truth; the policy network is used to receive images and/or states observed by the robot and output the robot's control actions.

[0107] In another embodiment provided in this disclosure, the extraction module 901 is used to sample target enhanced images from the enhanced image set; determine the mask of the corresponding category on the target enhanced image according to the target annotation information, and generate a mask image including the corresponding task-related region; extract the full-image features of the target enhanced image and the initial object features of the mask image based on visual encoding; and traverse the mask image through the full-image features and the initial object features using an attention enhancement algorithm to enhance the initial object features to obtain the enhanced features of the mask image, thereby obtaining an enhanced image feature set. Training module 902 is used to take the enhanced image feature set as input to the policy network.

[0108] In another embodiment provided in this disclosure, the extraction module 901 is further configured to cluster mask image enhancement features with the same semantic features into positive sample pairs and mask image enhancement features with different semantic features into negative sample pairs based on the annotation information of the mask image. Training module 902 is used to take the positive sample pairs and the negative sample pairs as input to the policy network.

[0109] In another embodiment provided in this disclosure, it further includes: a loss determination module 903, configured to calculate image contrast loss based on the recognition result of the training image and the semantics of the corresponding training image; and determine the loss of the policy network based on the contrast loss and the behavior cloning loss.

[0110] This disclosure also provides an electronic device, including: a processor, a memory, and a bus; The memory stores machine-readable instructions that can be executed by the processor; When the electronic device is running, the processor and the memory communicate via a bus; When the machine-readable instructions are executed by the processor, they perform the steps of an enhanced image generation method as described in any of the above embodiments; and / or perform the steps of a policy network training method as described in any of the above embodiments.

[0111] Through the above description of the embodiments, those skilled in the art can clearly understand that the embodiments of this disclosure can be implemented in hardware or by means of software plus necessary general-purpose hardware platforms. Based on this understanding, the technical solutions of the embodiments of this disclosure can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (such as a CD-ROM, USB flash drive, mobile hard drive, etc.) and includes several instructions to cause a computer device (such as a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of this disclosure.

[0112] Those skilled in the art will understand that the accompanying drawings are merely schematic diagrams of a preferred embodiment, and the modules or processes in the drawings are not necessarily essential for implementing this disclosure.

[0113] Those skilled in the art will understand that the modules in the apparatus of the embodiments can be distributed in the apparatus of the embodiments as described in the embodiments, or they can be located in one or more devices different from this embodiment with corresponding changes. The modules of the above embodiments can be combined into one module, or they can be further divided into multiple sub-modules.

[0114] The sequence numbers of the embodiments disclosed above are for descriptive purposes only and do not represent the superiority or inferiority of the embodiments.

[0115] Obviously, those skilled in the art can make various modifications and variations to this disclosure without departing from its spirit and scope. Therefore, if such modifications and variations fall within the scope of the claims of this disclosure and their equivalents, this disclosure is also intended to include such modifications and variations.

Claims

1. A method for generating enhanced images, characterized in that it comprises: obtaining reference frames from the expert trajectory corresponding to the robot operation task, wherein the task-related regions in the reference frames are pre-labeled with bounding boxes; determining anchor frames from the target trajectory, and generating candidate bounding boxes for the anchor frames using a preset detector; determining the annotation information of the candidate bounding boxes based on the similarity between the candidate bounding box regions and the annotated bounding box regions in the reference frames; converting the candidate bounding boxes into pixel-level masks, and propagating the masks temporally to other frames of the target trajectory to complete the annotation of the corresponding bounding boxes in the other frames; and constructing a background image, and generating an enhanced image with annotation information based on the background image and the mask in each frame of the target trajectory.

2. The method as described in claim 1, characterized in that determining the annotation information of the candidate bounding box based on the similarity between the candidate bounding box region and the annotated bounding box region in the reference frame includes: extracting a first feature embedding of the task-related region using a preset visual model; extracting a second feature embedding of the candidate bounding box region; determining the similarity between the second feature embedding and the first feature embedding; and, for each second feature embedding, determining the annotation information corresponding to the first feature embedding whose similarity is greater than a preset threshold as the annotation information of the candidate bounding box corresponding to the second feature embedding.

3. The method as described in claim 1, characterized in that converting the candidate bounding boxes into pixel-level masks includes: invoking a preset large model, and using the candidate bounding boxes of the anchor frame as prompt information to generate pixel-level masks for the candidate bounding boxes that cover the task-related areas within the anchor frame.

4. The method as described in claim 1, characterized in that constructing a background image, and generating an enhanced image with annotation information based on the background image and the masks in each frame of the target trajectory, includes: building a background description template library containing various materials; invoking a preset large language model to generate prompt words based on templates in the template library; invoking a preset text-to-image model to generate a complete background image based on the prompt words; and superimposing the foreground region in the frame corresponding to the mask onto the background image to generate an enhanced image.

5. A method for training a policy network, characterized in that the method comprises: acquiring training images from an enhanced image set and extracting image features, wherein the enhanced image set is constructed by the method for generating enhanced images according to any one of claims 1-4; and training the policy network using the image features as input and at least the annotation information corresponding to the image features as ground truth, wherein the policy network is used to receive images and/or states observed by the robot and to output control actions of the robot.

6. The method as described in claim 5, characterized in that acquiring training images from the enhanced image set and extracting image features comprises: sampling a target enhanced image from the enhanced image set; based on target annotation information, determining the mask of the corresponding category on the target enhanced image, and generating a mask image that includes the corresponding task-related region; based on visual encoding, extracting full-image features of the target enhanced image and initial object features of the mask image, respectively; and, based on the full-image features and the initial object features, traversing the mask image using an attention enhancement algorithm to enhance the initial object features, thereby obtaining enhanced features of the mask image and an enhanced image feature set; and using the image features as input to the policy network comprises: using the enhanced image feature set as the input to the policy network.

7. The method as described in claim 6, characterized in that, after the enhanced image feature set is obtained, the method further comprises: based on the annotation information of the mask images, clustering the enhanced features of mask images having the same semantic features into positive sample pairs, and clustering the enhanced features of mask images having different semantic features into negative sample pairs; and using the image features as input to the policy network comprises: using the positive sample pairs and the negative sample pairs as inputs to the policy network.
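
The pairing rule of claim 7 can be sketched as follows, assuming each mask image carries a single annotation label and that "same semantic features" means equal labels. The function name `build_pairs` and the index-based representation of pairs are assumptions of this sketch, not part of the claim.

```python
from itertools import combinations


def build_pairs(labels):
    """Split all index pairs into positive and negative sample pairs.

    labels[i] is the annotation of the i-th mask image. Two mask
    images with equal labels form a positive pair; otherwise they
    form a negative pair.
    """
    positives, negatives = [], []
    for i, j in combinations(range(len(labels)), 2):
        if labels[i] == labels[j]:
            positives.append((i, j))
        else:
            negatives.append((i, j))
    return positives, negatives


positives, negatives = build_pairs(["cup", "cup", "drawer"])
```

The returned indices would then be used to look up the corresponding enhanced features.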

8. The method as described in claim 5, characterized in that, during training of the policy network, the loss is determined as follows: calculating an image contrastive loss based on the recognition results of the training images and the semantics of the corresponding training images; and determining the loss of the policy network based on the contrastive loss and a behavior cloning loss.
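
Claim 8 does not state how the two losses are combined or which forms they take. The sketch below assumes a weighted sum, a mean-squared-error behavior cloning loss, and a margin-based pairwise contrastive loss; the weight 0.5, the margin 1.0, and all function names are hypothetical.

```python
def behavior_cloning_loss(predicted, expert):
    """Mean squared error between predicted and expert actions (assumed form)."""
    if len(predicted) != len(expert) or not predicted:
        raise ValueError("action vectors must be non-empty and equal in length")
    return sum((p - e) ** 2 for p, e in zip(predicted, expert)) / len(predicted)


def pair_contrastive_loss(distance, is_positive, margin=1.0):
    """Margin-based contrastive loss for one feature pair (assumed form).

    Positive pairs are pulled together; negative pairs are pushed
    apart until their distance reaches the margin.
    """
    if is_positive:
        return distance ** 2
    return max(0.0, margin - distance) ** 2


def total_loss(bc_loss, contrastive_loss, weight=0.5):
    """Weighted sum of the two terms (the weighting is an assumption)."""
    return bc_loss + weight * contrastive_loss


bc = behavior_cloning_loss([1.0, 2.0], [1.0, 4.0])
con = pair_contrastive_loss(0.5, is_positive=True)
loss = total_loss(bc, con)
```

Other contrastive formulations would fit the claim equally well; only the combination of a contrastive term with a behavior cloning term is taken from the text.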

9. A system for generating enhanced images, characterized in that the system comprises: an acquisition module, used to acquire reference frames from an expert trajectory corresponding to a robot operation task, wherein task-related regions in the reference frames are pre-labeled with bounding boxes; a boundary generation module, used to determine anchor frames from a target trajectory and generate candidate bounding boxes for the anchor frames using a preset detector; a determination module, used to determine annotation information of each candidate bounding box based on the similarity between the candidate bounding box region and the labeled bounding box region in the reference frame; a propagation module, used to convert the candidate bounding boxes into pixel-level masks and propagate the masks temporally to the other frames of the target trajectory to complete the annotation of the corresponding bounding boxes in those frames; and an image generation module, used to construct a background image and generate an enhanced image with annotation information based on the background image and the mask in each frame of the target trajectory.

10. A training system for a policy network, characterized in that the system comprises: an extraction module, used to acquire training images from an enhanced image set and extract image features, wherein the enhanced image set is constructed by the method for generating enhanced images according to any one of claims 1-4; and a training module, used to train the policy network using the image features as input and at least the annotation information corresponding to the image features as ground truth, wherein the policy network is used to receive images and/or states observed by the robot and to output control actions of the robot.

11. An electronic device, characterized in that the device comprises a processor, a memory, and a bus; the memory stores machine-readable instructions executable by the processor; when the electronic device is running, the processor and the memory communicate via the bus; and when the machine-readable instructions are executed by the processor, they perform the steps of the method for generating enhanced images as claimed in any one of claims 1-4, and/or the steps of the method for training a policy network as claimed in any one of claims 5-8.