Sample generation method, system, device and storage medium based on projection transformation
By automatically generating high-quality new scene container image data through projection transformation technology, the problem of high cost in generating training data in existing technologies is solved, and the generalization ability and deployment efficiency of segmentation models are improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SHANGHAI WESTWELL INFORMATION & TECH CO LTD
- Filing Date
- 2026-03-05
- Publication Date
- 2026-06-19
AI Technical Summary
Existing technologies struggle to generate high-quality, scene-matched container image data quickly and cost-effectively in industrial settings, resulting in insufficient generalization ability of segmentation models in new scenarios and failing to meet business needs.
Labeled old scene images and unlabeled new scene images are collected, geometric mapping relationships between them are established through projection transformation, and high-quality new scene training data is generated automatically. The pipeline comprises image segmentation, projection transformation, and visual fusion, and produces pixel-level annotations without manual labeling.
It achieves high-quality and diverse training data generation with zero manual annotation cost, significantly improves the generalization ability and deployment efficiency of segmentation models in new scenarios, and solves the problems of high cost and low efficiency in traditional methods.
Figure CN122243724A_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of AI model training, and more specifically to a method, system, device, and storage medium for generating samples based on projection transformation.
Background Technology
[0002] In modern logistics, port automation, and smart customs, containers serve as the core carrier of global cargo transportation, making the inspection of their appearance and integrity crucial. Computer vision-based automated container defect inspection systems typically rely on a high-performance semantic segmentation model to accurately segment various surface areas of the container (such as doors, side panels, and top panels) from the input image or video stream, thereby identifying and locating defects such as scratches, dents, rust, and damage.
[0003] The performance of these data-driven deep learning models depends heavily on the scale and quality of the training data and on how well that data matches the deployment scenario. However, in actual industrial deployments, these models often face the severe challenge of "scenario fragility." When the detection environment changes (for example, changes in the camera's installation height, angle, or focal length; significant differences in lighting conditions due to weather, day / night, or season; differences in the container's model, color, condition, or texture; or changes in the background environment from a storage yard to a dock or railway freight yard), the performance of segmentation models trained on older scene data can plummet. This phenomenon is essentially due to "domain differences" or "distribution shifts" between the training and testing data.
[0004] To address the adaptability of models in new scenarios, the traditional and direct approach is to send personnel to the new deployment site to re-collect tens of thousands or even hundreds of thousands of images and hire a professional annotation team to perform detailed, pixel-level semantic segmentation annotations. This process is not only extremely costly and time-consuming (usually taking weeks to months), but also makes it difficult to maintain consistent annotation quality, severely restricting the model's rapid iteration and agile deployment capabilities, and failing to meet the timeliness requirements of business development.
[0005] Another common approach is to employ data augmentation strategies, such as randomly rotating, translating, scaling, mirroring, adjusting brightness and contrast, and adding noise to existing training data. While these methods are simple to implement and can improve the robustness of the model to some extent, their limitations are also quite obvious: they mainly perturb at the surface pixel level of the image and cannot simulate the real geometric changes of containers under different 3D perspectives and perspective distortions, nor can they generate images that blend naturally with a completely new background environment in terms of lighting and color. Therefore, traditional data augmentation has little effect on improving the model's ability to cope with complex geometric deformations and domain shifts, and the generated images may appear unrealistic, making them unsuitable for training high-precision industrial-grade segmentation models.
[0006] In recent years, generative adversarial networks (GANs) and other generative models have shown potential in the field of data synthesis. However, directly applying them to scenarios such as industrial defect detection, which have extremely high requirements for realism, controllability, and reliability, still faces many challenges. The generation process of GANs is random and uncontrollable, making it difficult to accurately specify the geometric structure, pose, size, and precise location of the generated target in the image. At the same time, when training data is scarce (i.e., small sample size), GANs are prone to mode collapse, resulting in poor image diversity or artifacts. More importantly, even if realistic images are generated, how to automatically obtain their corresponding accurate pixel-level annotation information is another independent problem.
[0007] Therefore, those skilled in the art have long faced a sharp contradiction: on the one hand, powerful segmentation models crave massive amounts of high-quality, scene-matched labeled data; on the other hand, traditional methods for acquiring such data are costly and inefficient, while existing technological alternatives have significant shortcomings. This contradiction has become a core bottleneck hindering the large-scale and rapid deployment of industrial vision systems such as automated container defect inspection.
[0008] In view of this, the purpose of the present invention is to provide a sample generation method, system, device and storage medium based on projection transformation to solve the above problems.
Summary of the Invention
[0009] To address the problems in the prior art, the present invention aims to provide a sample generation method, system, device, and storage medium based on projection transformation, which overcomes the difficulties of the prior art, enables the generation of high-quality and diverse training data with zero manual annotation cost, and significantly improves the generalization ability and deployment efficiency of segmentation models in new scenarios.
[0010] Embodiments of the present invention provide a sample generation method based on projection transformation, comprising the following steps:
S110. Collect labeled first images containing containers and unlabeled second images containing containers, wherein each first image has a pixel-level label mask;
S120. Perform image segmentation on each second image containing a container to identify the second reference container face pixel region in that second image;
S130. Establish a first pixel set based on the first reference container face pixel region of each first image and a second pixel set based on the second reference container face pixel region of each second image; projectively transform the first reference container face pixel region of each first image onto the second reference container face pixel region of each second image to establish a projection transformation relationship for each pair; and filter the projection transformation relationships by the intersection over union (IoU) of the pixel coordinate set obtained after the projection transformation and the second pixel set;
S140. According to each projection transformation relationship that passes the filter, map the first reference container face pixel region of the first image onto the corresponding second reference container face pixel region of the second image, and project the pixel-level label mask of the first image onto the mapped second image to generate a composite image.
[0011] Preferably, step S110 includes: S111. Collect m first images containing containers and pixel-level label masks to establish a first image set; S112. Collect n unlabeled second images containing containers to form a second image set, where n is less than m.
[0012] Preferably, the first image includes at least one first container face, which is either a side face or a top face of the container, and the first container face that occupies the most pixels in the first image is defined as the first reference container face. The second image includes at least one second container face, which is either a side face or a top face of the container, and the second container face that occupies the most pixels in the second image is defined as the second reference container face.
[0013] Preferably, step S120 includes:
S121. For each second image containing a container, segment the pixels corresponding to the second reference container face using a general image segmentation model, and remove the background pixels;
S122. Perform a morphological closing operation on the segmented second reference container face region to fill any voids and gaps inside it, obtaining a third image;
S123. Compute a distance transform map of the third image, in which the value of each pixel is the distance from that pixel to the nearest background pixel;
S124. Compute a binarization threshold for the distance transform map using the maximum inter-class variance (Otsu) algorithm, and binarize the map to remove noise points;
S125. Perform connected component analysis on the binarized distance transform map and select the connected region with the largest area as the second reference container face pixel region of the second image.
[0014] Preferably, step S130 includes:
S131. Establish a first pixel set based on the first reference container face pixel regions of the m first images, and establish a second pixel set based on the second reference container face pixel regions of the n second images;
S132. Detect the edge contours of the first reference container face and the second reference container face respectively, and extract the coordinates of four corner points from each edge contour using the Ramer-Douglas-Peucker (RDP) algorithm;
S133. Based on the corner coordinates of the m first reference container faces and the n second reference container faces, establish m×n projection transformation relationships;
S134. Based on each projection transformation relationship, project the pixel coordinates in the first reference container face region of the corresponding first image into the coordinate system of the second image, and take the set of projected pixel coordinates as the third pixel set;
S135. For each projection transformation relationship, calculate the intersection over union (IoU) between the third pixel set and the second pixel set corresponding to that relationship, and discard projection transformation relationships whose IoU is lower than a preset threshold.
[0015] Preferably, step S135 includes:
S1351. For each projection transformation relationship, take the number of pixels in the intersection of the third pixel set and the corresponding second pixel set as the numerator and the number of pixels in their union as the denominator to calculate the IoU of each third pixel set and the corresponding second pixel set;
S1352. Discard projection transformation relationships whose IoU is lower than a preset threshold, wherein the preset threshold ranges from 65% to 95%.
[0016] Preferably, step S140 includes:
S141. Based on each projection transformation relationship that passes the filter, map the first reference container face pixel region of the first image onto the corresponding second reference container face pixel region of the second image;
S142. Based on the background area of the second image, adjust the color and / or style of the first reference container face pixel region mapped onto the second reference container face pixel region to blend it with the background;
S143. Project the pixel-level label mask of each first image onto the mapped second image based on the corresponding projection transformation relationship to generate a synthetic image as a sample;
S144. Collect the synthetic images to establish a training sample set of container images.
[0017] Embodiments of the present invention also provide a sample generation system based on projection transformation, used to implement the above-described sample generation method based on projection transformation, wherein the sample generation system includes:
an image material module, which collects labeled first images containing containers and unlabeled second images containing containers, wherein each first image has a pixel-level label mask;
an image segmentation module, which performs image segmentation on each second image containing a container to identify the second reference container face pixel region in that second image;
a projection relationship module, which establishes a first pixel set based on the first reference container face pixel region of each first image and a second pixel set based on the second reference container face pixel region of each second image, projectively transforms the first reference container face pixel region of each first image onto the second reference container face pixel region of each second image to establish projection transformation relationships, and filters the projection transformation relationships by the intersection over union (IoU) of the pixel coordinate set obtained after the projection transformation and the second pixel set;
a sample bonding module, which maps the first reference container face pixel region of the first image onto the corresponding second reference container face pixel region of the second image according to each projection transformation relationship that passes the filter, and projects the pixel-level label mask of the first image onto the mapped second image to generate a composite image.
[0018] Embodiments of the present invention also provide a sample generation device based on projection transformation, comprising: a processor; and a memory in which executable instructions of the processor are stored; wherein the processor is configured to perform the steps of the above-described sample generation method based on projection transformation by executing the executable instructions.
[0019] Embodiments of the present invention also provide a computer-readable storage medium for storing a program that, when executed, implements the steps of the above-described sample generation method based on projection transformation.
[0020] The purpose of this invention is to provide a sample generation method, system, device and storage medium based on projection transformation, which can realize the generation of high-quality and diversified training data with zero manual annotation cost, and significantly improve the generalization ability and deployment efficiency of segmentation models in new scenarios.
Attached Figure Description
[0021] Other features, objects, and advantages of the invention will become more apparent from the following detailed description of non-limiting embodiments with reference to the accompanying drawings.
[0022] Figure 1 is a flowchart of the sample generation method based on projection transformation of the present invention.
[0023] Figure 2 is a schematic diagram of the first image set in the sample generation method based on projection transformation of the present invention.
[0024] Figure 3 is a schematic diagram of the second image set in the sample generation method based on projection transformation of the present invention.
[0025] Figure 4 is a schematic diagram illustrating the principle of generating a third image based on a first image and a second image in the sample generation method based on projection transformation of the present invention.
[0026] Figure 5 is an overall architecture diagram of the sample generation system based on projection transformation of the present invention.
[0027] Figure 6 is a schematic diagram of the sample generation device based on projection transformation according to the present invention.
[0028] Figure 7 is a schematic diagram of the structure of a computer-readable storage medium according to an embodiment of the present invention.
Detailed Implementation
[0029] The following specific examples illustrate the implementation methods of this application. Those skilled in the art can easily understand the other advantages and effects of this application from the content disclosed herein. This application can also be implemented or applied through other different specific embodiments, and various details in this application can be modified or changed according to different viewpoints and application systems without departing from the spirit of this application. It should be noted that, unless otherwise specified, the embodiments and features in the embodiments of this application can be combined with each other.
[0030] The embodiments of this application will now be described in detail with reference to the accompanying drawings, so that those skilled in the art can easily implement the application. This application may be embodied in many different forms and is not limited to the embodiments described herein.
[0031] In this application, the terms "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., refer to specific features, structures, materials, or characteristics represented in connection with that embodiment or example, which are included in at least one embodiment or example of this application. Furthermore, the specific features, structures, materials, or characteristics represented may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate different embodiments or examples represented in this application, as well as features of different embodiments or examples.
[0032] Furthermore, the terms "first" and "second" are used for illustrative purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the representation of this application, "multiple" means two or more, unless otherwise explicitly specified.
[0033] For the purpose of clearly describing this application, devices that are not relevant to the description are omitted, and the same or similar components throughout the specification are given the same reference numerals.
[0034] Throughout this specification, when a device is said to be "connected" to another device, this includes not only a "direct connection" but also an "indirect connection" in which other components are placed in between. Furthermore, when a device is said to "comprise" a certain constituent element, unless otherwise stated, this does not exclude other constituent elements, but rather implies that other constituent elements may be included.
[0035] When we say that a device is "above" another device, this can mean that it is directly above the other device, or it can mean that other devices are present in between. Conversely, when we say that a device is "directly" "above" another device, there are no other devices present in between.
[0036] Although the terms first, second, etc., are used in some instances herein to refer to various elements, these elements should not be limited by these terms. These terms are used only to distinguish one element from another. For example, first interface and second interface, etc., are used. Furthermore, as used herein, the singular forms “a,” “an,” and “the” are intended to also include the plural forms unless the context indicates otherwise. It should be further understood that the terms “comprising,” “including,” indicate the presence of features, steps, operations, elements, components, items, kinds, and / or groups, but do not exclude the presence, occurrence, or addition of one or more other features, steps, operations, elements, components, items, kinds, and / or groups. The terms “or” and “and / or” as used herein are interpreted as inclusive, or mean any one or any combination thereof. Thus, “A, B, or C” or “A, B, and / or C” means “any one of: A; B; C; A and B; A and C; B and C; A, B, and C.” Exceptions to this definition will only occur if the combination of elements, functions, steps, or operations is inherently mutually exclusive in some way.
[0037] The technical terms used herein are for reference only to specific embodiments and are not intended to limit the scope of this application. The singular form used herein includes the plural form unless the statement explicitly indicates otherwise. The word "comprising" as used in the specification means to specify a particular characteristic, region, integer, step, operation, element, and / or component, and does not exclude the presence or addition of other characteristics, regions, integers, steps, operations, elements, and / or components.
[0038] Although not explicitly defined, all terms, including technical and scientific terms used herein, shall have the same meaning as commonly understood by one of ordinary skill in the art to which this application pertains. Terms defined in commonly used dictionaries shall be further interpreted as having a meaning consistent with the relevant technical literature and the content of this present application, and shall not be over-interpreted as having an ideal or overly formulaic meaning unless otherwise defined.
[0039] Based on the technical difficulties disclosed in the background art, there is an urgent need in this field for a technical solution that can automatically, cost-effectively, and rapidly generate a large amount of container image data that is highly adapted to the target new scene and has precise pixel-level annotations, so as to enable the rapid transfer and generalization capabilities of the segmentation model and break through the data bottleneck of the current large-scale application of the automatic container defect inspection system.
[0040] Figure 1 is a flowchart of the sample generation method based on projection transformation of the present invention. As shown in Figure 1, the method includes the following steps: S110. Collect labeled first images containing containers and unlabeled second images containing containers. Each first image has a pixel-level label mask. The number of second images is significantly smaller than the number of first images; that is, a large amount of high-quality new scene training data is generated by combining a small number of new scene images with a large number of old scene materials.
[0041] S120. Perform image segmentation on each second image containing a container to identify the second reference container face pixel region in the second image. This step uses a general image segmentation model (such as the SAM model) for initial segmentation, followed by a series of post-processing steps (a morphological closing operation to fill holes, computation of a distance transform map, Otsu binarization for noise removal, and connected component analysis to select the largest region) to obtain an accurate and complete second reference container face pixel region.
[0042] S130. A first pixel set is established based on the first reference container face pixel region of the first image, and a second pixel set is established based on the second reference container face pixel region of the second image. The first reference container face pixel region of each first image is projectively transformed onto the second reference container face pixel region of each second image, and a projection transformation relationship is established for each pair. To ensure projection quality, the projection transformation relationships are then filtered using the intersection over union (IoU) of the pixel coordinate set obtained after the projection transformation and the second pixel set: for example, relationships with IoU values below a preset threshold (e.g., 0.7, or a value within the range of 65% to 95%) are discarded, retaining only image pairs that achieve good geometric alignment.
[0043] S140. Based on each filtered projection transformation relationship, the first reference container surface pixel region of the first image is mapped onto the corresponding second reference container surface pixel region of the second image, and the pixel-level label mask of the first image is projected onto the mapped second image to generate a composite image. First, according to the corresponding projection transformation relationship, the first reference container surface pixel region of the first image is mapped onto the second reference container surface pixel region of the second image, completing preliminary geometric alignment. To make the mapped region visually compatible with the new background environment, the color and / or style of the mapped region needs to be adjusted, for example, through adaptive instance normalization based on LAB color space statistical features, to blend it with the background region of the second image. Simultaneously, the pixel-level label mask carried by the first image is subjected to the exact same projection transformation and then overlaid onto the corresponding position in the second image (this position is initialized as a background label), thereby automatically generating the pixel-level label mask of the composite image. Finally, all generated composite images are collected to construct a large-scale training sample set of container images for the new scene.
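The text names adaptive instance normalization on LAB statistics but does not give a formula. As a minimal sketch, assuming the adjustment amounts to matching the per-channel mean and standard deviation of the mapped region to those of the second image's background, and assuming both pixel arrays have already been converted to LAB by an image library, the statistics transfer could look as follows in NumPy. The function name `match_lab_statistics` and the `eps` parameter are hypothetical.

```python
import numpy as np

def match_lab_statistics(region_lab, background_lab, eps=1e-6):
    """Shift and scale each LAB channel of the mapped region so that its
    mean and standard deviation match those of the background pixels.

    region_lab:     (N, 3) float array, LAB pixels of the mapped container face
    background_lab: (M, 3) float array, LAB pixels of the second image's background
    Both arrays are assumed to be in LAB already (conversion not shown).
    """
    region_lab = np.asarray(region_lab, dtype=np.float64)
    background_lab = np.asarray(background_lab, dtype=np.float64)
    src_mean, src_std = region_lab.mean(axis=0), region_lab.std(axis=0)
    ref_mean, ref_std = background_lab.mean(axis=0), background_lab.std(axis=0)
    # Normalise with the region's own statistics, then re-style with the background's.
    normalised = (region_lab - src_mean) / (src_std + eps)
    return normalised * ref_std + ref_mean
```

A practical pipeline would probably blend the adjusted and original values with a strength factor rather than replace them outright; the source does not say which.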
[0044] This invention aims to synthesize, automatically and without human intervention, large amounts of training data that are visually consistent with the target scene, geometrically aligned, and precisely labeled. This effectively addresses the performance degradation of segmentation models caused by scene changes and removes the burden of expensive manual labeling in new scenes. To achieve the above objective, the first aspect of this invention provides a sample generation method based on projection transformation. The core concept is as follows: using a small number of unlabeled images of a new scene as a "background canvas," the main container region is identified using an image segmentation technique (such as the SAM model); then, a precise mapping relationship is established between this region and the corresponding region in a large number of existing, labeled container images (the "material library") through geometric projection transformation; finally, based on this mapping relationship, the image content of the containers in the material library and their accompanying pixel-level labels are "seamlessly" integrated into the new scene background after adaptive color style adjustment, and precise labels for the integrated image are generated simultaneously.
[0045] In a preferred embodiment, step S110 includes: S111. Collect m first images (old scene materials) with containers and pixel-level label masks to create a first image set.
[0046] S112. Collect n unlabeled second images (new scene materials) with containers to establish a second image set, where n is less than m.
[0047] In a preferred embodiment, the first image includes at least one first container face, which is either a side container face or a top container face. The first container face that occupies the most pixels in the first image is defined as the first reference container face.
[0048] The second image contains at least one second container face, which is either a side container face or a top container face. The second container face that occupies the most pixels in the second image is defined as the second reference container face.
[0049] In a preferred embodiment, step S120 includes: S121. For each second image containing a container, the pixels corresponding to the second reference container face are segmented using a general image segmentation model, and the background pixels are removed.
[0050] S122. Perform a morphological closing operation on the segmented second reference container face region to fill any voids and gaps inside it, obtaining a third image.
[0051] S123. Compute a distance transform map of the third image, in which the value of each pixel is the distance from that pixel to the nearest background pixel.
[0052] S124. Compute a binarization threshold for the distance transform map using Otsu's maximum inter-class variance algorithm, and binarize the map to remove noise points.
[0053] S125. Perform connected component analysis on the binarized distance transform map and select the connected region with the largest area as the second reference container face pixel region of the second image.
[0054] In a preferred embodiment, step S130 includes: S131. Establish a first pixel set based on the first reference container face pixel regions of the m first images, and establish a second pixel set based on the second reference container face pixel regions of the n second images.
[0055] S132. Detect the edge contours of the first reference container face and the second reference container face respectively, and use the Ramer-Douglas-Peucker (RDP) algorithm to extract the coordinates of the four corner points from each edge contour.
[0056] S133. Based on the corner coordinates of the m first reference container faces and the n second reference container faces, establish m×n (m multiplied by n) projection transformation relationships.
[0057] S134. Based on each projection transformation relationship, the pixel coordinates in the first reference container face region of the corresponding first image are projected into the coordinate system of the second image, and the set of projected pixel coordinates is taken as the third pixel set.
[0058] S135. For each projection transformation relationship, calculate the intersection over union (IoU) between the third pixel set and the second pixel set corresponding to that projection transformation relationship, and discard projection transformation relationships whose IoU is lower than a preset threshold.
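Steps S133 and S134 can be illustrated with a small NumPy sketch. It assumes the four corners of each face are already available (for example from an RDP contour simplification) and listed in the same order for both faces; the function names are hypothetical, and the 3×3 matrix is obtained by solving the standard eight-equation linear system for a four-point projective transform.

```python
import numpy as np

def homography_from_corners(src, dst):
    """Solve the 3x3 projective transform that maps four src corners to four dst corners."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h11*x + h12*y + h13) / (h31*x + h32*y + 1), and likewise for v.
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = np.linalg.solve(np.array(A, dtype=float), np.array(b, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def project_points(H, pts):
    """Apply H to an (N, 2) array of pixel coordinates (step S134)."""
    pts = np.asarray(pts, dtype=float)
    homog = np.hstack([pts, np.ones((len(pts), 1))]) @ H.T
    return homog[:, :2] / homog[:, 2:3]

def build_relationships(first_corners, second_corners):
    """Return {(i, j): H} for every first/second image pair, m x n entries (step S133)."""
    return {(i, j): homography_from_corners(src, dst)
            for i, src in enumerate(first_corners)
            for j, dst in enumerate(second_corners)}
```

With m first images and n second images, `build_relationships` yields m×n candidate transforms, which are then screened by IoU.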
[0059] In a preferred embodiment, step S135 includes: S1351. For each projection transformation relationship, take the number of pixels in the intersection of the third pixel set and the corresponding second pixel set as the numerator and the number of pixels in their union as the denominator to calculate the IoU of each third pixel set and the corresponding second pixel set.
[0060] S1352. Discard projection transformation relationships whose IoU is lower than a preset threshold. The preset threshold ranges from 65% to 95%.
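The IoU screening in S1351 and S1352 reduces to set arithmetic on pixel coordinates. The sketch below is illustrative only: it represents each pixel set as a Python set of integer (x, y) tuples, and the names `pixel_iou` and `filter_relationships` as well as the default threshold of 0.7 are assumptions chosen to fall inside the stated 65% to 95% range.

```python
def pixel_iou(third_set, second_set):
    """IoU of two pixel-coordinate sets: |intersection| / |union|."""
    third_set, second_set = set(third_set), set(second_set)
    union = third_set | second_set
    return len(third_set & second_set) / len(union) if union else 0.0

def filter_relationships(candidates, threshold=0.7):
    """candidates maps a pair key (i, j) to (third_pixel_set, second_pixel_set).

    Returns the keys whose IoU reaches the threshold; the rest are discarded.
    """
    return [key for key, (third, second) in candidates.items()
            if pixel_iou(third, second) >= threshold]
```

For example, two 10×10 squares offset by one pixel share 90 of 110 pixels, an IoU of about 0.82, so that pair is retained at a threshold of 0.7, whereas an offset of five pixels gives 50 / 150 and is discarded.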
[0061] In a preferred embodiment, step S140 includes: S141. Based on each filtered projection transformation relationship, the first reference box surface pixel region of the first image is mapped onto the second reference box surface pixel region of the corresponding second image.
[0062] S142. Based on the background area of the second image, adjust the color and/or style of the first reference box surface pixel area mapped onto the second reference box surface pixel area to blend it with the background.
[0063] S143. Project the pixel-level label mask of each first image onto the pasted second image based on the corresponding projection transformation relationship to generate a synthetic image as a sample.
[0064] S144. Collect synthetic images to establish a training sample set of container images.
[0065] The specific implementation process of this invention is detailed below. The core inventive concept of this invention is as follows: for a new scene (second image), automatically find a labeled old scene image (first image) with a matching geometric shape, and then use a precise projection transformation to "warp" the content and labels of the old image into the corresponding position of the new image. After visual fusion processing, training data suitable for the new scene is generated. The technical solution of this invention is described in detail below with reference to the accompanying drawings.
[0066] Referring to Figure 1, the sample generation method based on projection transformation according to this invention mainly includes four logically interconnected stages: the first stage, material preparation (corresponding to step S110 of the claims); the second stage, new scene image segmentation and reference box extraction (corresponding to step S120 of the claims); the third stage, projection transformation relationship calculation and quality filtering (corresponding to step S130 of the claims); and the fourth stage, image synthesis, visual fusion, and automatic annotation generation (corresponding to step S140 of the claims). These four stages constitute a complete automated data generation pipeline.
[0067] Phase 1: Material Preparation
[0068] The goal of this stage is to prepare the input data sources required for the algorithm to run, including the labeled "material library" and the "new scene" images to be adapted. Figure 2 is a schematic diagram of the first image set in the sample generation method based on projection transformation of the present invention. Figure 3 is a schematic diagram of the second image set in the sample generation method based on projection transformation of the present invention. As shown in Figures 2 and 3, this stage involves two key datasets: First image set (see Figure 2; a large library of labeled images representing older scenes): This collection contains m container images with pixel-level semantic segmentation annotations, referred to as first images 1. The annotation information (pixel-level label mask) of a first image 1 accurately distinguishes the container region (which can be further subdivided into components such as doors and side panels) from the non-container background region in the image. These images typically come from historical projects, public datasets, or early annotation work, and should cover as many container types, colors, ages, lighting conditions, and possible damage patterns as possible to provide rich material.
[0069] Second image set (see Figure 3; a small set of images representing the new scene): This set contains n unlabeled images of containers taken in the new target deployment scene, referred to as second images 2. The second images 2 reflect the new environment that the model needs to adapt to, and their number n can be significantly less than m (e.g., dozens of images compared to thousands). These images should contain complete containers, but their visual features, such as viewpoint, lighting, and background, differ significantly from those of the first images 1.
[0070] The concept of a "reference container surface" is defined here to simplify subsequent geometric modeling. For a container image, there is typically a container surface facing the camera, either directly or at an angle, that occupies the largest pixel area in the image. This surface with the largest area is defined as the "reference container surface" of the image. In the first image 1, it is called the first reference container surface 11 (the remaining pixels outside the first reference container surface 11 constitute the first background pixel area 12 in the first image 1); in the second image 2, it is called the second reference container surface 21 (the remaining pixels outside the second reference container surface 21 constitute the second background pixel area 22 in the second image 2). Subsequent projection transformations will primarily revolve around these two reference container surfaces, assuming that the main visible surface of the container is approximately a plane, which meets the application conditions of homography transformation.
[0071] Phase 2: New Scene Image Segmentation and Reference Box Extraction
[0072] The goal of this stage is to automatically, robustly, and accurately segment the pixel region of the second reference container surface 21 of each input second image 2, providing accurate "target anchor points" for subsequent geometric alignment.
[0073] The specific implementation of this step (corresponding to step S120 and sub-steps S121 to S125 of claim 1) is as follows, combining an advanced general segmentation model with classic image post-processing techniques: S121. Initial Segmentation Using a General Model: A powerful general image segmentation model is used, such as the Segment Anything Model (SAM) released by Meta AI. The SAM model has excellent zero-shot segmentation capabilities, generating high-quality object masks based on minimal cues or automatic detection. For the second image 2, a mode that automatically generates masks for all salient objects can be used, or a coarse bounding box containing the container can be provided as a cue, and the SAM model will output an initial binary segmentation mask for the container. This mask initially marks the pixel regions of the container.
[0074] S122. Morphological Post-processing to Fill Holes: Initial segmentation may result in holes or narrow gaps within the container area due to occlusion, reflection, or model errors. Therefore, a morphological closing operation (dilation followed by erosion) is used to process the initial mask. The closing operation effectively fills small holes within the area and connects adjacent broken sections, resulting in a more complete and coherent container area mask. The processed image can be referred to as the third image.
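The closing operation in S122 can be illustrated with a minimal pure-Python sketch. The function names, the 3×3 structuring element, and the choice to ignore out-of-image neighbours are assumptions made for illustration; the source does not specify a kernel size, and a production pipeline would more likely call an image-processing library.

```python
def _morph(mask, want_all):
    """Apply a 3x3 binary erosion (want_all=True) or dilation (want_all=False).

    mask is a list of rows containing 0/1 values. Neighbours that fall
    outside the image are ignored (an assumption of this sketch).
    """
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            vals = [mask[ny][nx]
                    for ny in range(max(0, y - 1), min(h, y + 2))
                    for nx in range(max(0, x - 1), min(w, x + 2))]
            out[y][x] = int(all(vals)) if want_all else int(any(vals))
    return out


def close_mask(mask):
    """Morphological closing: dilation followed by erosion."""
    return _morph(_morph(mask, want_all=False), want_all=True)
```

With this kernel, a one-pixel hole inside a solid region is filled, while an isolated foreground pixel is left unchanged.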
[0075] S123. Calculate the distance transformation map: Calculate the distance transformation map D for the third image (foreground is a container area, background is other elements). The value of each foreground pixel in the distance transformation map D represents the Euclidean distance from that point to the nearest background pixel. Therefore, the pixel located at the center of the region has the largest distance value, and the distance value decreases as it approaches the edge.
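As a rough illustration of S123, the sketch below computes the distance transform map by brute force. The function name is hypothetical and the quadratic-time search is chosen only for clarity; real implementations use much faster algorithms.

```python
import math


def distance_transform(mask):
    """Return, for every foreground pixel, the Euclidean distance to the
    nearest background pixel; background pixels get 0.0.

    mask is a list of rows of 0/1 values. If the mask contains no
    background at all, foreground distances are reported as infinity.
    """
    h, w = len(mask), len(mask[0])
    background = [(y, x) for y in range(h) for x in range(w) if not mask[y][x]]
    dist = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                dist[y][x] = min(
                    (math.hypot(y - by, x - bx) for by, bx in background),
                    default=math.inf,
                )
    return dist
```

For a 3×3 foreground block centred in a 5×5 image, the centre pixel receives the value 2.0 and the pixels on the block's rim receive 1.0, matching the description that values peak at the centre of the region.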
[0076] S124. Binarization Denoising Based on Distance Transform: Otsu's algorithm (the maximum inter-class variance method) is applied to the distance transform map D. Otsu's algorithm automatically calculates the threshold that maximizes the inter-class variance when pixels are split into foreground and background classes by that threshold. Pixels with a distance value greater than the Otsu threshold are set to 1 (foreground), and the rest are set to 0 (background). This operation acts as a selective filter: the reference container surface is usually a relatively flat, convex area whose central part is farthest from the background and therefore has large distance values, so it is retained; attachments on the container edge, railing shadows, and sporadic noise caused by inaccurate segmentation lie close to the background, have small distance values, and are removed by the threshold. The result is a "cleaner" binary region that is closer to the main body of the reference container surface.
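The thresholding in S124 can be sketched as follows. This is an illustrative variant that evaluates the between-class variance directly on the sorted distance values instead of on a 256-bin histogram; the function names are hypothetical and a non-empty input is assumed.

```python
def otsu_threshold(values):
    """Return the threshold t that maximizes the between-class variance
    when values <= t form one class and values > t form the other."""
    vals = sorted(values)
    n = len(vals)
    total = sum(vals)
    best_t, best_var = vals[0], -1.0
    cum = 0.0
    for i in range(n - 1):
        cum += vals[i]
        if vals[i] == vals[i + 1]:
            continue  # a split must fall between two distinct values
        w0 = (i + 1) / n              # weight of the low class
        w1 = 1.0 - w0                 # weight of the high class
        mu0 = cum / (i + 1)           # mean of the low class
        mu1 = (total - cum) / (n - i - 1)  # mean of the high class
        var = w0 * w1 * (mu0 - mu1) ** 2
        if var > best_var:
            best_var, best_t = var, vals[i]
    return best_t


def binarize_by_otsu(dist):
    """Set pixels whose distance exceeds the Otsu threshold to 1, others to 0."""
    t = otsu_threshold([v for row in dist for v in row])
    return [[1 if v > t else 0 for v in row] for row in dist]
```

If every value is identical there is no valid split, and this sketch then marks all pixels as background.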
[0077] S125. Connected Component Analysis and Maximum Region Selection: Connected component analysis is performed on the image after binarization in step S124. Since the thresholding process in the previous step may have broken some weak connections, a container area may be segmented into several independent connected components. Among these connected components, the one with the largest pixel area is selected. The set of pixels covered by this largest connected component is ultimately determined as the second reference container surface pixel region 21 of the second image 2. This region is continuous, complete, and represents the main visible container surfaces in the image to the greatest extent possible.
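A minimal sketch of the selection in S125 is given below. It assumes 4-connectivity, which the source does not specify, and the function name is hypothetical.

```python
from collections import deque


def largest_component(mask):
    """Keep only the largest 4-connected foreground component of a 0/1 mask."""
    h, w = len(mask), len(mask[0])
    seen = [[False] * w for _ in range(h)]
    best = []
    for sy in range(h):
        for sx in range(w):
            if not mask[sy][sx] or seen[sy][sx]:
                continue
            # Breadth-first flood fill from an unvisited foreground pixel.
            comp, queue = [], deque([(sy, sx)])
            seen[sy][sx] = True
            while queue:
                y, x = queue.popleft()
                comp.append((y, x))
                for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                    if (0 <= ny < h and 0 <= nx < w
                            and mask[ny][nx] and not seen[ny][nx]):
                        seen[ny][nx] = True
                        queue.append((ny, nx))
            if len(comp) > len(best):
                best = comp
    out = [[0] * w for _ in range(h)]
    for y, x in best:
        out[y][x] = 1
    return out
```

When two components have the same size, this sketch keeps the one encountered first in row-major order.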
[0078] By following the steps above, even when faced with complex backgrounds and lighting changes, the key regions for geometric alignment can be extracted stably and accurately from new scene images.
[0079] Phase 3: Calculation of Projection Transformation Relationships and Quality Filtering
[0080] The goal of this stage is to calculate a geometric mapping relationship from the first reference box surface 11 to the second reference box surface 21 for each possible combination (a first image 1 and a second image 2), evaluate the quality of the mapping relationship, and select high-quality pairs for synthesis.
[0081] The specific implementation process of this step (corresponding to step S130 and sub-steps S131 to S135 of the claims) is as follows: S131. Establish pixel coordinate sets: For the m first images, extract all pixel coordinates of each first reference box surface 11 according to its labeled mask, forming m first pixel sets A. For the n second images, obtain all pixel coordinates of each second reference box surface 21 from the extraction results of the second stage, forming n second pixel sets B.
[0082] S132. Contour Detection and Corner Extraction: For each region of the first reference container surface 11 and the second reference container surface 21, its outer contour is first extracted using an edge detection algorithm (such as the Canny operator). Since container surfaces are typically convex quadrilaterals, the Ramer-Douglas-Peucker (RDP) algorithm is used to approximate the extracted contour with a polygon. The RDP algorithm approximates the original shape with fewer points within an acceptable error range by reducing the number of contour points. By setting appropriate parameters, the algorithm can output a quadrilateral approximation result, thus obtaining the pixel coordinates of the four vertices. These corner coordinates are the basis for calculating the projection transformation.
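The polygon approximation in S132 can be illustrated with a textbook recursive form of the Ramer-Douglas-Peucker algorithm. The function names and the example tolerance are assumptions; the source only states that appropriate parameters yield a quadrilateral.

```python
import math


def _point_line_distance(p, a, b):
    """Perpendicular distance from p to the line through a and b
    (falls back to the distance to a when a and b coincide)."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    if dx == 0 and dy == 0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / math.hypot(dx, dy)


def rdp(points, epsilon):
    """Simplify a polyline, keeping only points that deviate by more than
    epsilon from the chord between the current end points."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    idx, dmax = 0, 0.0
    for i in range(1, len(points) - 1):
        d = _point_line_distance(points[i], a, b)
        if d > dmax:
            idx, dmax = i, d
    if dmax <= epsilon:
        return [a, b]
    left = rdp(points[:idx + 1], epsilon)
    right = rdp(points[idx:], epsilon)
    return left[:-1] + right  # drop the duplicated split point
```

For a closed contour whose first and last points coincide, such as a slightly noisy rectangle outline, the result lists the four corners followed by a repeat of the starting corner, so the final element can be dropped to obtain the four corner coordinates.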
[0083] S133. Establishing Projection Transformation Relationships: Projection transformation, specifically homography transformation, describes the perspective mapping relationship between two planes and can be represented by a 3x3 matrix H. For any pair of corner point sets (4 points) of the first reference box and the second reference box, the homography matrix H can be solved using the Direct Linear Transform (DLT) algorithm, or calculated using the cv2.getPerspectiveTransform function from libraries such as OpenCV. This establishes the projection transformation relationship from the first image to the second image for all possible image pairs (m×n pairs in total).
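As an illustration of S133, the sketch below solves for H from four corner correspondences by fixing the bottom-right entry of H to 1 and solving the resulting 8×8 linear system, which is one common way to pose the four-point case. The function names are hypothetical, and the sketch assumes the two corner lists are given in the same order (for example, both clockwise from the top-left corner).

```python
def solve_linear(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_corners(src, dst):
    """Return the 3x3 matrix H mapping each (x, y) in src to (u, v) in dst."""
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], h[6:8] + [1.0]]


def project_point(H, x, y):
    """Apply H to (x, y) and divide by the homogeneous coordinate."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)
```

Repeating `homography_from_corners` for every first image and every second image gives the m×n candidate relationships described above.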
[0084] S134. Projection Verification and Coordinate Mapping: The homography matrix calculated solely based on the four corner points may contain errors because corner detection may be inaccurate, or the box surface may not be a perfect plane. To quantitatively evaluate the quality of each projection transformation, verification is required: For a given projection transformation matrix H, the coordinates of all pixels in the corresponding first reference box surface region (i.e., the first pixel set A) are projected (transformed) through matrix H onto the coordinate system of the second image 2, resulting in a new set of coordinates, called the third pixel set C. This set C represents the position in the second image where the first image box surface "should" appear according to the current transformation relationship.
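The construction of the third pixel set C in S134 might look like the sketch below. The function name and the rounding to integer pixel coordinates are assumptions of this illustration. Forward mapping with rounding can leave small gaps when the region is enlarged, so a practical implementation may prefer to warp the binary mask by inverse mapping.

```python
def project_pixels(H, pixels):
    """Project a set of (x, y) pixel coordinates through the 3x3 matrix H
    and return the set of rounded target coordinates."""
    projected = set()
    for x, y in pixels:
        w = H[2][0] * x + H[2][1] * y + H[2][2]
        if abs(w) < 1e-12:
            continue  # the point maps to infinity; skip it
        u = (H[0][0] * x + H[0][1] * y + H[0][2]) / w
        v = (H[1][0] * x + H[1][1] * y + H[1][2]) / w
        projected.add((round(u), round(v)))
    return projected
```

Because the result is a plain set of coordinates, it can be compared with the second pixel set B using ordinary set operations.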
[0085] S135. Intersection over Union (IoU) Calculation and Filtering: Calculate the Intersection over Union (IoU) ratio between the third pixel set (the theoretical position projected) and the actual second reference container surface pixel set (the second pixel set B, i.e., the actual position of the container in the second image). The formula for calculating the IoU value is: IoU = |C ∩ B| / |C ∪ B|, where C represents the third pixel set and B represents the second pixel set. The IoU value ranges from 0 to 1. The closer the value is to 1, the more accurate the projection transformation and the better the overlap between the two regions.
[0086] S1352. Set a quality threshold, such as 0.7 (or select within the range of 65% to 95% depending on the actual situation). Iterate through all m×n projection transformation relationships and filter out those transformation relationships with IoU values lower than the preset threshold. This means that only image pairs that can very accurately "align" the old scene box surface to the position of the new scene box surface will be retained for the next stage of the compositing process. This filtering step is crucial; it is the core quality control step to ensure the geometric realism of the final composite data.
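The IoU formula and the threshold filter can be sketched as follows, assuming the pixel sets are represented as Python sets of coordinate tuples. The function names and the candidate tuple layout are hypothetical; the default threshold of 0.7 is the example value given above.

```python
def iou(c, b):
    """Intersection over union of two pixel coordinate sets."""
    union = len(c | b)
    return len(c & b) / union if union else 0.0


def filter_pairs(candidates, threshold=0.7):
    """Keep the identifiers of pairs whose IoU reaches the threshold.

    candidates is an iterable of (pair_id, C, B) tuples, where C is the
    projected third pixel set and B is the second pixel set.
    """
    return [pair_id for pair_id, c, b in candidates if iou(c, b) >= threshold]
```

Pairs whose IoU falls below the threshold are simply absent from the returned list, which corresponds to filtering them out.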
[0087] Phase 4: Image synthesis, visual fusion, and automatic annotation generation
[0088] This stage is the final step in data generation. Based on the high-quality projection relationships selected in the third stage, the image content is "transplanted," the visual effects are "fused," and the annotation information is "synchronously generated."
[0089] Figure 4 is a schematic diagram illustrating the principle of generating a third image from a first image and a second image in the sample generation method based on projection transformation of the present invention. As shown in Figure 4, the specific implementation of this step (corresponding to step S140 and sub-steps S141 to S144 of the claims) includes the following key operations: S141. Geometric Alignment and Preliminary Mapping (see Figure 4): For each retained triple (first image 1, second image 2, homography matrix H), a perspective transformation is applied to the first reference box surface region in the first image 1 using the homography matrix H. This transformation warps the box surface in the first image so that it has the same viewpoint and shape as the second reference box surface 21 in the second image. The transformed image patch is then placed directly (overlaid) onto the corresponding position in the second image 2. This completes the initial geometric alignment based on projection transformation.
[0090] S142. Color and Style Adaptive Blending: After geometric alignment, the color, brightness, and contrast of image patches may be inconsistent with the new background environment, resulting in a noticeable "pasting" effect. To solve this problem, visual blending processing of the textured areas is required.
[0091] A preferred implementation is based on statistical color transfer, performed in the LAB color space: a) Convert the image from the RGB color space to the LAB color space, because LAB separates luminance (L channel) from color information (A and B channels), which is closer to human visual perception and makes the channels easier to adjust independently.
[0092] b) Calculate the mean (μ_bg) and standard deviation (σ_bg) of the pixels in the background region surrounding the texture region in the L, A, and B channels in the second image 2, as well as the mean (μ_src) and standard deviation (σ_src) of the texture region itself.
[0093] c) Perform an Adaptive Instance Normalization (AdaIN) transformation on each pixel value I_src of the texture region: I_src' = (σ_bg / σ_src) · (I_src − μ_src) + μ_bg. This transformation causes the pixel value distribution (mean and standard deviation) of the texture area to converge towards that of the background area.
[0094] d) To increase the diversity of generated data and avoid excessive uniformity, a random intensity factor α ∈ [0.4, 0.8] can be introduced to soften the above transformation: I_src' = ([(1 − α)σ_src + ασ_bg] / σ_src) · (I_src − μ_src) + [(1 − α)μ_src + αμ_bg]. Here α controls the degree of style transfer: when α = 1 the transfer is complete (the formula reduces to that of step c), and when α = 0 the pixel values remain unchanged.
[0095] e) Color consistency check and iterative adjustment: Calculate the average color difference ΔE (in LAB space) between the adjusted texture area and the background area (the second background pixel area 22 of the second image). If ΔE is greater than an acceptable threshold (e.g., 5), the color fusion is still not ideal. In this case, the α value can be increased appropriately (a larger α pulls the statistics of the texture area closer to those of the background), and steps c) and d) can be recalculated until ΔE meets the preset requirement.
[0096] f) Edge Feathering: Even after color adjustments, the hard edges of the texture area may still appear abrupt at the pixel level. Therefore, Gaussian feathering is applied at the boundary between the texture area and the background. The feathering width r can be dynamically calculated based on the size of the texture area, for example, r = clamp(width / 100, 3, 15) pixels, where width is the width of the texture area. A Gaussian kernel of this width is used to smoothly blend the boundary area, achieving a natural visual transition.
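The softened transfer of steps c) and d) and the feathering width rule of step f) can be sketched for a single LAB channel as below. The function names are hypothetical, population standard deviation is assumed, and a perfectly flat texture region (σ_src = 0) is handled by leaving the spread unchanged, which is a choice of this sketch and not something the source specifies.

```python
import statistics


def soft_transfer(src, bg, alpha):
    """Blend the mean and standard deviation of one channel of the texture
    region (src) towards those of the background (bg) with strength alpha."""
    mu_s, mu_b = statistics.fmean(src), statistics.fmean(bg)
    sd_s, sd_b = statistics.pstdev(src), statistics.pstdev(bg)
    if sd_s == 0:
        scale = 1.0  # flat region: keep the spread as it is (assumption)
    else:
        scale = ((1 - alpha) * sd_s + alpha * sd_b) / sd_s
    shift = (1 - alpha) * mu_s + alpha * mu_b
    return [scale * (v - mu_s) + shift for v in src]


def feather_width(region_width):
    """r = clamp(width / 100, 3, 15), in pixels."""
    return max(3, min(15, region_width / 100))
```

Applying `soft_transfer` separately to the L, A, and B channels with α = 1 reproduces the full transfer of step c), while α = 0 returns the input values unchanged.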
[0097] S143. Automatic Generation and Projection of Label Mask: This is the core step in achieving "zero labeling cost" in this invention. The first image 1 is accompanied by a precise pixel-level label mask. This label mask undergoes the exact same geometric transformation as the first reference container surface area—that is, perspective transformation using the same homography matrix H. Thus, the label mask undergoes deformation completely consistent with the image content. Then, a blank label mask is created with the same size as the second image 2 and all pixels initialized to the "background" category. The deformed source label mask is then overlaid on this blank mask at its corresponding position (i.e., the position of the mapping area). Thus, the pixel-level label mask of the composite image 3 is automatically obtained. This mask precisely labels the pixel assignments of the "transplanted" containers and their various components in the composite image, without any manual intervention.
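A minimal sketch of the label mask projection in S143 is shown below. It uses inverse mapping with nearest-neighbour sampling so that class identifiers are never interpolated. The function names, the convention that x is the column index and y the row index, and the use of 0 as the background label are assumptions made for illustration.

```python
def invert_3x3(m):
    """Invert a 3x3 matrix using its adjugate."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)
    if abs(det) < 1e-12:
        raise ValueError("homography is not invertible")
    return [[(e * i - f * h) / det, (c * h - b * i) / det, (b * f - c * e) / det],
            [(f * g - d * i) / det, (a * i - c * g) / det, (c * d - a * f) / det],
            [(d * h - e * g) / det, (b * g - a * h) / det, (a * e - b * d) / det]]


def warp_label_mask(src_mask, H, out_h, out_w, background=0):
    """Warp an integer label mask with H into a blank out_h x out_w mask."""
    inv = invert_3x3(H)
    src_h, src_w = len(src_mask), len(src_mask[0])
    out = [[background] * out_w for _ in range(out_h)]
    for v in range(out_h):
        for u in range(out_w):
            # Map the target pixel back into the source image.
            w = inv[2][0] * u + inv[2][1] * v + inv[2][2]
            if abs(w) < 1e-12:
                continue
            x = round((inv[0][0] * u + inv[0][1] * v + inv[0][2]) / w)
            y = round((inv[1][0] * u + inv[1][1] * v + inv[1][2]) / w)
            if 0 <= x < src_w and 0 <= y < src_h:
                out[v][u] = src_mask[y][x]
    return out
```

Because the same matrix H is used for the image content and for the label mask, the labels land on exactly the pixels covered by the warped container surface.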
[0098] S144. Loop and Dataset Construction: A given second image 2 may have established high-quality projection relationships (IoU meeting the threshold) with multiple different first images 1 in the third stage. Multiple different first images 1 (showing different container states) can therefore be composited onto the same second image 2, generating a series of samples with different content but the same background and greatly enriching the diversity of the data. The system loops through steps S141 to S143, generating for each valid (first image, second image) pair a synthetic image 3 and its automatically generated label mask (the new pixel-level label mask is "transplanted" from the original pixel-level label mask of the first image 1 after deformation). Finally, all generated synthetic images 3 and label masks are collected to form a large-scale, high-quality training sample set for container segmentation targeting the new scene (represented by the second image set). Each image carries a pixel-level label mask and can be used directly for subsequent AI vision computation. The size of this set can be much larger than the number n of new scene images in the original input. For example, with 1,000 labeled first images and 30 unlabeled second images of the new scene, 1,000 × 30 = 30,000 projection relationships are generated; if 26,000 of them remain after quality filtering, approximately 26,000 composite images, each with a pixel-level label mask, are obtained, far more than the 30 real images of the new scene.
[0099] Compared with the prior art, the present invention has the following significant advantages: 1. Achieves highly efficient data generation with zero manual annotation cost: This invention creatively utilizes the semantic information of existing labeled data (first image) and, through precise geometric projection and intelligent image fusion technology, "migrates" it to a new unlabeled scene (second image), automatically generating labeled data for the new scene. This process is fully automated, completely avoiding time-consuming, expensive, and error-prone manual pixel-level annotation in the new scene, achieving "zero" annotation cost.
[0100] 2. Ensuring the geometric realism and annotation accuracy of the generated data: Through homography projection transformation based on the corner points of the container surface, combined with an intersection-over-union (IoU) filtering mechanism, the source container surface is ensured to achieve high-precision geometric alignment with the target container surface at the pixel level. This strict geometric alignment guarantees that the shape and perspective angle of the container in the generated synthetic image perfectly match the target scene. At the same time, its corresponding label mask also inherits the same geometric accuracy, which can be directly used to train high-requirement segmentation models, effectively avoiding edge misalignment, ghosting, and artifact problems caused by simple pasting.
[0101] 3. Significantly improved cross-scene generalization ability: By fusing a large number of diverse old scene container images (covering different states, lighting, and minor damage) into new target scene backgrounds, a large-scale training set can be quickly constructed that contains rich variation in the containers themselves and is strongly coupled with the new scene background. This combination of "content diversity" and "background novelty" pushes the model to focus on the essential features of the container during learning, rather than memorizing specific background patterns, thereby enhancing the model's adaptability and robustness to unknown new scenes.
[0102] 4. Achieves seamless visual integration: Building upon geometric alignment, this invention introduces color and style transfer techniques (such as adaptive instance normalization). By analyzing and matching the color statistical characteristics (mean, variance) of the textured area and the target background area, the "transplanted" container surface naturally harmonizes with the surrounding environment in terms of color, brightness, and hue. Further processing, such as edge feathering, greatly eliminates compositing artifacts, generating visually highly realistic images and improving the quality and reliability of the generated data.
[0103] 5. A fully automated and efficient data generation pipeline has been built: The entire process, from new scene image input, automatic segmentation, projection relationship calculation and filtering, to image fusion and annotation generation, forms a complete, closed-loop automated pipeline. Users only need to input a small number of new scene images and an existing annotation material library, and the system can generate tens of thousands of training samples with accurate annotations in a short time. This high efficiency supports the rapid iteration, validation, and deployment of models, effectively responding to the needs of changing business requirements.
[0104] Figure 5 is an overall architecture diagram of the sample generation system based on projection transformation according to the present invention. As shown in Figure 5, the sample generation system based on projection transformation of the present invention includes: an image material module 51, which collects labeled first images containing containers and unlabeled second images containing containers, each first image having a pixel-level label mask.
[0105] Image segmentation module 52 performs image segmentation on each second image containing a container in order to identify the reference container surface pixel region in the second image.
[0106] The projection relationship module 53 establishes a first pixel set based on the first reference box surface pixel region of the first image and a second pixel set based on the second reference box surface pixel region of the second image. It performs projection transformation on the first reference box surface pixel region of each first image to the second reference box surface pixel region of each second image, establishes projection transformation relationship respectively, and filters the projection transformation relationship by the intersection-union ratio of the pixel coordinate set obtained after projection transformation and the second pixel set.
[0107] The sample pasting module 54, according to each filtered projection transformation relationship, maps the first reference box surface pixel area of the first image to the second reference box surface pixel area of the corresponding second image, and projects the pixel-level label mask of the first image onto the mapped second image to generate a composite image.
[0108] In a preferred embodiment, the image material module 51 is configured to collect m first images of containers with pixel-level label masks to establish a first image set. It also collects n unlabeled second images of containers to establish a second image set, where n is less than m, but is not limited thereto.
[0109] In a preferred embodiment, the first image includes at least one first container surface, which may be a side surface or a top surface of the container. The first container surface occupying the most pixels in the first image is defined as the first reference container surface. The second image includes at least one second container surface, which may be a side surface or a top surface of the container. The second container surface occupying the most pixels in the second image is defined as the second reference container surface, but is not limited thereto.
[0110] In a preferred embodiment, the image segmentation module 52 is configured to: segment, in each second image containing a container, the pixels corresponding to the second reference container surface using a general image segmentation model, and infer the background pixels; perform a morphological closing operation on the segmented second reference container surface region to fill any holes and gaps, resulting in a third image; calculate a distance transform map for the third image, where the value of each pixel is its distance to the nearest background pixel; calculate a binarization threshold on the distance transform map using the maximum inter-class variance algorithm and binarize the map to remove noise points; and perform connected component analysis on the binarized distance transform map and select the largest connected region as the second reference container surface pixel region of the second image, but this is not a limitation.
[0111] In a preferred embodiment, the projection relationship module 53 is configured to establish a first pixel set based on the first reference box surface pixel regions of m first images, and a second pixel set based on the second reference box surface pixel regions of n second images. The edge contours of the first and second reference box surfaces are detected respectively, and the coordinates of four corner points are extracted from the edge contours using the RDP algorithm. Based on the corner coordinates of the m first reference box surfaces and the n second reference box surfaces, m×n projection transformation relationships are established. Based on the projection transformation relationships, the pixel coordinates within the first reference box surface region of the corresponding first image are projected onto the coordinate system of the second image to obtain the projected pixel coordinate set as the third pixel set. For each projection transformation relationship, the intersection over union (IoU) of the third pixel set and the second pixel set corresponding to that projection transformation relationship is calculated, and projection transformation relationships with an IoU lower than a preset threshold are filtered out, but this is not a limitation.
[0112] In a preferred embodiment, the projection relationship module 53 is further configured to calculate, for each projection transformation relationship, the IoU of each third pixel set and its corresponding second pixel set, using the number of pixels in the intersection of the third pixel set and the corresponding second pixel set as the numerator and the number of pixels in their union as the denominator. Projection transformation relationships with an IoU below a preset threshold are filtered out. The preset threshold ranges from 65% to 95%, but is not limited to this.
[0113] In a preferred embodiment, the sample pasting module 54 is configured to paste a first reference container surface pixel region of the first image onto a corresponding second reference container surface pixel region of the second image according to each filtered projection transformation relationship. Based on the background region of the second image, the color and / or style of the first reference container surface pixel region pasted onto the second reference container surface pixel region is adjusted to blend with the background. A pixel-level label mask of each first image is projected onto the pasted second image based on the corresponding projection transformation relationship to generate a synthetic image as a sample. The synthetic images are collected to establish a training sample set for container images, but are not limited thereto.
[0114] In summary, the sample generation system based on projection transformation of the present invention can generate high-quality and diverse training data with zero manual annotation cost, and significantly improves the generalization ability and deployment efficiency of the segmentation model in new scenarios.
[0115] This invention also provides a sample generation device based on projection transformation, including a processor and a memory storing executable instructions for the processor. The processor is configured to execute steps of a sample generation method based on projection transformation by executing the executable instructions.
[0116] As shown above, the sample generation device based on projection transformation of this invention in this embodiment can generate high-quality and diverse training data with zero manual annotation cost, which significantly improves the generalization ability and deployment efficiency of the segmentation model in new scenarios.
[0117] Those skilled in the art will understand that various aspects of the present invention can be implemented as systems, methods, or program products. Therefore, various aspects of the present invention can be specifically implemented in the following forms: a completely hardware implementation, a completely software implementation (including firmware, microcode, etc.), or a combination of hardware and software aspects, collectively referred to herein as a "circuit," "module," or "platform."
[0118] Figure 6 is a schematic diagram of the sample generation device based on projection transformation according to the present invention. An electronic device 600 according to this embodiment of the present invention is described below with reference to Figure 6. The electronic device 600 shown in Figure 6 is merely an example and should not impose any limitations on the functionality and scope of use of the embodiments of the present invention.
[0119] As shown in Figure 6, the electronic device 600 is presented in the form of a general-purpose computing device. The components of the electronic device 600 may include, but are not limited to: at least one processing unit 610, at least one storage unit 620, a bus 630 connecting different platform components (including storage unit 620 and processing unit 610), a display unit 640, etc.
[0120] The storage unit stores program code, which can be executed by the processing unit 610 so that the processing unit 610 performs the steps described in the method section of this specification according to various exemplary embodiments of the present invention. For example, the processing unit 610 can perform the steps shown in Figure 1.
[0121] Storage unit 620 may include readable media in the form of volatile storage units, such as random access memory (RAM) 6201 and / or cache memory 6202, and may further include read-only memory (ROM) 6203.
[0122] Storage unit 620 may also include a program / utility 6204 having a set of (at least one) program modules 6205. Such program modules 6205 include, but are not limited to: an operating system, one or more application programs, other program modules, and program data. Each of these examples, or some combination thereof, may include an implementation of a network environment.
[0123] Bus 630 can represent one or more of several types of bus structures, including a storage unit bus or storage unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus structures.
[0124] Electronic device 600 can also communicate with one or more external devices 700 (e.g., keyboard, pointing device, Bluetooth device, etc.), and with one or more devices that enable a user to interact with electronic device 600, and / or with any device that enables electronic device 600 to communicate with one or more other computing devices (e.g., router, modem, etc.). This communication can be performed via input / output (I / O) interface 650. Furthermore, electronic device 600 can also communicate with one or more networks (e.g., local area network (LAN), wide area network (WAN), and / or public networks, such as the Internet) via network adapter 660. Network adapter 660 can communicate with other modules of electronic device 600 via bus 630. It should be understood that, although not shown in the figures, other hardware and / or software modules can be used in conjunction with electronic device 600, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage platforms.
[0125] This invention also provides a computer-readable storage medium for storing a program, which, when executed, implements the steps of a sample generation method based on projection transformation. In some possible implementations, various aspects of this invention can also be implemented as a program product comprising program code that, when run on a terminal device, causes the terminal device to perform the steps described in the above-described method section of this specification according to various exemplary embodiments of the invention.
[0126] As shown above, the sample generation system based on projection transformation of this invention in this embodiment can generate high-quality and diverse training data with zero manual annotation cost, which significantly improves the generalization ability and deployment efficiency of the segmentation model in new scenarios.
[0127] Figure 7 is a schematic diagram of the structure of the computer-readable storage medium of the present invention. Referring to Figure 7, a program product 800 for implementing the above-described method according to an embodiment of the present invention is described. It may employ a portable compact disc read-only memory (CD-ROM) containing program code, and may run on a terminal device such as a personal computer. However, the program product of the present invention is not limited thereto. In this document, the readable storage medium may be any tangible medium containing or storing a program that may be used by or in conjunction with an instruction execution system, apparatus, or device.
[0128] The program product may employ any combination of one or more readable media. A readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of readable storage media (a non-exhaustive list) include: electrical connections having one or more wires, portable disks, hard disks, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disk read-only memory (CD-ROM), optical storage devices, magnetic storage devices, or any suitable combination thereof.
[0129] A computer-readable signal medium may include a data signal propagated in baseband or as part of a carrier wave, carrying readable program code. Such a propagated data signal may take various forms, including but not limited to an electromagnetic signal, an optical signal, or any suitable combination thereof. A readable signal medium may also be any readable medium, other than a readable storage medium, that can send, propagate, or transmit a program for use by or in connection with an instruction execution system, apparatus, or device. The program code contained on a readable medium may be transmitted using any suitable medium, including but not limited to wireless, wired, optical fiber, RF, etc., or any suitable combination thereof.
[0130] Program code for performing the operations of this invention can be written in any combination of one or more programming languages, including object-oriented programming languages such as Java and C++, and conventional procedural programming languages such as C or similar languages. The program code can execute entirely on the user's computing device, partially on the user's device, as a standalone software package, partially on the user's computing device and partially on a remote computing device, or entirely on a remote computing device or server. In cases involving remote computing devices, the remote computing device can be connected to the user's computing device via any type of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (e.g., via the Internet using an Internet service provider).
[0131] In summary, the purpose of this invention is to provide a sample generation method, system, device, and storage medium based on projection transformation, which can achieve high-quality and diversified training data generation with zero manual annotation cost, and significantly improve the generalization ability and deployment efficiency of segmentation models in new scenarios.
[0132] The above description, in conjunction with specific preferred embodiments, provides a further detailed explanation of the present invention. It should not be construed that the specific implementation of the present invention is limited to these descriptions. For those skilled in the art, various simple deductions or substitutions can be made without departing from the concept of the present invention, and all such modifications and substitutions should be considered within the scope of protection of the present invention.
Claims
1. A sample generation method based on projection transformation, characterized in that it comprises the following steps:
S110. Collect a labeled first image containing a container and an unlabeled second image containing a container, wherein the first image has a pixel-level label mask;
S120. Perform image segmentation on each second image containing a container to identify the reference container surface pixel region in the second image;
S130. Establish a first pixel set based on the first reference container surface pixel region of the first image, and establish a second pixel set based on the second reference container surface pixel region of the second image; perform projection transformation from the first reference container surface pixel region of each first image to the second reference container surface pixel region of each second image to establish a projection transformation relationship for each pair; and filter the projection transformation relationships by the intersection-over-union (IoU) of the pixel coordinate set obtained after the projection transformation and the second pixel set;
S140. According to each filtered projection transformation relationship, paste the first reference container surface pixel region of the first image onto the corresponding second reference container surface pixel region of the second image, and project the pixel-level label mask of the first image onto the pasted second image to generate a synthetic image.
2. The projection transformation-based sample generation method of claim 1, wherein step S110 includes:
S111. Collect m first images containing containers and pixel-level label masks to establish a first image set;
S112. Collect n unlabeled second images containing containers to form a second image set, where n is less than m.
3. The projection transformation-based sample generation method of claim 2, wherein the first image contains at least one first container surface, which is either a side surface or a top surface of a container, and the first container surface that occupies the most pixels in the first image is defined as the first reference container surface; the second image contains at least one second container surface, which is either a side surface or a top surface of a container, and the second container surface that occupies the most pixels in the second image is defined as the second reference container surface.
4. The projection transformation-based sample generation method of claim 3, wherein step S120 includes:
S121. For each second image containing a container, segment the pixels corresponding to the second reference container surface using a general-purpose image segmentation model, and remove the background pixels;
S122. Perform a morphological closing operation on the segmented second reference container surface region to fill any voids and gaps inside it, and obtain a third image;
S123. Calculate a distance transform map for the third image, in which the value of each pixel is the distance from that pixel to the nearest background pixel;
S124. For the distance transform map, calculate a binarization threshold using the maximum inter-class variance algorithm (Otsu's method), and perform binarization to remove noise points;
S125. Perform connected component analysis on the distance transform map and select the connected region with the largest area as the second reference container surface pixel region of the second image.
5. The projection transformation-based sample generation method of claim 3, wherein step S130 includes:
S131. Establish a first pixel set based on the first reference container surface pixel regions of the m first images, and establish a second pixel set based on the second reference container surface pixel regions of the n second images;
S132. Detect the edge contours of the first reference container surface and the second reference container surface respectively, and extract the coordinates of four corner points from each edge contour using the Ramer-Douglas-Peucker (RDP) algorithm;
S133. Based on the corner coordinates of the m first reference container surfaces and the n second reference container surfaces, establish m×n projection transformation relationships;
S134. Based on each projection transformation relationship, project the pixel coordinates in the first reference container surface pixel region of the corresponding first image into the coordinate system of the second image, and take the resulting set of projected pixel coordinates as a third pixel set;
S135. For each projection transformation relationship, calculate the intersection-over-union (IoU) between the third pixel set and the second pixel set corresponding to that projection transformation relationship, and filter out projection transformation relationships whose IoU is lower than a preset threshold.
6. The projection transformation-based sample generation method of claim 5, wherein step S135 includes:
S1351. For each projection transformation relationship, calculate the intersection-over-union (IoU) of the third pixel set and the corresponding second pixel set, using the intersection of the two pixel sets as the numerator and their union as the denominator;
S1352. Filter out projection transformation relationships whose IoU is lower than a preset threshold, wherein the preset threshold ranges from 65% to 95%.
7. The projection transformation-based sample generation method of claim 1, wherein step S140 includes:
S141. Based on each filtered projection transformation relationship, paste the first reference container surface pixel region of the first image onto the corresponding second reference container surface pixel region of the second image;
S142. Based on the background region of the second image, adjust the color and / or style of the first reference container surface pixel region pasted onto the second reference container surface pixel region so that it blends with the background;
S143. Project the pixel-level label mask of each first image onto the pasted second image based on the corresponding projection transformation relationship to generate a synthetic image as a sample;
S144. Collect the synthetic images to establish a training sample set of container images.
8. A projection transformation-based sample generation system for implementing the projection transformation-based sample generation method of claim 1, characterized in that it comprises:
an image material module, which collects a labeled first image containing a container and an unlabeled second image containing a container, wherein the first image has a pixel-level label mask;
an image segmentation module, which performs image segmentation on each second image containing a container to identify the reference container surface pixel region in the second image;
a projection relationship module, which establishes a first pixel set based on the first reference container surface pixel region of the first image and a second pixel set based on the second reference container surface pixel region of the second image, performs projection transformation from the first reference container surface pixel region of each first image to the second reference container surface pixel region of each second image to establish a projection transformation relationship for each pair, and filters the projection transformation relationships by the intersection-over-union (IoU) of the pixel coordinate set obtained after the projection transformation and the second pixel set;
a sample pasting module, which pastes the first reference container surface pixel region of the first image onto the corresponding second reference container surface pixel region of the second image according to each filtered projection transformation relationship, and projects the pixel-level label mask of the first image onto the pasted second image to generate a synthetic image.
9. A sample generation device based on projection transformation, characterized in that it comprises:
a processor;
a memory in which executable instructions of the processor are stored;
wherein the processor is configured to execute the steps of the sample generation method based on projection transformation according to any one of claims 1 to 7 by executing the executable instructions.
10. A computer-readable storage medium for storing a program, characterized in that, when the program is executed by a processor, it implements the steps of the sample generation method based on projection transformation according to any one of claims 1 to 7.
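For illustration only, steps S133 to S135 and S1351 to S1352 of the claims can be sketched as follows. This is a minimal sketch under stated assumptions, not the claimed implementation: all function names are hypothetical, the four corner points of each surface are assumed to be already extracted and listed in the same order for both surfaces, pixel regions are represented as sets of (x, y) tuples, and the default threshold of 0.65 is simply the lower end of the range stated in claim 6.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("degenerate corner configuration")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def homography_from_corners(src, dst):
    """Projective transformation mapping four src corners onto four dst corners.

    The matrix is normalised so that its bottom-right entry is 1, which
    leaves eight unknowns and two linear equations per corner pair.
    """
    a, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        a.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        b.append(u)
        a.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        b.append(v)
    h = solve_linear(a, b)
    return [h[0:3], h[3:6], [h[6], h[7], 1.0]]


def project_pixels(h, pixels):
    """Project a pixel set into the second image (the third pixel set, S134)."""
    out = set()
    for x, y in pixels:
        w = h[2][0] * x + h[2][1] * y + h[2][2]
        px = (h[0][0] * x + h[0][1] * y + h[0][2]) / w
        py = (h[1][0] * x + h[1][1] * y + h[1][2]) / w
        out.add((round(px), round(py)))
    return out


def iou(a, b):
    """Intersection size over union size of two pixel sets (S1351)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def filter_relationships(first_faces, second_faces, threshold=0.65):
    """Build all m x n relationships and keep those with IoU >= threshold.

    first_faces and second_faces are lists of (corners, pixel_set) pairs.
    Returns a list of (first_index, second_index, homography, iou).
    """
    kept = []
    for i, (corners_1, pixels_1) in enumerate(first_faces):
        for j, (corners_2, pixels_2) in enumerate(second_faces):
            h = homography_from_corners(corners_1, corners_2)
            score = iou(project_pixels(h, pixels_1), pixels_2)
            if score >= threshold:
                kept.append((i, j, h, score))
    return kept
```

Because the pixel set is projected point by point and rounded, a transformation that enlarges the surface leaves unfilled pixels in the third pixel set and lowers the measured IoU. An implementation working on dense masks would more likely warp the binary mask of the first reference container surface as a whole before comparing it with the second pixel set.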