An image generation method, device and storage medium
By acquiring and utilizing the keypoint spatial semantic information and global conditional information of the image generation model, the integrity and consistency issues of existing image generation algorithms in multi-object collaborative generation are solved, achieving high-quality and accurate image generation, which is suitable for complex multi-object scenarios in fields such as transportation.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- ZHEJIANG DAHUA TECH CO LTD
- Filing Date
- 2026-02-02
- Publication Date
- 2026-06-12
AI Technical Summary
Existing image generation algorithms struggle to balance the integrity of each target, the rationality of spatial layout, and scene consistency when generating multiple targets collaboratively. Furthermore, they cannot accurately integrate real-world scene constraints such as occlusion and lighting, resulting in deviations between the generated images and reality, making it difficult to meet the high-precision application requirements of specific fields.
By acquiring the spatial semantic information and global condition information of key points in the target image, the image generation model generates a target image containing the preset target. Combining the distribution and connection relationship of key points, the image generation model is guided to generate a high-quality and accurate image.
It significantly improves the quality and accuracy of image generation, ensuring the morphological accuracy and integrity of the preset target, and the generated images conform to the physical laws and perspective requirements of the real scene.
Figure CN122199717A_ABST
Description
Technical Field
[0001] This application relates to the fields of computer vision and generative artificial intelligence, specifically to an image generation method, device, and storage medium.
Background Technology
[0002] With the rapid iteration of artificial intelligence and computer vision technologies, image generation algorithms, as a core creative technology in the field of deep learning, have been implemented on a large scale in several key areas such as transportation, security, autonomous driving, and smart cities, and their technological value and application potential continue to be released. For example, generative image algorithms can significantly improve the intelligence level of the transportation system through data augmentation and scene simulation.
[0003] Image generation algorithms stem from two major breakthroughs: Generative Adversarial Networks (GANs) and diffusion models. GANs achieve rapid generation through adversarial training, while diffusion models synthesize high-quality images from noise through gradual denoising. However, existing algorithms still have significant shortcomings. For example, in multi-object collaborative generation, it is difficult to balance the integrity of each object, the rationality of spatial layout, and scene consistency. Furthermore, they cannot accurately integrate real-world scene constraints such as occlusion and lighting, leading to deviations between the generated images and reality. Consequently, they fail to meet the high-precision application requirements of complex multi-object scenes in specific domains.
Summary of the Invention
[0004] To address the aforementioned technical problems, this application provides an image generation method, apparatus, and storage medium to improve the quality and accuracy of image generation.
[0005] According to one embodiment of this application, an image generation method is provided, comprising:
[0006] Acquire several pieces of conditional information for generating a target image. The conditional information includes key point spatial semantic information of a preset target. The key point spatial semantic information includes the distribution information of several preset key points on the preset target and the structural information of the preset target. The structural information represents the connection relationship between at least two preset key points. The conditional information is input into the image generation model to guide the model to generate a target image containing the preset target.
[0007] To solve the above-mentioned technical problems, one technical solution adopted in this application is to provide an electronic device, including a memory and a processor, wherein the memory is used to store a computer program, and when the computer program is executed by the processor, it is used to implement the image generation method in the above-mentioned technical solution.
[0008] To solve the above-mentioned technical problems, one technical solution adopted in this application is to provide a computer-readable storage medium for storing a computer program, which, when executed by a processor, is used to implement the image generation method in the above-mentioned technical solution.
[0009] Through the above scheme, this application obtains several pieces of conditional information for generating a target image, inputs this information into an image generation model, and guides the model to generate a target image containing a preset target. The conditional information includes the spatial semantic information of key points of the preset target, which in turn includes the distribution information of several preset key points on the preset target and the structural information of the preset target. The structural information represents the connection relationship between at least two preset key points. Thus, this application constrains the spatial arrangement of the preset target more accurately through the distribution information of the target's key points, while ensuring the integrity of the target structure through the connection relationship of the key points. This improves the accuracy and integrity of the preset target's morphology in the generated target image, thereby improving the quality and accuracy of image generation.
Attached Figure Description
[0010] To more clearly illustrate the technical solutions in the embodiments of this application, the accompanying drawings used in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort. Wherein:
Figure 1 is a flowchart illustrating an embodiment of the image generation method provided in this application;
Figure 2 is a schematic diagram of the structure of an embodiment of the diffusion model provided in this application;
Figure 3 is a schematic diagram of the diffusion process in the diffusion model provided in this application;
Figure 4 is a schematic diagram of the diffusion process calculation using the diffusion model provided in this application;
Figure 5 is a schematic diagram of the system framework of the image generation method provided in this application;
Figure 6 is a flowchart illustrating a specific embodiment of the image generation method provided in this application;
Figure 7 is a schematic diagram of the architecture of an attention mechanism that integrates multiple input conditions provided in this application;
Figure 8 is a schematic diagram of the structure of an embodiment of the electronic device provided in this application;
Figure 9 is a schematic diagram of an embodiment of the computer-readable storage medium provided in this application.
Detailed Implementation
[0011] The present application will now be described in further detail with reference to the accompanying drawings and embodiments. It should be particularly noted that the following embodiments are for illustrative purposes only and do not limit the scope of the application. Similarly, the following embodiments are only some, not all, embodiments of the present application, and all other embodiments obtained by those skilled in the art without inventive effort are within the scope of protection of the present application.
[0012] In this application, the reference to "embodiment" means that a specific feature, structure, or characteristic described in connection with an embodiment may be included in at least one embodiment of this application. The appearance of this phrase in various places throughout the specification does not necessarily refer to the same embodiment, nor is it a separate or alternative embodiment mutually exclusive with other embodiments. It will be explicitly and implicitly understood by those skilled in the art that the embodiments described herein can be combined with other embodiments.
[0013] It should be noted that the terms "first," "second," etc., used in this application are for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of indicated technical features. Therefore, a feature defined with "first," "second," etc., may explicitly or implicitly include at least one of that feature. In the description of this application, "multiple" means at least two, such as two, three, etc., unless otherwise explicitly specified. Furthermore, the terms "comprising" and "having," and any variations thereof, are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or apparatus that includes a series of steps or units is not limited to the listed steps or units, but may optionally include steps or units not listed, or may optionally include other steps or units inherent to these processes, methods, products, or apparatuses.
[0014] Please see Figure 1, which is a schematic flowchart of an embodiment of the image generation method provided in this application. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the process sequence shown in Figure 1. As shown in Figure 1, this embodiment includes:
S110: Obtain several pieces of conditional information for generating the target image, including the spatial semantic information of the key points of the preset target.
[0015] The several pieces of conditional information refer to a set of one or more types of structured / unstructured information collected and organized during the initial stage of an image generation task to impose directional constraints on the output of the image generation model.
[0016] The conditional information includes key point spatial semantic information of a preset target. This key point spatial semantic information includes the distribution information of several preset key points on the preset target and the structural information of the preset target. The structural information represents the connection relationship between at least two of the preset key points. The distribution information of the preset key points can be represented in forms including, but not limited to, images of the key point locations of the preset target and textual descriptions of the key point locations. Preset key points are predefined location points (or feature points) that characterize the core features of the target and precisely constrain its shape, structure, and spatial layout. Their positions and number are set in advance based on the target's essential attributes and the generation requirements.
[0017] In one embodiment, the key point spatial semantic information is a multi-channel semantic map, wherein the multi-channel semantic map includes a key point heatmap and a target structure map corresponding to different channels respectively. The key point heatmap represents the distribution information of several preset key points, and the target structure map represents the structural information of a preset target. The key point heatmap includes sub-heatmaps corresponding to each preset key point, and each sub-heatmap corresponds to a different channel.
[0018] In one embodiment, for each preset key point of a preset target, a Gaussian circular spot is drawn at the pixel position of the preset key point to generate a corresponding sub-heatmap. According to the structure of the preset target, the preset key points are sequentially connected by lines to generate a target structure map. The sub-heatmaps and target structure maps corresponding to several preset key points are stitched together along the channel dimension to construct a multi-channel semantic map. For example, for a training image or an input condition, an N+1-channel multi-channel semantic map of the same size as the target image is generated, where N is the number of preset key points. In the first N channels, each channel corresponds to one preset key point, and a two-dimensional Gaussian circular spot (peak value of 1) is drawn at the two-dimensional pixel position of the preset key point to form a sub-heatmap. The last channel is the target structure map obtained by sequentially connecting the preset key points with related relationships according to the structure of the preset target.
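The construction described in this embodiment can be sketched in NumPy as follows. This is a minimal illustration, not the application's implementation: the function names, the Gaussian width `sigma`, the line rasterisation by dense sampling, and the four-point toy target are all assumptions made for the example.

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=2.0):
    """Sub-heatmap with a 2D Gaussian spot of peak value 1 centred on pixel (cx, cy)."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

def draw_line(canvas, p0, p1):
    """Rasterise a straight segment between two pixel positions by dense sampling."""
    (x0, y0), (x1, y1) = p0, p1
    steps = int(max(abs(x1 - x0), abs(y1 - y0))) + 1
    xs = np.rint(np.linspace(x0, x1, steps)).astype(int)
    ys = np.rint(np.linspace(y0, y1, steps)).astype(int)
    inside = (xs >= 0) & (xs < canvas.shape[1]) & (ys >= 0) & (ys < canvas.shape[0])
    canvas[ys[inside], xs[inside]] = 1.0

def build_semantic_map(keypoints, edges, height, width, sigma=2.0):
    """Stack N sub-heatmaps and one structure map into an (N + 1, H, W) array."""
    heatmaps = [gaussian_heatmap(height, width, x, y, sigma) for x, y in keypoints]
    structure = np.zeros((height, width), dtype=np.float64)
    for i, j in edges:  # each edge connects two related preset key points
        draw_line(structure, keypoints[i], keypoints[j])
    return np.stack(heatmaps + [structure], axis=0)

# Toy target with N = 4 key points (x, y) connected in a closed loop.
keypoints = [(8, 24), (24, 24), (24, 8), (8, 8)]
edges = [(0, 1), (1, 2), (2, 3), (3, 0)]
semantic_map = build_semantic_map(keypoints, edges, height=32, width=32)
```

Here `semantic_map` has N + 1 = 5 channels: channels 0 to 3 are the sub-heatmaps and channel 4 is the target structure map.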
[0019] In one embodiment, the conditional information further includes at least one of global conditional information and environmental spatial information. The global conditional information includes at least one of camera parameter information and scene description text information, and the environmental spatial information includes at least one of scene layout information and occlusion space information. The scene layout information is used to define the layout of at least one static background element in the target image, and the occlusion space information represents the position information of occluders that occlude the preset target.
[0020] In one embodiment, to ensure the rationality and diversity of the generated target image scene, a scene semantic segmentation map corresponding to the target image can be obtained. The scene semantic segmentation map is then encoded to obtain a scene feature map, which serves as scene layout information. For example, in a traffic scene, a scene semantic segmentation map is provided, defining static background elements such as lanes, sidewalks, sky, buildings, and traffic signs in the scene semantic segmentation image. The scene semantic segmentation map is encoded into a scene feature map using a multilayer perceptron or other deep learning model.
[0021] In one embodiment, to avoid generation distortion caused by occlusion logic confusion and to further improve the realism, structural accuracy, and scene consistency of the target image, occlusion spatial information can be used as conditional information. An occlusion binary map corresponding to the target image is obtained, and the occlusion binary map is mapped to an occlusion feature map as occlusion spatial information. The occlusion binary map includes a first image region where the occluder is located, and a second image region where the preset target and background are located. The first and second image regions correspond to different pixel values. For example, an occlusion binary semantic map of the same size as the output target image is provided. White areas (set to 1) represent the positions where the occluder should appear, and black areas (set to 0) represent the image regions where the vehicle and background should appear. The occlusion binary semantic map is mapped to an occlusion feature map using a multilayer perceptron or other deep learning model.
[0022] In one embodiment, to further improve the imaging logic of the generated target image to conform to the real scene, the imaging rules of the real camera (such as perspective relationships, viewpoint constraints, and scale ratios) can be used as conditional information for generating the target image. The camera parameters of the target camera are obtained and mapped into a camera parameter vector as camera parameter information. This camera parameter information can serve as a control condition for the viewpoint of the generated target image.
[0023] In one embodiment, to further conform to the physical laws of real-world scenes, scene information can also be used as conditional information for generating the target image. A scene text description of the target scene is obtained and encoded into a scene text vector as scene description text information. For example, in the transportation field, a scene description of the target image is provided, including but not limited to global illumination and weather descriptions, vehicle attribute descriptions, dynamic motion blur descriptions, and descriptions of the occlusion degree and occluder attributes. Examples include global descriptions such as "On a sunny evening, two large trucks are traveling on a desert highway," or "A white van with a blue cargo box occludes a white dump truck, with an occlusion ratio of 50%," which control vehicle type, color, occlusion ratio, weather, illumination, and scene. It should be noted that the text conditions are not limited to the above and can be expanded and supplemented. In addition to text (such as "dusk," "strong light and backlight," "damp and rainy"), physical lighting parameters such as solar altitude / azimuth angle, light intensity, and shadow softness can also be used to describe the lighting. This keeps the direction and intensity of light and shadow consistent across vehicles, occluders, and the background in the generated image, in conformity with physical laws. After the scene text description is obtained, it can be encoded into a scene text vector using the GLIP (Grounded Language-Image Pre-training) module.
[0024] S120: Input the several pieces of conditional information into the image generation model to guide the image generation model to generate a target image containing a preset target.
[0025] Image generation models are artificial intelligence models based on deep learning technology that can autonomously synthesize high-quality images that conform to the distribution of real-world images, starting from unstructured inputs (such as random noise) or structured constraint information. These models include generative adversarial networks and diffusion models.
[0026] In one embodiment, the conditional information further includes environmental spatial information and global conditional information. The global conditional information includes at least one of camera parameter information and scene description text information, and the environmental spatial information includes at least one of scene layout information and occlusion spatial information. The environmental spatial information and keypoint spatial semantic information are fused to obtain spatial conditional guidance information, and the global conditional information is fused to obtain global conditional guidance information. The spatial conditional guidance information and global conditional guidance information are input into an image generation model to generate the target image.
[0027] In one embodiment, the spatial information of each environment and the spatial semantic information of key points are spliced together in the channel dimension to obtain spatial condition guidance information.
[0028] In one embodiment, spatial conditional guidance information and global conditional guidance information are fused to obtain conditional fusion features. These features guide the image generation model to perform a reverse generation iterative process. When the reverse generation iterative process reaches a preset termination step, the image to be processed in the current iteration step is output as the target image. Each iteration step of the reverse generation iterative process guided by the conditional fusion features may include the following steps: obtaining a time-step embedding vector; inputting the image to be processed in the current iteration step, the conditional fusion features, and the time-step embedding vector into the processing unit of the image generation model; integrating the conditional fusion features during feature processing to achieve directional guidance through feature constraints; and iteratively updating the image to be processed in the current iteration step based on the guidance processing results of the processing unit to obtain the image to be processed in the next iteration step.
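The iteration structure described above can be sketched as a simple loop. This is only an illustration under stated assumptions: the sinusoidal time-step embedding is a commonly used encoding that the text does not specify, and `toy_processing_unit` is a hypothetical stand-in for the trained processing unit, included so the loop can run.

```python
import numpy as np

def timestep_embedding(t, dim=8):
    """Sinusoidal time-step embedding (an assumed, commonly used encoding)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def reverse_generate(processing_unit, x_start, cond_features, num_steps):
    """Run the guided reverse iteration from step num_steps down to step 1."""
    x = x_start
    for t in range(num_steps, 0, -1):
        t_emb = timestep_embedding(t)
        # The processing unit sees the current image, the conditional fusion
        # features and the time-step embedding, and returns the next image.
        x = processing_unit(x, cond_features, t_emb)
    return x  # output once the preset termination step is reached

def toy_processing_unit(x, cond_features, t_emb):
    """Hypothetical stand-in for the trained network: moves x halfway to the condition."""
    return x + 0.5 * (cond_features - x)

rng = np.random.default_rng(0)
x_T = rng.standard_normal((4, 4))   # image to be processed at the first iteration step
cond = np.ones((4, 4))              # placeholder conditional fusion features
result = reverse_generate(toy_processing_unit, x_T, cond, num_steps=20)
```

In a real system the processing unit would be the trained denoising network; the toy version only demonstrates that each step consumes the previous step's output together with the same conditional features.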
[0029] In one embodiment, the image generation model is a preset diffusion model. This model breaks down the complex image generation task into multiple simple denoising processes. First, a forward noise-adding process gradually transforms the real image into random noise. Then, by learning a reverse denoising process, it gradually recovers a high-quality image that conforms to the target distribution, starting from pure noise. In this embodiment, the preset diffusion model includes a forward process and a reverse process. The forward process, also known as the diffusion process, refers to the gradual addition of Gaussian noise to the data until the data becomes random noise. The reverse process is the denoising process used to generate the target image.
[0030] In one implementation, please refer to Figure 2, which is a schematic diagram of an embodiment of the diffusion model provided in this application. The diffusion model, for example one built on U-Net (a U-shaped network, a deep learning model based on an encoder-decoder structure with skip connections), achieves high-precision image synthesis through gradual denoising. As shown in Figure 2, the noise map on the left side is the starting point of the diffusion model's reverse denoising process, i.e., pure random noise. The ε module predicts the noise components in the current noisy image. The QKV module, which is the feature processing unit, is the Query-Key-Value attention module of the Transformer architecture. It fuses text vectors, preset keypoint semantic information, occlusion binary images, and other conditional constraint information. Through the attention mechanism, it aligns the noise-processed features with the constraint features of the target to be generated, achieving directional guidance. D represents the decoding module, which gradually restores the attention-processed features into a high-resolution, clear image and finally outputs the target image that meets the conditional constraints. In summary, the denoising process starts from random noise, first predicts and processes the noise, then fuses the constraint information through the attention module, and finally decodes to obtain the target image that meets the requirements.
[0031] Please see Figure 3, which is a schematic diagram of the diffusion process provided in this application. $X_0$ to $X_T$ is the forward process of progressively adding noise, in which the noise is known; it gradually adds noise to the original image until it becomes pure noise. $X_T$ to $X_0$ is the process of restoring random noise to the input, which requires learning a denoising process that runs until an image is restored.
[0032] Please see Figure 4, which is a schematic diagram of the diffusion process calculation using the diffusion model provided in this application. For original data $X_0 \sim q(X_0)$ and a diffusion process of $T$ steps in total, the image $X_t$ in the forward process depends only on the image $X_{t-1}$ at the previous moment, so the process can be viewed as a Markov process satisfying formulas (1) and (2) below. Here $q(X_{1:T} \mid X_0)$ denotes the probability distribution of the noisy images obtained from the original image $X_0$ through the $T$-step noise addition process, and $q(X_t \mid X_{t-1})$ denotes the conditional probability distribution of the current image $X_t$ at step $t$ given the image $X_{t-1}$ from the previous step. This conditional distribution is a Gaussian distribution with mean $\sqrt{1-\beta_t}\,X_{t-1}$ and variance $\beta_t I$, where $\beta_t$ is a preset noise variance coefficient and $I$ is the identity matrix. For different $t$, $\beta_t$ is predefined; it gradually increases over time steps $1$ to $T$ and takes values between 0 and 1.
[0033]
$$q(X_{1:T} \mid X_0) = \prod_{t=1}^{T} q(X_t \mid X_{t-1}) \qquad (1)$$
$$q(X_t \mid X_{t-1}) = \mathcal{N}\!\left(X_t;\ \sqrt{1-\beta_t}\,X_{t-1},\ \beta_t I\right) \qquad (2)$$
The reverse process is a denoising process. For example, once the distribution $p_\theta(X_{t-1} \mid X_t)$ of each iteration step of the reverse process is obtained, a target image can be gradually reconstructed from random noise $X_T$.
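The forward noise-adding process can be illustrated numerically. The sketch below assumes the standard DDPM parameterisation, in which each step draws $X_t$ from a Gaussian with mean $\sqrt{1-\beta_t}\,X_{t-1}$ and variance $\beta_t I$; the linear schedule and its endpoint values are assumptions for the example, not values taken from this application.

```python
import numpy as np

def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Predefined noise variance coefficients, increasing with t and lying in (0, 1)."""
    return np.linspace(beta_start, beta_end, num_steps)

def forward_step(x_prev, beta_t, rng):
    """Sample X_t given X_{t-1}: scale the previous image and add Gaussian noise."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

def forward_process(x0, betas, rng):
    """Markov chain X_0 -> X_1 -> ... -> X_T; each step depends only on the previous one."""
    trajectory = [x0]
    for beta_t in betas:
        trajectory.append(forward_step(trajectory[-1], beta_t, rng))
    return trajectory

rng = np.random.default_rng(0)
x0 = np.ones((64, 64))                 # toy "original image"
betas = linear_beta_schedule(1000)     # T = 1000 steps
trajectory = forward_process(x0, betas, rng)
```

With this schedule the final sample `trajectory[-1]` is statistically close to standard Gaussian noise, which matches the description of the forward process ending in random noise.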
[0034] In one embodiment, if the target image contains multiple preset targets and there is an occlusion relationship between them, a two-stage generation method can be used to generate the target image. Specifically, the several pieces of conditional information are input into the image generation model to guide it to generate multiple first images, each first image corresponding to a single preset target. The image generation model performs a forward noise addition process on each first image to obtain a corresponding second image. Based on the environmental spatial information of the target image, the image generation model is guided to perform a reverse generation iterative process on each second image. Based on the image generation results of the reverse generation iterative process of the multiple second images, the target image containing the multiple preset targets is determined.
[0035] In one embodiment, to further detect the quality of the generated target image, key point consistency detection can be performed on the target image after it is generated. In one embodiment, key points are predicted for a preset target in the target image to obtain a number of predicted key points. The consistency score between the predicted key points and each preset key point is calculated, and preset key points with consistency scores lower than a preset threshold are removed.
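The consistency check can be sketched as follows. The text does not define the consistency score, so this example assumes a Gaussian function of the pixel distance between each predicted key point and the preset key point with the same index; the function names, `sigma`, and the threshold value are likewise hypothetical.

```python
import numpy as np

def consistency_scores(predicted, preset, sigma=5.0):
    """Assumed score in (0, 1]: 1 when the points coincide, decaying with distance."""
    predicted = np.asarray(predicted, dtype=float)
    preset = np.asarray(preset, dtype=float)
    sq_dist = np.sum((predicted - preset) ** 2, axis=1)
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def filter_keypoints(predicted, preset, threshold=0.5, sigma=5.0):
    """Keep preset key points whose score reaches the threshold; drop the rest."""
    scores = consistency_scores(predicted, preset, sigma)
    keep = scores >= threshold
    return np.asarray(preset)[keep], scores

preset = [(10, 10), (40, 10), (40, 30)]       # preset key points (x, y)
predicted = [(11, 10), (41, 12), (60, 30)]    # key points predicted from the generated image
kept, scores = filter_keypoints(predicted, preset)
```

In this toy case the third predicted point lies 20 pixels from its preset position, so its score falls below the threshold and that key point is removed.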
[0036] To better illustrate the image generation method provided in this application, this embodiment takes the transportation field as an example. In the prior art, the viewpoint and occlusion control of general image generation models are weak. For example, models such as Stable Diffusion can generate vehicle images through text prompts, but they cannot accurately control the shooting viewpoint of the generated image. The generated image is often a "level-on" or "aesthetically pleasing" viewpoint, rather than a real, perspective-distorted side-mounted fixed monitoring viewpoint. At the same time, the control over the precise shape, position and type of occluders is also insufficient.
[0037] Please see Figure 5, which is a schematic diagram of the system framework of the image generation method provided in this application. This embodiment integrates conditional information such as the scene layout map, occlusion binary map, key point spatial semantic map, side-mounted camera parameters, and scene description, and uses a U-Net denoising diffusion model to generate highly realistic static vehicle images with correct viewpoints and reasonable occlusion. The output images can also be used as training data in this application scenario. In this way, an image diffusion model controlled by four conditions, "viewpoint-space-layout-occlusion", is constructed. The key point spatial semantic map is used as a strong spatial condition, which, together with occlusion information, camera parameters and other conditions, guides the generation process of the diffusion model.
[0038] To better illustrate the image generation method provided in this application, a vehicle occlusion image generation scenario is taken as an example. Please refer to Figure 6, which is a schematic flowchart of a specific embodiment of the image generation method provided in this application. It should be noted that, provided substantially the same result is obtained, this embodiment is not limited to the process sequence shown in Figure 6.
[0039] S201: Definition and encoding of multi-condition input.
[0040] Taking the generation of a target vehicle image from the perspective of a side-mounted camera as an example, several conditional information for generating the target vehicle image are defined and encoded. The side-mounted camera parameters are used as the control conditions of the core perspective. The intrinsic and extrinsic parameters of the side-mounted camera are obtained, and the side-mounted camera parameters are mapped into a conditional vector `C_cam` through a multilayer perceptron.
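As a rough sketch of this encoding step, the camera parameters can be flattened and passed through a small multilayer perceptron. The layout of the input vector (a 3x3 intrinsic matrix plus a 3x4 extrinsic matrix), the layer sizes, the example camera values, and the randomly initialised weights are all assumptions for illustration; in practice the weights would be learned and the inputs would typically be normalised.

```python
import numpy as np

def camera_to_vector(K, R, t):
    """Flatten intrinsics K (3x3) and extrinsics [R | t] (3x4) into a 21-dim vector."""
    extrinsics = np.hstack([R, t.reshape(3, 1)])
    return np.concatenate([K.ravel(), extrinsics.ravel()])

def mlp(x, W1, b1, W2, b2):
    """Two-layer perceptron with a ReLU hidden layer."""
    hidden = np.maximum(0.0, x @ W1 + b1)
    return hidden @ W2 + b2

rng = np.random.default_rng(0)

# Hypothetical side-mounted camera: focal length 1000 px, principal point (640, 360).
K = np.array([[1000.0, 0.0, 640.0],
              [0.0, 1000.0, 360.0],
              [0.0, 0.0, 1.0]])
R = np.eye(3)                      # rotation (identity for the toy example)
t = np.array([0.0, -5.0, 10.0])    # translation

raw = camera_to_vector(K, R, t)

# Untrained placeholder weights standing in for the learned perceptron.
W1, b1 = rng.standard_normal((21, 32)) * 0.1, np.zeros(32)
W2, b2 = rng.standard_normal((32, 16)) * 0.1, np.zeros(16)
C_cam = mlp(raw, W1, b1, W2, b2)   # conditional vector for viewpoint control
```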
[0041] In this embodiment, the target is a vehicle. Before obtaining the keypoint spatial semantic map, keypoints are defined. For example, a well-defined set of vehicle component-level keypoints is used, typically including but not limited to: wheel center points (4, core key points determining vehicle attitude and steering angle), corner points (8, extreme boundary points of the front hood / rear bumper, left / right, and top / bottom), headlight corner points (4, 2 each for front and rear), door handle points (4, left / right and front / rear doors), window angles (8, front / rear, left / right, and top / bottom boundaries), and the highest point of the roof (1), for a total of 29 keypoints. The keypoint spatial semantic map includes multiple keypoint heatmaps and a vehicle skeleton map, which can clearly encode the vehicle's precise attitude, angles, and component topology relationships. For any training image or a user-input condition, an "N+1" channel semantic map of the same size as the target output image is generated, where N is the number of keypoints, such as the 29 keypoints defined in this embodiment. In the first "N" channels, each channel corresponds to a key point. A 2D Gaussian circular spot (peak value = 1) is drawn at the 2D pixel position of the key point to form a key point heatmap. The last channel is the vehicle skeleton map, which connects all the key points connected according to the vehicle structure (such as adjacent wheels and corners) with lines to form a rough skeleton of the vehicle.
[0042] Provide a semantic segmentation map to define static background elements such as lanes, sidewalks, sky, buildings, and traffic signs in an image. Encode the scene layout map into a scene feature map `F_layout` using a multilayer perceptron.
[0043] Provide a binary semantic map of the same size as the target output image. White areas (value 1) represent the locations where occlusions should appear, and black areas (value 0) represent the areas where vehicles and background should appear. Map the occlusion binary semantic map to a feature map `F_mask` through another multilayer perceptron.
[0044] Provide a text description of the scene, including but not limited to global illumination and weather descriptions, vehicle attribute text descriptions, dynamic motion blur descriptions, occlusion degree and occlusion object attribute descriptions, and encode the text descriptions into a text vector `C_text` through the GLIP module.
[0045] S202: Training of a condition-guided image diffusion model.
[0046] After the scene layout map, occlusion binary map, keypoint spatial semantic map, side-mounted camera parameters, scene description, and other conditional information are obtained, please refer to Figure 7, which is a schematic diagram of the architecture of an attention mechanism that integrates multiple input conditions, as provided in this application.
[0047] The camera parameter vector `C_cam` and the text vector `C_text` are fused to obtain the global condition `C_global`. The scene layout feature map `F_layout` and the occlusion binary map `F_mask` are concatenated in the channel dimension. At the same time, the vehicle key point spatial semantic map is used as an additional spatial condition input and concatenated in the channel dimension to jointly form rich spatial guidance information and obtain the spatial condition feature map `F_spatial`.
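The shape bookkeeping of this fusion step can be sketched with NumPy arrays. The channel counts and vector lengths are placeholders, the arrays are random stand-ins for the encoded conditions, and fusing `C_cam` with `C_text` by concatenation is an assumption, since the text only states that the two are fused.

```python
import numpy as np

rng = np.random.default_rng(0)
H, W = 32, 32

# Random stand-ins for the encoded conditions (sizes are illustrative only).
C_cam = rng.standard_normal(16)               # camera parameter vector
C_text = rng.standard_normal(64)              # scene text vector
F_layout = rng.standard_normal((8, H, W))     # scene layout feature map
F_mask = rng.standard_normal((4, H, W))       # occlusion feature map
keypoint_map = rng.random((30, H, W))         # N + 1 = 30 channels for 29 key points

# Global condition: here simply the concatenation of the two vectors (assumed).
C_global = np.concatenate([C_cam, C_text])

# Spatial condition: concatenate all spatial maps along the channel dimension.
F_spatial = np.concatenate([F_layout, F_mask, keypoint_map], axis=0)
```

Channel-wise concatenation requires every spatial map to share the same height and width, which is why the semantic and occlusion maps are defined at the size of the target image.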
[0048] The noisy image `z_t` and the spatial condition feature map `F_spatial` are concatenated and input into the U-Net backbone. The global condition `C_global` and the time step embedding `T` are injected into the bottleneck layer of the U-Net through a cross-attention mechanism. The spatial condition features and global condition features are fused by the cross-attention module into a condition fusion feature that constrains local and global features simultaneously. The noisy image of the current step, `X_t` (that is, `z_t`), is input into the U-Net, while the condition features (including the fused features and the original encoded features) are injected into the U-Net's multi-scale feature processing layers, so that the U-Net adheres to the condition constraints during denoising. The output of the U-Net is the image after single-step denoising, `X_{t-1}`. This is one iterative step in the diffusion model's gradual generation of a clear image from noise.
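As an illustration of the cross-attention injection described above, the following is a minimal single-head sketch in NumPy. The token counts, feature widths, and randomly initialized projection matrices `Wq`, `Wk`, `Wv` are hypothetical; the application does not specify the attention configuration, so this shows only the general mechanism in which spatial features act as queries over global condition tokens.

```python
import numpy as np

def cross_attention(queries, context, Wq, Wk, Wv):
    """Single-head cross-attention: query tokens attend over context tokens."""
    q = queries @ Wq                                # (Lq, d)
    k = context @ Wk                                # (Lc, d)
    v = context @ Wv                                # (Lc, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])         # scaled dot-product scores
    scores -= scores.max(axis=-1, keepdims=True)    # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over context tokens
    return weights @ v, weights

rng = np.random.default_rng(0)
d_feat, d_ctx, d = 32, 80, 16
# Hypothetical sizes: a 4x4 bottleneck flattened to 16 tokens, and 2 context
# tokens standing in for C_global and the time step embedding T.
bottleneck = rng.standard_normal((16, d_feat))
context = rng.standard_normal((2, d_ctx))
Wq = rng.standard_normal((d_feat, d)) * 0.1
Wk = rng.standard_normal((d_ctx, d)) * 0.1
Wv = rng.standard_normal((d_ctx, d)) * 0.1

fused, attn = cross_attention(bottleneck, context, Wq, Wk, Wv)
```

Every spatial position receives a weighted mix of the global condition tokens, which is one way a global signal such as lighting or camera pose can influence each local region.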
[0049] During training, by training the image fusion module, the image generation model not only learns to generate vehicle textures, but more importantly, it learns the strong correlation between vehicle keypoints and vehicle appearance pixels. When the occlusion binary map covers some parts of the vehicle, the keypoint spatial semantic map can provide the model with geometric priors for the occluded parts. For example, even if the front wheels are completely occluded, as long as the position of the front wheels is marked in the keypoint spatial semantic map, the model can reasonably generate the body structure of the area where the occluded front wheels are located based on the position of the rear wheels and the vehicle frame, and infer the correct steering angle. This can greatly improve the structural rationality of the generated vehicle in complex scenes, especially the generation of steering angles, the reproduction of the vehicle after short-term occlusion, and the preservation of vehicle component-level integrity.
[0050] S203: Sequential generation strategy.
[0051] For generating target images with extremely complex occlusion conditions, a two-stage generation method can be adopted to improve the accuracy and quality of image generation. The first stage generates a clear image without occlusion, conforming to all vehicle and scene conditions; that is, for each vehicle target, an image without any occlusion is generated. In the second stage, noise is added to the image generated in the first stage. Based on this noisy image, combined with occlusion spatial information such as an occlusion binary map, it is input into a trained image generation model, outputting a denoised image. Finally, the target image is synthesized from multiple images. The sequential generation strategy can better decouple the generation process of vehicles and occlusions, thereby effectively improving the detail quality of the generated target image.
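The noise addition that opens the second stage can be sketched with the standard closed-form forward diffusion step. The linear beta schedule, the number of steps, and the starting step `t_start` are hypothetical values chosen for illustration; the application does not state which schedule or noise level is used.

```python
import numpy as np

def add_noise(x0, t, alpha_bar, rng):
    """Forward diffusion: x_t = sqrt(a_t) * x0 + sqrt(1 - a_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    a_t = alpha_bar[t]
    return np.sqrt(a_t) * x0 + np.sqrt(1.0 - a_t) * eps, eps

# Linear beta schedule with hypothetical values (not given in the source).
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
first_stage_image = rng.random((3, 32, 32))  # stand-in for an unoccluded stage-one image
t_start = 400                                # hypothetical partial-noise level
second_stage_input, eps = add_noise(first_stage_image, t_start, alpha_bar, rng)
```

Choosing an intermediate `t_start` rather than the final step keeps part of the first-stage structure in the noisy image, so the second-stage denoising, guided by the occlusion binary map, can add occluders without regenerating the vehicle from scratch.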
[0052] S204: Post-processing based on key point consistency.
[0053] After generating the target image, a lightweight vehicle keypoint detection model can be run to predict keypoints for vehicle targets within the generated image. The predicted keypoints are compared with the input conditional keypoints to calculate a consistency score. Images with a consistency score below a preset threshold can be either directly identified as generation failures and discarded, or used as samples for iterative model adjustments, ultimately achieving effective control over the image output quality.
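One possible form of the consistency check is sketched below. The application does not define how the consistency score is computed, so this sketch assumes it is the fraction of predicted keypoints that fall within a pixel tolerance of the input condition keypoints; the names `consistency_score` and `keep_image`, the tolerance, and the threshold are all hypothetical.

```python
import math

def consistency_score(predicted, expected, tolerance_px=5.0):
    """Fraction of keypoints predicted within tolerance_px of the input condition."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("keypoint lists must be non-empty and of equal length")
    hits = sum(
        1
        for (px, py), (ex, ey) in zip(predicted, expected)
        if math.hypot(px - ex, py - ey) <= tolerance_px
    )
    return hits / len(expected)

def keep_image(predicted, expected, threshold=0.8, tolerance_px=5.0):
    """Return True when the generated image passes the consistency check."""
    return consistency_score(predicted, expected, tolerance_px) >= threshold

expected = [(10, 10), (50, 10), (50, 40), (10, 40)]   # input condition keypoints
predicted = [(11, 9), (49, 12), (70, 40), (10, 41)]   # detector output on the generated image
score = consistency_score(predicted, expected)
```

In this toy case three of the four keypoints match, so the score is 0.75 and the image falls below the hypothetical 0.8 threshold; it would then be discarded or kept as a sample for model adjustment, as described above.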
[0054] Refer to Figure 8, which is a schematic diagram of an embodiment of the electronic device provided in this application. The electronic device 60 includes a memory 61 and a processor 62 that are interconnected. The memory 61 is used to store a computer program. When the computer program is executed by the processor 62, it implements the image generation method in the above embodiment.
[0055] The methods described in the above embodiments can exist in the form of a computer program; therefore, this application proposes a computer-readable storage medium. Refer to Figure 9, which is a schematic diagram of an embodiment of a computer-readable storage medium provided in this application. The computer-readable storage medium 80 is used to store a computer program 81, which can be executed to implement the image generation method in the above embodiment.
[0056] The computer-readable storage medium 80 can be any medium capable of storing program code, such as a server, USB flash drive, external hard drive, read-only memory (ROM), random access memory (RAM), magnetic disk, or optical disk.
[0057] The above description is merely an embodiment of this application and does not limit the patent scope of this application. Any equivalent structural or procedural transformations made using the content of this application's specification and drawings, or direct or indirect applications in other related technical fields, are similarly included within the patent protection scope of this application.
Claims
1. An image generation method, characterized in that the method includes: acquiring several items of conditional information for generating a target image, wherein the conditional information includes key point spatial semantic information of a preset target, the key point spatial semantic information includes distribution information of several preset key points on the preset target and structural information of the preset target, and the structural information represents the connection relationship between at least two preset key points; and inputting the conditional information into an image generation model to guide the image generation model to generate a target image containing the preset target.
2. The method according to claim 1, characterized in that, The key point spatial semantic information is a multi-channel semantic map, which includes a key point heatmap and a target structure map corresponding to different channels. The key point heatmap represents the distribution information of the several preset key points, and the target structure map represents the structural information of the preset target. The key point heatmap includes sub-heatmaps corresponding to each preset key point, and each sub-heatmap corresponds to a different channel.
3. The method according to claim 2, characterized in that acquiring the conditional information for generating the target image includes: for each preset key point of the preset target, drawing a Gaussian circular spot at the pixel position of the preset key point to generate the corresponding sub-heatmap; connecting the preset key points sequentially with lines according to the structure of the preset target to generate the target structure map; and concatenating the sub-heatmaps corresponding to the preset key points and the target structure map along the channel dimension to construct the multi-channel semantic map.
4. The method according to claim 1, characterized in that, The aforementioned conditional information further includes at least one of global conditional information and environmental spatial information. The global conditional information includes at least one of camera parameter information and scene description text information. The environmental spatial information includes at least one of scene layout information and occlusion spatial information. The scene layout information is used to define the layout of at least one static background element in the target image. The occlusion spatial information represents the position information of occluders that occlude the preset target.
5. The method according to claim 4, characterized in that, The process of obtaining several conditional information for generating the target image includes at least one of the following steps: Obtain the scene semantic segmentation map corresponding to the target image, and encode the scene semantic segmentation map to obtain a scene feature map, which is used as the scene layout information; Obtain the occlusion binary map corresponding to the target image, and map the occlusion binary map into an occlusion feature map as the occlusion spatial information. The occlusion binary map includes a first image region where the occluder is located and a second image region where the preset target and background are located. The first image region and the second image region correspond to different pixel values. Obtain the camera parameters of the target camera, and map the camera parameters into a camera parameter vector to serve as the camera parameter information; Obtain the scene text description of the target scene, and encode the scene text description into a scene text vector to serve as the scene description text information.
6. The method according to claim 4, characterized in that the step of inputting the conditional information into the image generation model to guide the image generation model to generate a target image containing the preset target includes: fusing the environmental spatial information and the key point spatial semantic information to obtain spatial condition guidance information, and fusing the global conditional information to obtain global condition guidance information; and inputting the spatial condition guidance information and the global condition guidance information into the image generation model to generate the target image.
7. The method according to claim 6, characterized in that the step of fusing the environmental spatial information and the key point spatial semantic information to obtain spatial condition guidance information includes: concatenating the environmental spatial information and the key point spatial semantic information in the channel dimension to obtain the spatial condition guidance information; and / or, the step of inputting the spatial condition guidance information and the global condition guidance information into the image generation model to generate the target image includes: fusing the spatial condition guidance information and the global condition guidance information to obtain a conditional fusion feature; using the conditional fusion feature to guide the image generation model to perform a reverse generation iterative process; and, when the reverse generation iterative process reaches a preset termination step, outputting the image to be processed in the current iteration step as the target image.
8. The method according to claim 7, characterized in that, The operation of guiding the image generation model to perform each iteration step of the reverse generation iterative process using the conditional fusion features includes: Obtain the time step embedding vector; The image to be processed in the current iteration step, the conditional fusion feature, and the time step embedding vector are input to the processing unit of the image generation model. The processing unit integrates the conditional fusion feature during feature processing to achieve directional guidance through feature constraints. Based on the guidance processing results of the processing unit, the image to be processed in the current iteration step is iteratively updated to obtain the image to be processed in the next iteration step.
9. The method according to claim 1, characterized in that, The image generation model is a preset diffusion model; And / or, the target image contains multiple preset targets, and there is an occlusion relationship between the preset targets. The step of inputting the several conditional information into the image generation model to guide the image generation model to generate a target image containing the preset targets includes: The aforementioned conditional information is input into the image generation model to guide the image generation model to generate multiple first images, each of which corresponds to a single preset target. The image generation model is used to perform a forward noise addition process on each of the first images to obtain the corresponding second image; The image generation model is guided by the environmental spatial information of the target image to perform a reverse generation iteration process on each of the second images. The environmental spatial information includes at least one of scene layout information and occlusion space information. The scene layout information is used to define the layout of at least one static background element in the target image. The occlusion space information represents the position information of the occluder that occludes the preset target. Based on the image generation results of the reverse generation iteration process of multiple second images, a target image containing the multiple preset targets is determined.
10. The method according to claim 1, characterized in that, After inputting the aforementioned conditional information into the image generation model and guiding the image generation model to generate a target image containing the preset target, the process includes: Key point prediction is performed on a preset target in the target image to obtain several predicted key points; Calculate the consistency score between the predicted key points and each of the preset key points; Remove the preset key points whose consistency scores are lower than a preset threshold.
11. The method according to claim 1, characterized in that the method is applied to the generation of vehicle occlusion images; and / or, the preset target includes a vehicle; and / or, the target image includes a vehicle occlusion image.
12. An electronic device, characterized in that, The electronic device includes a processor and a memory, the processor being coupled to the memory, the processor being configured to perform one or more steps of the image generation method according to any one of claims 1 to 11 based on instructions stored in the memory.
13. A computer-readable storage medium, characterized in that, The computer-readable storage medium stores a computer program that is executed by a processor to implement the steps of the image generation method as described in any one of claims 1 to 11.