Image generation method and apparatus, and electronic device

By dynamically generating mask images in a latent diffusion model and performing feature fusion processing, the problem of users needing to provide accurate mask images is solved, enabling efficient and simplified local image editing, suitable for online tools and resource-constrained devices.

WO2026138299A1 · PCT designated stage · Publication Date: 2026-07-02 · NETEASE (HANGZHOU) NETWORK CO LTD

Patent Information

Authority / Receiving Office
WO · WO
Patent Type
Applications
Current Assignee / Owner
NETEASE (HANGZHOU) NETWORK CO LTD
Filing Date
2025-11-24
Publication Date
2026-07-02


Abstract

An image generation method and apparatus, and an electronic device. The method comprises: in response to a position selection instruction for an initial image, determining a target position on the initial image (S102); on the basis of the target position, generating a mask image corresponding to the initial image, the mask image being used for indicating a first region to be edited in the initial image (S104); on the basis of the mask image, performing multi-time-step feature fusion and feature denoising processing on image features of the initial image and text features of a target text, until a target image is generated, wherein a second region in the target image comprises image content corresponding to the target text, regions outside the second region comprise image content corresponding to the initial image, and the second region corresponds to the first region (S106). The method improves image editing efficiency while ensuring a good editing effect.

Description

Image generation methods, apparatus and electronic devices

[0001] Cross-reference to related applications

[0002] This application claims priority to Chinese Patent Application No. 202411944097.1, filed on December 26, 2024, entitled “Image Generation Method, Apparatus and Electronic Device”, the entire contents of which are incorporated herein by reference.

Technical Field

[0003] This disclosure relates to the field of image processing technology, and in particular to an image generation method, apparatus and electronic device.

Background

[0004] Latent diffusion models can perform local editing of images; that is, they can modify the image content within a local area of the image. In related technologies, for the latent diffusion model to know which local area needs to be modified, the user must provide a mask image indicating that area, and the mask image must be accurate for the local editing result output by the latent diffusion model to meet expectations. However, creating an accurate mask image is cumbersome and time-consuming, which reduces image editing efficiency.

Summary of the Invention

[0005] In view of this, the purpose of this disclosure is to provide an image generation method, apparatus and electronic device to improve image editing efficiency while ensuring good editing results.

[0006] In a first aspect, embodiments of this disclosure provide an image generation method, the method comprising: responding to a position selection instruction for an initial image, determining a target position on the initial image; generating a mask image corresponding to the initial image based on the target position; wherein the mask image is used to: indicate a first region to be edited in the initial image; and performing multi-time-step feature fusion and feature denoising processing on image features of the initial image and text features of the target text based on the mask image, until a target image is generated; wherein a second region in the target image contains image content corresponding to the target text, and regions outside the second region contain image content corresponding to the initial image; the second region corresponds to the first region.

[0007] In a second aspect, embodiments of this disclosure provide an image generation apparatus, comprising: a position determination module configured to determine, in response to a position selection instruction for an initial image, a target position on the initial image; a mask generation module configured to generate a mask image corresponding to the initial image based on the target position; wherein the mask image is used to indicate a first region to be edited in the initial image; and an image generation module configured to perform, based on the mask image, multi-time-step feature fusion and feature denoising processing on image features of the initial image and text features of the target text, until a target image is generated; wherein a second region in the target image contains image content corresponding to the target text, and regions outside the second region contain image content corresponding to the initial image; the second region corresponds to the first region.

[0008] In a third aspect, embodiments of this disclosure provide an electronic device, including a processor and a memory, wherein the memory stores computer-executable instructions that can be executed by the processor, and the processor executes the computer-executable instructions to implement the above-described image generation method.

[0009] In a fourth aspect, embodiments of this disclosure provide a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are invoked and executed by a processor, they cause the processor to implement the above-described image generation method.

[0010] The embodiments disclosed herein bring the following beneficial effects:

[0011] The aforementioned image generation method, apparatus, and electronic device, in response to a position selection instruction for an initial image, determine a target position on the initial image; based on the target position, generate a mask image corresponding to the initial image; wherein the mask image is used to: indicate a first region to be edited in the initial image; based on the mask image, perform multi-time-step feature fusion and feature denoising processing on the image features of the initial image and the text features of the target text until a target image is generated; wherein a second region in the target image contains the image content corresponding to the target text, and the region outside the second region contains the image content corresponding to the initial image; the second region corresponds to the first region.

[0012] In this method, the user only needs to provide an image location, and a mask image can be generated based on that location. This mask image indicates the area to be edited, allowing the user to edit the image content indicated by the target text within that area. This method improves image editing efficiency while ensuring good editing results.

[0013] Other features and advantages of this disclosure will be set forth in the following description and will be apparent in part from the description or may be learned by practicing the disclosure. The objects and other advantages of this disclosure are realized and obtained through the structures particularly pointed out in the description, claims and drawings.

[0014] To make the above-mentioned objects, features and advantages of this disclosure more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Brief Description of the Drawings

[0015] To more clearly illustrate the technical solutions in the specific embodiments or related technologies of this disclosure, the accompanying drawings used in the description of the specific embodiments or related technologies will be briefly introduced below. Obviously, the accompanying drawings described below are some embodiments of this disclosure. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0016] Figure 1 is a flowchart of an image generation method provided in an embodiment of this disclosure;

[0017] Figure 2 is a schematic diagram of an image generation process provided in an embodiment of this disclosure;

[0018] Figure 3 is an example diagram of target image generation provided in an embodiment of this disclosure;

[0019] Figure 4 is an example diagram of another target image generation provided in an embodiment of this disclosure;

[0020] Figure 5 is an example diagram of yet another target image generation provided in an embodiment of this disclosure;

[0021] Figure 6 is a schematic diagram of an image generation apparatus provided in an embodiment of this disclosure;

[0022] Figure 7 is a schematic diagram of an electronic device provided in an embodiment of this disclosure.

Detailed Description

[0023] To make the objectives, technical solutions, and advantages of the embodiments of this disclosure clearer, the technical solutions of this disclosure will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this disclosure, and not all embodiments. Based on the embodiments of this disclosure, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of this disclosure.

[0024] To facilitate understanding, the terms used in the embodiments of this disclosure will be explained first.

[0025] 1. Latent Diffusion Models (LDMs): These are deep learning models used to generate images and other data types. LDMs modify input data by gradually adding noise and then learn how to recover the data from the noise. They can be used to generate new images or modify existing ones.

[0026] 2. Mask: In image processing, a mask image is usually a black and white image used to indicate which parts need to be processed or edited. White parts usually represent the parts to be edited, while black parts represent the parts that remain unchanged.

[0027] 3. Semantic loss: Semantic loss is a method to measure the semantic differences between images. It is usually used to supervise the learning process of generative models and can also be used to guide models to generate appropriate image content based on given text descriptions.

[0028] 4. Gradient update: Gradient update refers to the process of adjusting parameters based on the gradient of the loss function with respect to the model parameters during training. It can also be used to adjust the mask image to better match the desired content changes.

[0029] 5. CLIP: Contrastive Language-Image Pre-training; a deep learning model used to solve cross-modal tasks, such as associating text and images. CLIP's main function is to understand the relationship between text and images and to generate or retrieve relevant image content given a text description.

[0030] 6. Alpha-CLIP: This is an enhanced model based on the aforementioned CLIP model. By introducing an additional alpha channel, the model can focus on the user-specified region without altering the image content. Alpha-CLIP can be used to evaluate image editing results. While the aforementioned CLIP model primarily evaluates the similarity between the generated image and the text description, the Alpha-CLIP model focuses mainly on evaluating the edited region.

[0031] The Alpha-CLIP workflow is as follows: First, it extracts the editable regions from the generated image, allowing for individual evaluation of these regions. Then, it calculates similarity, specifically the similarity between the editable region and the editing instructions, or in other words, the similarity between the masked editable region and the descriptive text. Alpha-CLIP quantifies whether the editable region conforms to the user-provided editing instructions, thus offering an objective measure of editing quality.

[0032] In related technologies, when performing local editing on an image, the user needs to provide a precise mask image to limit the local editing area.

[0033] For example, in the Blended Diffusion model, a user-provided mask image is used to mix with text-guided noise during the denoising process to generate an image. However, user-provided mask images in these methods have significant limitations because the success of editing is highly dependent on the precise shape of the mask image, and creating an accurate mask is often tedious and time-consuming for the user.

[0034] Another approach is to describe the target area using text. However, users may find it difficult to describe the location accurately in text, and the model may also fail to accurately understand the area described in the text, resulting in poor final editing results.

[0035] In addition, in related technologies, users must either provide mask images or describe areas through text, which places a heavy input burden on users and limits the flexibility of image editing.

[0036] Based on this, the present disclosure provides an image generation method, apparatus, and electronic device that can be applied to generate various types of images.

[0037] Referring to Figure 1, an image generation method is shown, which includes the following steps:

[0038] Step S102: In response to the position selection instruction for the initial image, determine the target position on the initial image;

[0039] Step S104: Based on the target location, generate a mask image corresponding to the initial image; wherein, the mask image is used to: indicate the first region to be edited in the initial image;

[0040] The location selection instruction can be generated by the user performing a location selection operation through a terminal device, such as by performing a location selection operation through a mouse or touch. The location selection operation can be a click or swipe operation applied to the initial image.

[0041] Specifically, when the location selection operation is a click operation, the click position can be used as the target position, which can be understood as a location point; when the location selection operation is a swipe operation, the swipe path can be used as the target position, which includes multiple location points.

[0042] Once the target location is determined, a new image can be generated on top of the initial image, with the same dimensions as the initial image; this new image can also be a new layer generated based on the initial image; the target location can be recorded through the new image, specifically the location coordinates of the target location.

[0043] The target location is used to indicate the location of the first region in the initial image; in actual implementation, the first region can be formed by expanding outwards from the target location. The first region is recorded using a mask image; in the mask image, the pixel values in the first region differ from the pixel values outside the first region. For example, the pixel value in the first region is 1, while the pixel value outside the first region is 0.

[0044] Step S106: Based on the mask image, perform multi-time-step feature fusion and feature denoising on the image features of the initial image and the text features of the target text until the target image is generated; wherein, the second region in the target image contains the image content corresponding to the target text, and the region outside the second region contains the image content corresponding to the initial image; the second region corresponds to the first region.

[0045] The location of the second region in the target image corresponds to the location of the first region indicated in the mask image in the initial image.

[0046] Encoding the initial image using an image encoder yields its image features, while encoding the target text using a text encoder yields its text features. The target text is input by the user and indicates the content to be edited in the first region; for example, the target text could be "a skateboard" or "a calf."

[0047] Feature fusion and feature denoising can be achieved using latent diffusion models, such as the Stable Diffusion model and Denoising Diffusion Implicit Models. Latent diffusion models require iterative processing of features across multiple time steps. After generating the final features, these features are decoded to obtain the target image. This target image can include multiple images, from which the user can select the one that best meets their needs.

[0048] Understandably, the target image is obtained by editing the initial image. The user determines the target location in the initial image and enters the target text. The image content indicated by the target text is therefore edited into the first area corresponding to the target location, resulting in the edited image content in the second area of the target image. For example, if the target text is "a calf", the image content in the first area corresponding to the user-selected target location is modified based on the initial image, and "a calf" is generated in the second area of the final target image. At the same time, the image content outside the second area retains the image content outside the first area of the initial image, thus achieving local image editing.

[0049] The image generation method described above, in response to a position selection instruction for an initial image, determines a target position on the initial image; based on the target position, generates a mask image corresponding to the initial image; wherein the mask image is used to: indicate a first region to be edited in the initial image; based on the mask image, performs multi-time-step feature fusion and feature denoising processing on the image features of the initial image and the text features of the target text until a target image is generated; wherein a second region in the target image contains the image content corresponding to the target text, and the regions outside the second region contain the image content corresponding to the initial image; the second region corresponds to the first region.

[0050] In this method, the user only needs to provide an image location, and a mask image can be generated based on that location. This mask image indicates the area to be edited, allowing the user to edit the image content indicated by the target text within that area. This method improves image editing efficiency while ensuring good editing results.

[0051] In one specific implementation, at a specified time step among the multiple time steps, the first region in the mask image is adjusted based on the similarity between the intermediate image corresponding to the previous time step and the target text; wherein the intermediate image is generated from the features obtained after feature fusion and denoising at the previous time step.

[0052] In this embodiment, the user provides only one target location. Based on this, this embodiment needs to continuously adjust the mask image during the feature processing at multiple time steps so that the shape of the first region indicated by the mask image has a high degree of matching with the outline of the image content indicated by the target text.

[0053] The aforementioned specified time step can be all or part of multiple time steps. Multiple time steps constitute the total steps, and the specified time step can be the first 50% of the multiple time steps. That is, in the multiple time steps of feature fusion and feature denoising, the first region in the mask image will continuously change in the first 50% of the time steps, while in the latter 50% of the time steps, the first region no longer changes, and further processing is mainly performed on the features of the image content within the first region.
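The split between mask-adjusting and mask-frozen time steps described above can be sketched as follows. This is only an illustration: the function name `mask_update_steps` and the `fraction` parameter are hypothetical, and 0.5 is simply the example value given in the text.

```python
def mask_update_steps(total_steps: int, fraction: float = 0.5) -> list[int]:
    """Return the indices of the time steps at which the mask may be adjusted.

    Hypothetical helper: the first `fraction` of the steps are the
    "specified time steps"; in the remaining steps the first region is frozen.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    if not 0.0 <= fraction <= 1.0:
        raise ValueError("fraction must lie in [0, 1]")
    return list(range(int(total_steps * fraction)))


# With 50 time steps, the mask is adjusted in steps 0..24 and frozen afterwards.
specified = mask_update_steps(50)
```

Setting `fraction` to 1.0 corresponds to the case where the specified time steps are all of the time steps.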

[0054] Within a specified time step, the mask image is adjusted before feature fusion and denoising. Specifically, if the specified time step is the first time step, the mask image can be adjusted based on the similarity between the initial image and the target text. If the specified time step is a time step following the first time step, the features obtained after feature fusion and denoising in the previous time step are decoded to obtain an intermediate image. The similarity between the intermediate image of the previous time step and the target text is calculated using a preset algorithm, and the first region is adjusted based on this similarity. For example, regions with high similarity may have their edges expanded, while regions with low similarity may have their edges shrunk. After adjusting the region edges, the area and shape of the first region may change.

[0055] After the mask image is adjusted at a specified time step, feature fusion and feature denoising are performed based on the adjusted mask image at that time step.

[0056] In the above method, a mask image is generated based on the location provided by the user, and the editing area indicated by the mask image is continuously adjusted in multiple time steps to make the shape of the editing area more consistent with the image content indicated by the target text, thereby further improving the image editing effect.

[0057] In one specific implementation, a potential energy map corresponding to the initial image is generated with the target location as the center; wherein, in the potential energy map, a pixel location closer to the target location has a larger potential energy value. Based on a preset potential energy threshold, the potential energy map is binarized to obtain the mask image; wherein, in the potential energy map, potential energy values greater than the preset potential energy threshold are set to a first value, potential energy values less than or equal to the preset potential energy threshold are set to a second value, and the pixel locations corresponding to the first value form the first region.

[0058] The potential energy value in the potential energy map indicates the distance of the pixel location containing that potential energy value from the target location. For example, the potential energy value at the target location is 1, the potential energy value at the pixel location adjacent to the target location is 0.9, the potential energy value at the pixel location further away is 0.8, and so on, until the potential energy value becomes 0.

[0059] If the potential energy value decreases rapidly with distance, the pixel positions with potential energy values greater than 0 form a local region in the potential energy map. This local region may be circular, elliptical, or another shape. If the potential energy value decreases slowly with distance, the potential energy values of all pixel positions in the potential energy map may be greater than 0.

[0060] The purpose of generating the potential energy map in this embodiment is to indicate the distance between each pixel location and the target location using the potential energy value of each pixel. Pixel locations that are closer have larger potential energy values, and are more likely to be included in the first region; conversely, pixel locations that are farther away have smaller potential energy values, and are less likely to be included in the first region. This ensures that the first region in the image includes the target location selected by the user, thus satisfying the user's location requirements.

[0061] The aforementioned preset potential energy threshold can be set according to requirements. When the preset potential energy threshold is large, the number of pixel positions with a potential energy value greater than the threshold is small, and therefore the area of the first region is small. In one example, when the potential energy value ranges from 1 to 0, if the preset potential energy threshold is 0.9, then the pixel positions corresponding to potential energy values greater than 0.9 form the first region A; if the preset potential energy threshold is 0.8, then the pixel positions corresponding to potential energy values greater than 0.8 form the first region B, and the area of the first region B is greater than the area of the first region A.

[0062] The aforementioned first and second values are different numerical values. For example, the first value can be 1 and the second value can be 0, or the first value can be 2 and the second value can be 1, etc. The potential energy map contains many different potential energy values, whereas after binarization the pixel values in the mask image take only two values: potential energy values greater than the preset potential energy threshold are set to the first value, and potential energy values less than or equal to the preset potential energy threshold are set to the second value.

[0063] In one example, the pixel values in the mask image include 1 and 0. 1 represents the first value, and the pixel locations containing this first value form the aforementioned first region. 0 represents the second value, and the pixel locations containing this second value form the image region outside the first region. It should be noted that the target location typically has a large potential energy value; therefore, the target location is always located within the first region, ensuring that the first region indicated by the mask image never deviates from the user-selected target location.
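The binarization and the effect of the threshold on the area of the first region can be illustrated with a short NumPy sketch. It assumes a potential that drops by 0.1 per ring of neighbouring pixels, as in the earlier numerical example; the function names, the 21×21 size and the use of Chebyshev distance are choices made here for illustration, not details from the disclosure.

```python
import numpy as np


def linear_potential_map(height, width, target, step=0.1):
    """Potential is 1 at `target` (row, col) and drops by `step` per pixel ring."""
    rows, cols = np.mgrid[0:height, 0:width]
    # Chebyshev distance: every neighbour ring around the target is one unit away.
    dist = np.maximum(np.abs(rows - target[0]), np.abs(cols - target[1]))
    return np.clip(1.0 - step * dist, 0.0, 1.0)


def binarize(potential, threshold, first_value=1, second_value=0):
    """Values strictly above the threshold become first_value, the rest second_value."""
    return np.where(potential > threshold, first_value, second_value).astype(np.uint8)


potential = linear_potential_map(21, 21, target=(10, 10))
mask_a = binarize(potential, threshold=0.9)  # first region A
mask_b = binarize(potential, threshold=0.8)  # first region B

# A lower threshold admits more pixels, so region B is larger than region A.
area_a, area_b = int(mask_a.sum()), int(mask_b.sum())
```

In both masks the target position itself carries the first value, which matches the statement that the first region always contains the user-selected location.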

[0064] Furthermore, a potential energy map in Gaussian distribution form corresponding to the initial image is generated, centered on the target location. The potential energy value in the potential energy map can be understood as the height potential energy of the corresponding pixel location. A height potential energy field is generated around the target location, which is Gaussian distributed with the target location as the center; the potential energy value is the largest at the target location.
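A Gaussian potential energy map of this kind could be built as in the sketch below. The name `gaussian_potential_map` and the `sigma` parameter (which controls how quickly the potential decays with distance) are assumptions made for illustration; the disclosure does not specify a spread value.

```python
import numpy as np


def gaussian_potential_map(height, width, target, sigma=8.0):
    """Gaussian potential centred on `target` (row, col), with peak value 1."""
    rows, cols = np.mgrid[0:height, 0:width]
    sq_dist = (rows - target[0]) ** 2 + (cols - target[1]) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))


# Potential field for a 32x48 image with the user's click at row 10, column 20.
potential = gaussian_potential_map(32, 48, target=(10, 20))
```

A small `sigma` gives a rapidly decaying field and hence a compact initial first region after thresholding; a large `sigma` gives a slowly decaying field.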

[0065] As can be seen from the foregoing embodiments, the first region in the mask image will be continuously adjusted within a specified time step so that the region outline of the first region matches the image content corresponding to the target text.

[0066] In one specific adjustment method, at a specified time step in multiple time steps, the similarity between the intermediate image corresponding to the previous time step and the target text is determined; the gradient of the similarity relative to the mask image is determined, and the gradient is superimposed on the potential energy map corresponding to the mask image to obtain an updated potential energy map; based on a preset potential energy threshold, the updated potential energy map is binarized to obtain an updated mask image; wherein, in the updated mask image, the first region is updated.

[0067] In practical implementation, the latent space vector output at a specified time step is decoded to obtain the intermediate image. Both the intermediate image and the target text can be mapped into a feature space to obtain the image features corresponding to the intermediate image and the text features corresponding to the target text. Then, the similarity between the image features and the text features is calculated. This similarity can be represented by cosine distance, Euclidean distance, or other distance functions.
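As a minimal illustration of the similarity computation, the sketch below compares two feature vectors with cosine similarity. The vectors stand in for the image and text features, which in practice would come from the encoders; the convention that cosine distance equals one minus cosine similarity is the common definition and is assumed here, since the text uses the distance directly as the similarity indicator.

```python
import numpy as np


def cosine_similarity(a, b, eps=1e-8):
    """Cosine of the angle between two feature vectors (flattened)."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity (0 means same direction)."""
    return 1.0 - cosine_similarity(a, b)


# Placeholder feature vectors, not real encoder outputs.
image_features = [0.2, 0.9, 0.1]
text_features = [0.1, 0.8, 0.3]
similarity = cosine_similarity(image_features, text_features)
```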

[0068] In one specific implementation, the region image content of the first region indicated by the mask image is extracted from the intermediate image corresponding to the previous time step of the specified time step; the cosine distance between the region image content and the target text is calculated; wherein the cosine distance indicates the similarity between the region image content and the target text.

[0069] Since this embodiment performs local editing on the initial image, it only needs to focus on the similarity between the first region and the target text. The first region is segmented from the intermediate image to obtain the region image features corresponding to the first region. The cosine distance between the region image features and the text features of the target text is calculated. The cosine distance is used to indicate the similarity between the current region image content of the first region and the target text.

[0070] In one example, the intermediate image, the current mask image, and the target text are input into the aforementioned Alpha-CLIP model, which calculates the cosine distance between the feature vectors of the intermediate image and the feature vectors of the target text.

[0071] Furthermore, the text features of the target text can specifically be the latent space vector of the target text. The gradient of the similarity relative to the mask image is calculated through backpropagation; specifically, the gradient of the similarity relative to the downsampled mask image can be calculated, that is, the gradient of the cosine distance relative to the downsampled mask image. The larger the gradient, the more important the pixel position is to the image content corresponding to the target text, and the greater the probability that the pixel position belongs to the first region.

[0072] After calculating the gradient, the absolute value of the gradient is calculated and then superimposed onto the potential energy map corresponding to the mask image to obtain the updated potential energy map. In actual implementation, a corresponding gradient value is calculated for each pixel position in the mask image, and this gradient value is superimposed on the potential energy value at the pixel position. When the absolute value of the gradient is greater than 0, the potential energy value at that pixel position will increase, and therefore, the probability of that pixel position being updated to the first region will increase.

[0073] It should be noted that at a specified time step, what is actually updated is the potential energy map corresponding to the mask image. That is, the potential energy values in the potential energy map are updated after the gradient values are superimposed. Then, the updated potential energy map is binarized based on the preset potential energy threshold to obtain the updated mask image.
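The update of the potential energy map and the subsequent re-binarization can be sketched as below. The gradient array here is a hand-written placeholder; in the described method it would come from backpropagating the similarity through a model such as Alpha-CLIP. The `step_size` scaling factor is a hypothetical addition, and the sketch assumes the gradient and the potential energy map have the same resolution.

```python
import numpy as np


def update_potential_and_mask(potential, grad, threshold, step_size=1.0):
    """Add |grad| to the potential map, then threshold it into a new mask."""
    updated = potential + step_size * np.abs(grad)
    mask = (updated > threshold).astype(np.uint8)
    return updated, mask


potential = np.array([[0.2, 0.6],
                      [0.4, 0.9]])
# Placeholder gradient of the similarity with respect to the mask.
grad = np.array([[0.0, -0.1],
                 [0.3, 0.0]])

updated, mask = update_potential_and_mask(potential, grad, threshold=0.5)
```

In this toy case the bottom-left pixel starts below the threshold (0.4) and joins the first region after the update (0.7), while the top-left pixel, whose gradient is zero, stays outside.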

[0074] In one approach, at a specified time step, in addition to updating the potential energy map, the preset potential energy threshold is also increased. The preset potential energy threshold can be gradually increased at each specified time step. That is, as the specified time steps progress, the preset potential energy threshold gradually increases, and the potential energy values in the regions of the potential energy map that are important to the image content corresponding to the target text also gradually increase, ensuring that the potential energy values in the important regions always remain above the preset potential energy threshold. These regions form the first region, in which the image content corresponding to the target text is edited.

[0075] For pixel locations where the potential energy value increases slowly, when the potential energy value is below a preset potential energy threshold, the pixel location will be updated to a pixel location outside the first region. This can be understood as the first region shrinking at that pixel location. For pixel locations where the potential energy value increases rapidly, which may be located outside the first region, as the potential energy value increases, when the potential energy value exceeds the preset potential energy threshold, the pixel location will be updated to the first region. This can be understood as the first region expanding at that pixel location.
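The shrinking and expanding behaviour can be reproduced with a two-pixel toy example. All numbers below are invented for illustration, and the order of operations (superimpose the gradient, raise the threshold, then binarize) is an assumption, since the text does not fix it.

```python
import numpy as np


def evolve_mask(potential, grads_per_step, threshold, threshold_increment):
    """Track the mask while the potential and the threshold both increase."""
    potential = np.asarray(potential, dtype=np.float64).copy()
    history = []
    for grad in grads_per_step:
        potential += np.abs(grad)          # superimpose the gradient magnitude
        threshold += threshold_increment   # raise the threshold for this step
        history.append((potential > threshold).astype(np.uint8))
    return history


# Pixel 0 starts inside the first region (0.6 > 0.5) but grows slowly;
# pixel 1 starts outside (0.3) but grows quickly.
grads = [np.array([0.02, 0.25])] * 3
history = evolve_mask([0.6, 0.3], grads, threshold=0.5, threshold_increment=0.1)
```

After the second step the slowly growing pixel has fallen below the rising threshold (the region shrinks there), while the quickly growing pixel has overtaken it (the region expands there).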

[0076] Therefore, the final area and shape of the first region are determined by the image content indicated by the target text.

[0077] In one specific implementation, at the first time step, the first Hadamard product of the text features of the target text and the downsampled mask image is calculated, and the second Hadamard product of the image features of the initial image and the inverted image of the downsampled mask image is calculated. The inverted image is used to indicate regions outside the first region in the initial image. The first and second Hadamard products are fused to obtain fused features. The fused features are then input into a denoising network for processing to obtain denoised features.

[0078] In time steps after the first time step, the first Hadamard product of the text features of the target text and the downsampled mask image is calculated, and the third Hadamard product of the denoised features corresponding to the previous time step and the inverted image of the downsampled mask image is calculated. The first and third Hadamard products are fused to obtain the fused features. The fused features are input into the denoising network for processing to obtain the denoised features. The denoised features of the last time step are decoded to obtain the target image.

[0079] The feature fusion process for a given time step can be expressed by the following formula: $z_t = z_{fg} \odot m_{latent} + z_{bg} \odot (1 - m_{latent})$

[0080] Here, $z_t$ is the fused feature; $z_{fg}$ is the text feature of the target text, which can be the latent space vector of the target text. Since the initial image is locally edited based on the target text, the text feature of the target text can also be understood as the foreground vector; $\odot$ is the Hadamard product operator.

[0081] $m_{latent}$ is the downsampled mask image, which indicates the location of the first region. The first region can be understood as the region of the foreground image generated based on the target text; the region outside the first region, indicated by the inverted image, can be understood as the region of the background image provided by the initial image. Since the pixel values in the mask image consist of 0s and 1s, $(1 - m_{latent})$ means that the downsampled mask image is inverted to obtain the inverted image. The inversion process can be understood as pixel value 0 being inverted into pixel value 1, and pixel value 1 being inverted into pixel value 0. Therefore, in the inverted image, the pixel positions with pixel value 1 form the area outside the first region, and the pixel positions with pixel value 0 form the first region.

[0082] $z_{bg}$ is the denoised feature from the previous time step; when the time step is the first time step, $z_{bg}$ is the image feature of the initial image, that is, the latent space vector of the initial image.

[0083] At each time step, $z_{fg}$ and $z_{bg}$ are fused using the downsampled mask image $m_{latent}$ as the weight. For a pixel location, if the pixel value of $m_{latent}$ at that location is 1, that pixel location uses the feature of $z_{fg}$ at that location; if the pixel value of $m_{latent}$ at that location is 0, that pixel location uses the feature of $z_{bg}$ at that location.

[0084] Based on this, $m_{latent}$ determines which regions in the initial image are edited and which are preserved; in the last time step, after the denoised features are output, they are decoded to obtain the target image.
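
The fusion formula can be written in a few lines of array code. This is a sketch only: the function name `fuse_latents`, the channel-first layout (channels, height, width), and the constant-valued latents are assumptions chosen so that the effect of the mask is easy to see.

```python
import numpy as np

def fuse_latents(z_fg, z_bg, m_latent):
    """Mask-weighted fusion: z_fg where the mask is 1, z_bg where it is 0."""
    m = m_latent.astype(z_fg.dtype)        # (H, W) mask broadcasts over channels
    return z_fg * m + z_bg * (1.0 - m)     # element-wise (Hadamard) products

# Hypothetical latents with 4 channels on a 2x2 grid.
z_fg = np.full((4, 2, 2), 1.0)             # foreground (text-driven) features
z_bg = np.full((4, 2, 2), -1.0)            # background (image-driven) features
m_latent = np.array([[1, 0],
                     [0, 0]])              # only the top-left location is edited

z_t = fuse_latents(z_fg, z_bg, m_latent)
```

Here only the top-left location of `z_t` takes its values from `z_fg`; every other location keeps the values of `z_bg`.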

[0085] The above method allows editing to proceed within the specified region while preserving the image details outside that region.

[0086] Based on the above formula, refer to Figure 2 for a schematic diagram of the image generation process.

[0087] To achieve local editing of the initial image, the input data includes the initial image, the target text, and the target location. The initial image can be encoded using an image encoder based on a Stable Diffusion model to generate image features, namely the latent space vector $z_{int}$. In the first time step, $z_{bg}$ is $z_{int}$; in subsequent time steps, $z_{bg}$ is the denoised feature Z0' output from the previous time step.

[0088] $z_{fg}$, the text feature of the target text, can be obtained by inputting the target text into a text encoder to output a feature p, and then denoising the feature p to generate $z_{fg}$. After $z_{fg}$ and $z_{bg}$ are fused based on the mask image Mt, the result is fed into a denoising network together with the feature p. The network outputs the denoised feature Z0' at the current time step. This denoised feature Z0' is then decoded by an image decoder to generate an intermediate image.

[0089] A potential energy map is generated based on the target location. The feature p, the intermediate image, and the mask image Mt are input into the Alpha-CLIP model, and the gradient is output. This gradient is superimposed on the potential energy map to generate the mask image Mt-1. The mask image Mt at the next time step t is updated based on the mask image Mt-1.

[0090] In the last time step, the denoising network outputs the denoised feature Z0, which is then input into the image decoder to output the target image.
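
The data flow across time steps can be sketched as a simple loop. This is a control-flow illustration only: `toy_denoise` is a hypothetical stand-in that merely averages its inputs and does not model the real denoising network, the image decoder and the mask update are omitted, and all shapes and values are invented.

```python
import numpy as np

def toy_denoise(z_t, p):
    # Hypothetical stand-in for the denoising network: averages z_t with p.
    return 0.5 * (z_t + p)

def run_time_steps(z_int, z_fg, p, masks):
    """Fuse, then denoise, once per time step; return the final latent."""
    z_bg = z_int                                  # first step: image latent
    for m_t in masks:                             # one mask per time step
        z_t = z_fg * m_t + z_bg * (1.0 - m_t)     # mask-weighted fusion
        z_bg = toy_denoise(z_t, p)                # output feeds the next step
    return z_bg

z_int = np.zeros((1, 2))          # hypothetical initial-image latent
z_fg = np.ones((1, 2))            # hypothetical text-driven latent
p = np.ones((1, 2))               # hypothetical text encoder feature
mask = np.array([[1.0, 0.0]])     # first location edited, second preserved

z_out = run_time_steps(z_int, z_fg, p, [mask, mask])
```

The point of the sketch is the recurrence: the denoised output of one step becomes the background input of the next, while the foreground input stays fixed. In the described method the final latent would then be passed to the image decoder.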

[0091] Figure 3 shows three examples. In the first example, the target text is "a bowl," the black dots represent the target location, and a first region is generated based on this location. The dashed line area represents the first region, and the second region in the target image corresponding to the first region ultimately generates a bowl. In the second example, the target text is "a calf," the black dots represent the target location, the first region is generated based on this location, the dashed line area represents the first region, and the second region in the target image corresponding to the first region ultimately generates a calf. In the third example, the target text is "people swimming," the black dots represent the target location, the first region is generated based on this location, the dashed line area represents the first region, and the second region in the target image corresponding to the first region ultimately generates four people swimming.

[0092] In Figure 4, after the user clicks on a target location in the initial image, a first region is generated based on that target location. The dotted line area represents the first region. When the target text is "tombstone", a target image 1 containing a tombstone is generated; when the target text is "cartoon car", a target image 2 containing a cartoon car is generated; and when the target text is "snake", a target image 3 containing a snake is generated.

[0093] Figure 5 illustrates the adjustment process of the first region indicated by the mask image. In the first example, the target text is "a giraffe." Initially, the first region is an approximately circular area. After continuous adjustments, the shape of the first region approximates the shape of a giraffe, ultimately generating a target image containing a giraffe. In the second example, the target text is "snow mountain." Initially, the first region is an approximately circular area. After continuous adjustments, the shape of the first region approximates the shape of a mountain, ultimately generating a target image containing a snow mountain. It should be noted that in the examples in Figure 5, the percentage represents the position of the current time step within the total time steps. For example, when the current time step is 48%, the final mask image is determined, and the shape of the first region no longer changes.

[0094] The image generation method provided in this embodiment only requires the user to provide a reference point (the aforementioned target location) and a text description via mouse click or touch input to achieve precise local image editing. During the editing process, a mask image is dynamically generated based on this reference point. This mask image will dynamically evolve and adjust according to semantic loss, ultimately generating a mask image that matches the image content indicated by the text description, thereby achieving local image editing.

[0095] This embodiment dynamically generates a mask image using only the single point location clicked by the user and a text description, eliminating the need for the user to manually create a mask image or provide a text description of the editing area. This simplifies the required user input and reduces reliance on input data. This embodiment enables real-time local image editing, improves the interactivity of the editing process, is easy to operate, and enables efficient and accurate local image editing.

[0096] Corresponding to the above method embodiments, referring to Figure 6, an image generation apparatus includes:

[0097] The position determination module 60 is configured to determine, in response to a position selection instruction for the initial image, the target position on the initial image;

[0098] The mask generation module 62 is configured to generate a mask image corresponding to the initial image based on the target location; wherein the mask image is used to indicate the first region to be edited in the initial image;

[0099] The image generation module 64 is configured to perform multi-time-step feature fusion and feature denoising based on the mask image, combining the image features of the initial image and the text features of the target text, until the target image is generated; wherein, the second region in the target image contains the image content corresponding to the target text, and the region outside the second region contains the image content corresponding to the initial image; the second region corresponds to the first region.

[0100] The image generation apparatus described above, in response to a position selection instruction for an initial image, determines a target position on the initial image; based on the target position, it generates a mask image corresponding to the initial image; wherein the mask image is used to: indicate a first region to be edited in the initial image; based on the mask image, it performs multi-time-step feature fusion and feature denoising processing on the image features of the initial image and the text features of the target text until a target image is generated; wherein a second region in the target image contains the image content corresponding to the target text, and the region outside the second region contains the image content corresponding to the initial image; the second region corresponds to the first region.

[0101] In this method, the user only needs to provide an image location, and a mask image can be generated based on that location. This mask image indicates the area to be edited, allowing the user to edit the image content indicated by the target text within that area. This method improves image editing efficiency while ensuring good editing results.

[0102] The aforementioned apparatus further includes a mask adjustment module, configured to adjust, at a specified time step among the multiple time steps, the first region in the mask image based on the similarity between the intermediate image corresponding to the previous time step and the target text; wherein the intermediate image is generated based on the features obtained after the feature fusion and feature denoising processing of the previous time step.

[0103] The mask generation module is configured to perform the following: generate a potential energy map corresponding to the initial image, centered on the target location; wherein, in the potential energy map, the pixel location closer to the target location has a larger potential energy value; and perform binarization processing on the potential energy map based on a preset potential energy threshold to obtain a mask image; wherein, in the potential energy map, potential energy values greater than the preset potential energy threshold are set as a first value, and potential energy values less than or equal to the preset potential energy threshold are set as a second value, and the pixel locations corresponding to the first value form a first region.

[0104] The mask generation module described above is configured to perform the following: generate a potential energy map in Gaussian distribution form corresponding to the initial image, centered on the target location.
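
A Gaussian potential energy map and its binarization can be sketched as follows. The function names, the (row, column) convention for the target location, the standard deviation `sigma`, and the threshold of 0.5 are all assumptions for illustration; the source does not give concrete values.

```python
import numpy as np

def gaussian_potential(height, width, center, sigma):
    """Potential energy map that peaks at `center` and decays with distance."""
    cy, cx = center
    ys, xs = np.mgrid[0:height, 0:width]            # pixel row and column grids
    dist2 = (ys - cy) ** 2 + (xs - cx) ** 2          # squared distance to center
    return np.exp(-dist2 / (2.0 * sigma ** 2))

def binarize(potential, threshold):
    """First value (1) above the threshold, second value (0) otherwise."""
    return (potential > threshold).astype(np.uint8)

potential = gaussian_potential(5, 5, center=(2, 2), sigma=1.0)
mask = binarize(potential, threshold=0.5)
```

With these values the initial first region is the clicked pixel plus its four direct neighbors, which matches the roughly circular starting region described for Figure 5.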

[0105] The aforementioned mask adjustment module is configured to perform the following: in a specified time step among multiple time steps, determine the similarity between the intermediate image corresponding to the previous time step and the target text; determine the gradient of the similarity relative to the mask image, and superimpose the gradient onto the potential energy map corresponding to the mask image to obtain an updated potential energy map; based on a preset potential energy threshold, perform binarization processing on the updated potential energy map to obtain an updated mask image; wherein, in the updated mask image, the first region is updated.

[0106] The aforementioned mask adjustment module is configured to perform the following: extract the region image content of the first region indicated by the mask image from the intermediate image corresponding to the previous time step of the specified time step; calculate the cosine distance between the region image content and the target text; wherein the cosine distance indicates the similarity between the region image content and the target text.
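
The cosine-based comparison can be sketched on plain vectors. This example assumes the region image content and the target text have already been mapped to embedding vectors of equal length by some encoder; the vectors below are invented, and the convention that distance equals one minus similarity is an assumption rather than something the source states.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def cosine_distance(a, b):
    """Assumed convention: distance = 1 - similarity (smaller = more similar)."""
    return 1.0 - cosine_similarity(a, b)

region_embedding = [0.6, 0.8, 0.0]   # hypothetical embedding of the region content
text_embedding = [0.6, 0.8, 0.0]     # hypothetical embedding of the target text
unrelated_embedding = [0.0, 0.0, 1.0]

same = cosine_distance(region_embedding, text_embedding)
different = cosine_distance(region_embedding, unrelated_embedding)
```

Under this convention a distance near 0 indicates that the region content matches the target text, and a distance near 1 indicates no alignment.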

[0107] The mask adjustment module described above is configured to perform the following: calculate the gradient of similarity relative to the downsampled mask image.
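
To make the idea of a gradient with respect to the downsampled mask concrete, the sketch below estimates it by central finite differences. This is a swapped-in technique for illustration: a real system would more likely obtain the gradient by backpropagation through the similarity model. The `toy_similarity` function and its `weights` are hypothetical and stand in for that model.

```python
import numpy as np

def numerical_gradient(similarity_fn, m_latent, eps=1e-4):
    """Central-difference estimate of d(similarity)/d(mask) per mask entry."""
    m = np.asarray(m_latent, dtype=float)
    grad = np.zeros_like(m)
    for idx in np.ndindex(m.shape):
        plus, minus = m.copy(), m.copy()
        plus[idx] += eps                       # nudge one mask entry up
        minus[idx] -= eps                      # and down
        grad[idx] = (similarity_fn(plus) - similarity_fn(minus)) / (2.0 * eps)
    return grad

# Hypothetical similarity: a weighted sum of mask entries, so the true
# gradient is simply `weights`.
weights = np.array([[0.3, -0.2],
                    [0.0, 0.5]])

def toy_similarity(m):
    return float(np.sum(weights * m))

m_latent = np.array([[1, 0],
                     [0, 1]])
grad = numerical_gradient(toy_similarity, m_latent)
```

Entries with a large gradient magnitude are those where changing the mask would change the similarity most, which is why their potential energy is raised in the update step.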

[0108] The mask adjustment module is configured to perform the following: calculate the absolute value of the gradient, superimpose the absolute value onto the potential energy map corresponding to the mask image, and obtain an updated potential energy map.

[0109] The aforementioned device also includes a potential energy value increasing module, configured to perform: increasing a preset potential energy threshold.

[0110] The image generation module described above is configured to perform the following: In the first time step, calculate the first Hadamard product of the text features of the target text and the downsampled mask image, and calculate the second Hadamard product of the image features of the initial image and the inverted image of the downsampled mask image; wherein the inverted image is used to: indicate the region outside the first region in the initial image; fuse the first Hadamard product and the second Hadamard product to obtain a fused feature; input the fused feature into a denoising network for processing to obtain a denoised feature; In subsequent time steps of the first time step, calculate the first Hadamard product of the text features of the target text and the downsampled mask image, and calculate the third Hadamard product of the denoised feature corresponding to the previous time step of the specified time step and the inverted image of the downsampled mask image; fuse the first Hadamard product and the third Hadamard product to obtain a fused feature; input the fused feature into a denoising network for processing to obtain a denoised feature; decode the denoised feature of the last time step to obtain the target image.

[0111] The image generation method and apparatus provided in this embodiment dynamically generate a mask image using only a single reference position clicked by the user and a content description, reducing the operational complexity for users during image editing. This simplified operation allows non-professional users to easily perform local image editing without needing advanced image processing skills or in-depth understanding of the details of mask creation.

[0112] The image generation method in this embodiment can complete the image editing task in approximately one second, which greatly speeds up the editing process and improves real-time performance. This rapid response capability makes this embodiment suitable for application scenarios that require instant feedback, such as online image editing tools or mobile applications.

[0113] This embodiment uses semantic loss guidance to ensure semantic consistency between the edited content and the user-provided text description, thereby improving the accuracy and naturalness of the editing results. This method can more accurately understand and execute the user's editing intentions, generating high-quality results that seamlessly integrate with the original image.

[0114] Because this embodiment employs a training-free editing workflow, it reduces the demand for computing resources. It can run on resource-constrained devices, such as smartphones or tablets, without relying on high-performance computing equipment.

[0115] This embodiment also provides an electronic device, including a processor and a memory. The memory stores computer-executable instructions that can be executed by the processor, and the processor executes the computer-executable instructions to implement the above-described image generation method. This electronic device can be a server or a terminal device.

[0116] Referring to Figure 7, the electronic device includes a processor 100 and a memory 101. The memory 101 stores computer-executable instructions that can be executed by the processor 100. The processor 100 executes the computer-executable instructions to implement the above-described image generation method.

[0117] Furthermore, the electronic device shown in Figure 7 also includes a bus 102 and a communication interface 103, with the processor 100, the communication interface 103, and the memory 101 connected via the bus 102.

[0118] The memory 101 may include high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk storage device. Communication between this system network element and at least one other network element is achieved through at least one communication interface 103 (which can be wired or wireless), and may use the Internet, a wide area network, a local area network, or a metropolitan area network. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, or the like. The bus can be divided into an address bus, a data bus, a control bus, and so on. For ease of illustration, only a single bidirectional arrow is used in Figure 7, but this does not indicate that there is only one bus or one type of bus.

[0119] The processor 100 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by the integrated logic circuitry in the hardware of the processor 100 or by instructions in software form. The processor 100 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this disclosure. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the methods disclosed in the embodiments of this disclosure can be directly manifested as execution by a hardware decoding processor, or execution by a combination of hardware and software modules in the decoding processor. The software module can reside in a readily available storage medium in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in memory 101, and the processor 100 reads the information from memory 101 and, in conjunction with its hardware, completes the steps of the method described in the foregoing embodiments.

[0120] The processor in the aforementioned electronic device, by executing computer-executable instructions, can perform the following operations in the aforementioned image generation method:

[0121] An image generation method, in response to a position selection instruction for an initial image, determines a target position on the initial image; based on the target position, generates a mask image corresponding to the initial image; wherein the mask image is used to: indicate a first region to be edited in the initial image; based on the mask image, performs multi-time-step feature fusion and feature denoising processing on the image features of the initial image and the text features of the target text until a target image is generated; wherein a second region in the target image contains the image content corresponding to the target text, and the region outside the second region contains the image content corresponding to the initial image; the second region corresponds to the first region.

[0122] In a specified time step within multiple time steps, the first region in the mask image is adjusted based on the similarity between the intermediate image corresponding to the previous time step and the target text; wherein, the intermediate image is generated based on the features generated after feature fusion and feature denoising processing of the previous time step.

[0123] A potential energy map corresponding to the initial image is generated with the target location as the center. In the potential energy map, the pixel location closer to the target location has a larger potential energy value. Based on a preset potential energy threshold, the potential energy map is binarized to obtain a mask image. In the potential energy map, potential energy values greater than the preset potential energy threshold are set as the first value, and potential energy values less than or equal to the preset potential energy threshold are set as the second value. The pixel locations corresponding to the first value form the first region.

[0124] Generate a potential energy map in Gaussian distribution form corresponding to the initial image, centered on the target location.

[0125] In a specified time step within multiple time steps, the similarity between the intermediate image corresponding to the previous time step and the target text is determined; the gradient of the similarity relative to the mask image is determined, and the gradient is superimposed on the potential energy map corresponding to the mask image to obtain an updated potential energy map; based on a preset potential energy threshold, the updated potential energy map is binarized to obtain an updated mask image; wherein, in the updated mask image, the first region is updated.

[0126] Extract the region image content of the first region indicated by the mask image from the intermediate image corresponding to the previous time step of the specified time step; calculate the cosine distance between the region image content and the target text; where the cosine distance indicates the similarity between the region image content and the target text.

[0127] Calculate the gradient of similarity relative to the downsampled mask image.

[0128] Calculate the absolute value of the gradient and superimpose it onto the potential energy map corresponding to the mask image to obtain the updated potential energy map.

[0129] Increase the preset potential energy threshold.

[0130] In the first time step, the first Hadamard product of the text features of the target text and the downsampled mask image is calculated, and the second Hadamard product of the image features of the initial image and the inverted image of the downsampled mask image is calculated; wherein, the inverted image is used to: indicate the region outside the first region in the initial image; the first Hadamard product and the second Hadamard product are fused to obtain the fused feature; the fused feature is input into the denoising network for processing to obtain the denoised feature; in subsequent time steps of the first time step, the first Hadamard product of the text features of the target text and the downsampled mask image is calculated, and the third Hadamard product of the denoised feature corresponding to the previous time step and the inverted image of the downsampled mask image is calculated; the first Hadamard product and the third Hadamard product are fused to obtain the fused feature; the fused feature is input into the denoising network for processing to obtain the denoised feature; the denoised feature of the last time step is decoded to obtain the target image.

[0131] In this method, the user only needs to provide an image location, and a mask image can be generated based on that location. This mask image indicates the area to be edited, allowing the user to edit the image content indicated by the target text within that area. This method improves image editing efficiency while ensuring good editing results.

[0132] This embodiment also provides a computer-readable storage medium storing computer-executable instructions. When the computer-executable instructions are called and executed by a processor, the computer-executable instructions cause the processor to implement the above-described image generation method.

[0133] When executed, the computer-executable instructions stored in the aforementioned computer-readable storage medium can implement the following operations in the aforementioned image generation method:

[0134] An image generation method, in response to a position selection instruction for an initial image, determines a target position on the initial image; based on the target position, generates a mask image corresponding to the initial image; wherein the mask image is used to: indicate a first region to be edited in the initial image; based on the mask image, performs multi-time-step feature fusion and feature denoising processing on the image features of the initial image and the text features of the target text until a target image is generated; wherein a second region in the target image contains the image content corresponding to the target text, and the region outside the second region contains the image content corresponding to the initial image; the second region corresponds to the first region.

[0135] In a specified time step within multiple time steps, the first region in the mask image is adjusted based on the similarity between the intermediate image corresponding to the previous time step and the target text; wherein, the intermediate image is generated based on the features generated after feature fusion and feature denoising processing of the previous time step.

[0136] A potential energy map corresponding to the initial image is generated with the target location as the center. In the potential energy map, the pixel location closer to the target location has a larger potential energy value. Based on a preset potential energy threshold, the potential energy map is binarized to obtain a mask image. In the potential energy map, potential energy values greater than the preset potential energy threshold are set as the first value, and potential energy values less than or equal to the preset potential energy threshold are set as the second value. The pixel locations corresponding to the first value form the first region.

[0137] Generate a potential energy map in Gaussian distribution form corresponding to the initial image, centered on the target location.

[0138] In a specified time step within multiple time steps, the similarity between the intermediate image corresponding to the previous time step and the target text is determined; the gradient of the similarity relative to the mask image is determined, and the gradient is superimposed on the potential energy map corresponding to the mask image to obtain an updated potential energy map; based on a preset potential energy threshold, the updated potential energy map is binarized to obtain an updated mask image; wherein, in the updated mask image, the first region is updated.

[0139] Extract the region image content of the first region indicated by the mask image from the intermediate image corresponding to the previous time step of the specified time step; calculate the cosine distance between the region image content and the target text; where the cosine distance indicates the similarity between the region image content and the target text.

[0140] Calculate the gradient of similarity relative to the downsampled mask image.

[0141] Calculate the absolute value of the gradient and superimpose it onto the potential energy map corresponding to the mask image to obtain the updated potential energy map.

[0142] Increase the preset potential energy threshold.

[0143] In the first time step, the first Hadamard product of the text features of the target text and the downsampled mask image is calculated, and the second Hadamard product of the image features of the initial image and the inverted image of the downsampled mask image is calculated; wherein, the inverted image is used to: indicate the region outside the first region in the initial image; the first Hadamard product and the second Hadamard product are fused to obtain the fused feature; the fused feature is input into the denoising network for processing to obtain the denoised feature; in subsequent time steps of the first time step, the first Hadamard product of the text features of the target text and the downsampled mask image is calculated, and the third Hadamard product of the denoised feature corresponding to the previous time step and the inverted image of the downsampled mask image is calculated; the first Hadamard product and the third Hadamard product are fused to obtain the fused feature; the fused feature is input into the denoising network for processing to obtain the denoised feature; the denoised feature of the last time step is decoded to obtain the target image.

[0144] In this method, the user only needs to provide an image location, and a mask image can be generated based on that location. This mask image indicates the area to be edited, allowing the user to edit the image content indicated by the target text within that area. This method improves image editing efficiency while ensuring good editing results.

[0145] The computer program products of the image generation method, apparatus, and electronic device provided in this disclosure include a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the methods described in the preceding method embodiments. For specific implementation details, please refer to the method embodiments, which will not be repeated here.

[0146] Those skilled in the art will clearly understand that, for the sake of convenience and brevity, the specific working process of the system and apparatus described above can be referred to the corresponding process in the foregoing method embodiments, and will not be repeated here.

[0147] Furthermore, in the description of the embodiments of this disclosure, unless otherwise expressly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in this disclosure based on the specific circumstances.

[0148] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this disclosure, in essence, or the part that contributes to related technologies, or a portion of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of this disclosure. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0149] In the description of this disclosure, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings, and are only for the convenience of describing this disclosure and simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation, and therefore should not be construed as a limitation of this disclosure. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0150] Finally, it should be noted that the above embodiments are merely specific implementations of this disclosure, used to illustrate its technical solutions and not to limit them. The protection scope of this disclosure is not limited thereto. Although this disclosure has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed herein, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features. Such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to deviate from the spirit and scope of the technical solutions of the embodiments of this disclosure, and should all be covered within the protection scope of this disclosure. Therefore, the protection scope of this disclosure should be determined by the protection scope of the claims.

Claims

1. An image generation method, the method comprising: in response to a position selection instruction for an initial image, determining a target position on the initial image; generating, based on the target position, a mask image corresponding to the initial image, wherein the mask image is used to indicate a first region to be edited in the initial image; and based on the mask image, performing multi-time-step feature fusion and feature denoising processing on image features of the initial image and text features of a target text until a target image is generated, wherein a second region in the target image contains image content corresponding to the target text, a region outside the second region contains image content corresponding to the initial image, and the second region corresponds to the first region.

2. The method according to claim 1, wherein, after generating the mask image corresponding to the initial image based on the target position, the method further comprises: in a specified time step among the multiple time steps, adjusting the first region in the mask image based on a similarity between an intermediate image corresponding to the previous time step and the target text, wherein the intermediate image is generated based on the features obtained after the feature fusion and feature denoising processing of the previous time step.

3. The method according to claim 1, wherein the step of generating a mask image corresponding to the initial image based on the target position comprises: generating a potential energy map corresponding to the initial image with the target position as the center, wherein, in the potential energy map, a pixel position closer to the target position has a larger potential energy value; and binarizing the potential energy map based on a preset potential energy threshold to obtain the mask image, wherein, in the potential energy map, potential energy values greater than the preset potential energy threshold are set to a first value, potential energy values less than or equal to the preset potential energy threshold are set to a second value, and the pixel positions corresponding to the first value constitute the first region.

4. The method according to claim 3, wherein the step of generating a potential energy map corresponding to the initial image with the target position as the center comprises: generating, with the target position as the center, a potential energy map in the form of a Gaussian distribution corresponding to the initial image.
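
For illustration only, the mask generation recited in claims 3 and 4 can be sketched as follows. This is a minimal NumPy sketch under stated assumptions, not the applicant's implementation: the function names `gaussian_potential_map` and `binarize`, the parameter `sigma`, the image size, and the choice of 1 and 0 as the first and second values are all hypothetical.

```python
import numpy as np

def gaussian_potential_map(height, width, target_xy, sigma):
    """Potential energy map that peaks at target_xy and decays with distance.

    sigma (assumed parameter) controls how quickly the potential falls off.
    """
    x0, y0 = target_xy
    ys, xs = np.mgrid[0:height, 0:width]
    sq_dist = (xs - x0) ** 2 + (ys - y0) ** 2
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

def binarize(potential, threshold, first_value=1, second_value=0):
    """Values strictly greater than threshold become first_value, others second_value."""
    return np.where(potential > threshold, first_value, second_value).astype(np.uint8)

# Hypothetical example: a 64x64 image with the selected position at x=20, y=30.
potential = gaussian_potential_map(64, 64, target_xy=(20, 30), sigma=8.0)
mask = binarize(potential, threshold=0.5)  # pixels equal to 1 form the first region
```

With these assumed values the first region is a disc around the selected position; a larger `sigma` or a lower threshold would enlarge it.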

5. The method according to claim 2, wherein the step of adjusting, in a specified time step among the multiple time steps, the first region in the mask image based on the similarity between the intermediate image corresponding to the previous time step and the target text comprises: in the specified time step among the multiple time steps, determining the similarity between the intermediate image corresponding to the time step preceding the specified time step and the target text; determining a gradient of the similarity relative to the mask image, and superimposing the gradient onto the potential energy map corresponding to the mask image to obtain an updated potential energy map; and binarizing the updated potential energy map based on a preset potential energy threshold to obtain an updated mask image, wherein the first region is updated in the updated mask image.

6. The method according to claim 5, wherein the step of determining the similarity between the intermediate image corresponding to the time step preceding the specified time step and the target text comprises: extracting, from the intermediate image corresponding to the time step preceding the specified time step, the region image content of the first region indicated by the mask image; and calculating a cosine distance between the region image content and the target text, wherein the cosine distance indicates the similarity between the region image content and the target text.
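
As a rough illustration of claim 6, the sketch below masks out the first region of an intermediate image and compares two embedding vectors by cosine similarity. It assumes that the region image content and the target text have already been encoded into vectors of equal dimension; the claim does not name the encoders, so the embeddings here are hypothetical placeholder values, and `extract_region` and `cosine_similarity` are illustrative names.

```python
import numpy as np

def extract_region(intermediate_image, mask):
    """Keep pixels inside the first region (mask == 1) and zero out the rest.

    intermediate_image has shape (H, W, C); mask has shape (H, W).
    """
    return intermediate_image * mask[..., None]

def cosine_similarity(u, v, eps=1e-8):
    """Cosine of the angle between two vectors; eps guards against division by zero."""
    u = np.asarray(u, dtype=np.float64).ravel()
    v = np.asarray(v, dtype=np.float64).ravel()
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v) + eps))

# Hypothetical 4x4 RGB intermediate image and a 2x2 first region.
image = np.ones((4, 4, 3))
mask = np.zeros((4, 4))
mask[1:3, 1:3] = 1
region = extract_region(image, mask)

# Placeholder embeddings standing in for the outputs of image and text encoders.
region_embedding = np.array([0.2, 0.9, 0.1])
text_embedding = np.array([0.1, 0.8, 0.3])
similarity = cosine_similarity(region_embedding, text_embedding)
cosine_distance = 1.0 - similarity  # the conventional distance form of the same measure
```

A higher `similarity` (equivalently, a smaller `cosine_distance`) indicates that the region content is closer to the target text.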

7. The method according to claim 5, wherein the step of determining the gradient of the similarity relative to the mask image comprises: calculating the gradient of the similarity relative to the downsampled mask image.

8. The method according to claim 5, wherein the step of superimposing the gradient onto the potential energy map corresponding to the mask image to obtain an updated potential energy map comprises: calculating the absolute value of the gradient, and superimposing the absolute value onto the potential energy map corresponding to the mask image to obtain the updated potential energy map.

9. The method according to claim 5, wherein, before the step of binarizing the updated potential energy map based on a preset potential energy threshold to obtain an updated mask image, the method further comprises: increasing the preset potential energy threshold.
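
The mask update recited in claims 5, 8, and 9 can be sketched as below. This is an illustrative NumPy sketch only: it assumes the gradient of the similarity has already been computed (for example by automatic differentiation, which is not shown) and resized to the resolution of the potential energy map, and it interprets "superimposing" as element-wise addition. The name `update_mask` and the parameter `threshold_step` are hypothetical.

```python
import numpy as np

def update_mask(potential, grad, threshold, threshold_step=0.0):
    """Add |grad| to the potential map, raise the threshold, and re-binarize.

    grad is assumed to have the same shape as potential.
    """
    updated_potential = potential + np.abs(grad)
    new_threshold = threshold + threshold_step  # claim 9: threshold is increased first
    updated_mask = (updated_potential > new_threshold).astype(np.uint8)
    return updated_potential, updated_mask, new_threshold

# Hypothetical 2x2 example values.
potential = np.array([[0.2, 0.7],
                      [0.4, 0.9]])
grad = np.array([[-0.5, 0.0],
                 [0.1, 0.0]])
updated_potential, updated_mask, new_threshold = update_mask(
    potential, grad, threshold=0.5, threshold_step=0.1
)
```

In this example the top-left pixel enters the first region because the magnitude of its gradient is large, while the bottom-left pixel stays outside after the threshold is raised.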

10. The method according to claim 1, wherein the step of performing, based on the mask image, multi-time-step feature fusion and feature denoising processing on the image features of the initial image and the text features of the target text until the target image is generated comprises: in the first time step, calculating a first Hadamard product of the text features of the target text and a downsampled mask image, and calculating a second Hadamard product of the image features of the initial image and an inverted image of the downsampled mask image, wherein the inverted image is used to indicate the region outside the first region in the initial image; fusing the first Hadamard product result and the second Hadamard product result to obtain a fused feature, and inputting the fused feature into a denoising network for processing to obtain a denoised feature; in each time step subsequent to the first time step, calculating the first Hadamard product of the text features of the target text and the downsampled mask image, and calculating a third Hadamard product of the denoised feature corresponding to the previous time step and the inverted image of the downsampled mask image; fusing the first Hadamard product result and the third Hadamard product result to obtain a fused feature, and inputting the fused feature into the denoising network for processing to obtain a denoised feature; and decoding the denoised feature of the last time step to obtain the target image.
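
The per-time-step fusion of claim 10 can be illustrated with the following sketch. Several points are assumptions rather than statements of the claimed method: the text features are taken to be already projected to the same shape as the image features, the fusion of the two Hadamard products is taken to be element-wise addition, `toy_denoiser` is a placeholder standing in for the denoising network, and the final decoding step is omitted because the decoder is not specified here.

```python
import numpy as np

def fuse(text_feat, carry_feat, mask_ds):
    """Hadamard-mask both inputs and add them (addition is an assumed fusion)."""
    inverted = 1.0 - mask_ds              # indicates the region outside the first region
    first = text_feat * mask_ds           # first Hadamard product
    other = carry_feat * inverted         # second (step 1) or third (later steps) product
    return first + other

def toy_denoiser(feat, t):
    """Placeholder for the denoising network; simply scales the features."""
    return 0.9 * feat

def run_steps(image_feat, text_feat, mask_ds, num_steps, denoise=toy_denoiser):
    """Fuse and denoise for num_steps; returns the denoised feature of the last step."""
    carry = image_feat                    # step 1 uses the initial image features
    for t in range(num_steps):
        fused = fuse(text_feat, carry, mask_ds)
        carry = denoise(fused, t)         # later steps reuse the previous denoised feature
    return carry

# Hypothetical features of shape (C, H, W) and a downsampled mask of shape (H, W).
image_feat = np.zeros((4, 8, 8))
text_feat = np.ones((4, 8, 8))
mask_ds = np.zeros((8, 8))
mask_ds[2:5, 2:5] = 1.0
out = run_steps(image_feat, text_feat, mask_ds, num_steps=3, denoise=lambda f, t: f)
```

With the identity denoiser used in the example, the output equals the text features inside the first region and the image features elsewhere, which mirrors the relationship between the second region and the first region described in claim 1.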

11. An image generation apparatus, the apparatus comprising: a position determination module configured to determine, in response to a position selection instruction for an initial image, a target position on the initial image; a mask generation module configured to generate, based on the target position, a mask image corresponding to the initial image, wherein the mask image is used to indicate a first region to be edited in the initial image; and an image generation module configured to perform, based on the mask image, multi-time-step feature fusion and feature denoising processing on image features of the initial image and text features of a target text until a target image is generated, wherein a second region in the target image contains image content corresponding to the target text, a region outside the second region contains image content corresponding to the initial image, and the second region corresponds to the first region.

12. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the image generation method according to any one of claims 1 to 10.

13. A computer-readable storage medium storing computer-executable instructions, which, when invoked and executed by a processor, cause the processor to perform the image generation method according to any one of claims 1 to 10.