Image generation method, apparatus, and electronic device

By generating a target noisy image and performing denoising processing, the problem of subject insertion not conforming to identity characteristics and text prompts in the existing technology is solved, thereby improving the image generation effect and user experience.

CN122199714A · Pending · Publication Date: 2026-06-12 · HANGZHOU NETEASE ZHIQI TECH CO LTD

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Applications (China)
Current Assignee / Owner: HANGZHOU NETEASE ZHIQI TECH CO LTD
Filing Date: 2025-12-23
Publication Date: 2026-06-12

Abstract

The application provides an image generation method, apparatus, and electronic device. The method obtains a reference image, a background image, and text description information; generates a target noise image based on the reference image and the background image; determines a first image feature based on the target noise image and the text description information; and performs denoising processing on the target noise image based on the first image feature, a second image feature, and a third image feature to generate a target image. When the subject in the reference image is integrated into the background image according to the text description, the denoising simultaneously considers three image features: one determined from the target noise image and the text description, one determined from the noise image corresponding to the text description information, and one determined from the noise image formed only from the two images. The method thereby accounts for both the identity features of the subject and the display features in the text description, improving the display effect of the generated image and the user experience.

Description

Technical Field

[0001] This invention relates to the field of image processing technology, and more specifically, to an image generation method, apparatus, and electronic device.

Background Technology

[0002] Image customization is a fundamental task in the field of visual content generation, aiming to seamlessly integrate a user-specified subject into a new scene. Related technologies typically replace text embeddings with subject embeddings, fine-tune a model for a specific instance before inserting the subject, or use diffusion models to insert a customized subject into a specified region of an existing image. However, these methods struggle to ensure that the subject in the generated image matches both the user-specified identity and the features indicated in the text prompts for that subject, resulting in poor image quality that does not meet user expectations.

Summary of the Invention

[0003] In view of this, the purpose of the present invention is to provide an image generation method, apparatus, and electronic device that take into account both the identity features of the target subject in the generated image and the display features indicated by the text description information, thereby improving the display effect of the generated image and enhancing the user experience.

[0004] In a first aspect, embodiments of the present invention provide an image generation method, which includes: acquiring a reference image, a background image, and text description information, wherein the reference image includes a target subject, the text description information describes target display features corresponding to the target subject, and the target display features are different from the initial display features of the target subject in the reference image; generating a target noise image based on the reference image and the background image; determining a first image feature based on the target noise image and the text description information; and performing denoising processing on the target noise image based on the first image feature, a second image feature, and a third image feature to generate a target image; wherein the second image feature is determined based on a first noise image corresponding to the reference image and the background image; the third image feature is determined based on a second noise image corresponding to the text description information; the target image includes the display content in the background image and the target subject; and the target subject in the target image conforms to the target display features.

[0005] In a second aspect, embodiments of the present invention provide an image generation apparatus, comprising: an image acquisition module for acquiring a reference image, a background image, and text description information, wherein the reference image includes a target subject, the text description information describes target display features corresponding to the target subject, and the target display features are different from the initial display features of the target subject in the reference image; a target noise image generation module for generating a target noise image based on the reference image and the background image; and a denoising processing module for determining a first image feature based on the target noise image and the text description information, and performing denoising processing on the target noise image based on the first image feature, a second image feature, and a third image feature to generate a target image; wherein the second image feature is determined based on a first noise image corresponding to the reference image and the background image; the third image feature is determined based on a second noise image corresponding to the text description information; the target image includes the display content in the background image and the target subject; and the target subject in the target image conforms to the target display features.

[0006] In a third aspect, embodiments of the present invention provide an electronic device, including a processor and a memory, wherein the memory stores machine-executable instructions that can be executed by the processor, and the processor executes the machine-executable instructions to implement the above-described image generation method.

[0007] In a fourth aspect, embodiments of the present invention provide a machine-readable storage medium storing machine-executable instructions. When invoked and executed by a processor, the machine-executable instructions cause the processor to implement the above-described image generation method.

[0008] The embodiments of the present invention bring the following beneficial effects: The aforementioned image generation method, apparatus, and electronic device acquire a reference image, a background image, and text description information; generate a target noise image based on the reference image and the background image; determine a first image feature based on the target noise image and the text description information; and perform denoising processing on the target noise image based on the first image feature, a second image feature, and a third image feature to generate the target image. When the target subject from the reference image is integrated into the scene shown in the background image according to the text description, the denoising of the target noise image simultaneously considers three image features: one determined from the target noise image (generated from both images) together with the text description, one determined from the noise image corresponding to the text description information, and one determined from the noise image formed solely from the two images. The method thereby takes into account both the identity features of the target subject in the generated image and the display features indicated by the text description information, improving the display effect of the generated image and enhancing the user experience.

[0009] Other features and advantages of the invention will be set forth in the description which follows, and will be apparent in part from the description, or may be learned by practicing the invention. The objects and other advantages of the invention are realized and obtained in accordance with the structures particularly pointed out in the description, claims and drawings.

[0010] To make the above-mentioned objects, features and advantages of the present invention more apparent and understandable, preferred embodiments are described below in detail with reference to the accompanying drawings.

Description of the Attached Figures

[0011] To more clearly illustrate the specific embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the specific embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are some embodiments of the present invention. For those skilled in the art, other drawings can be obtained from these drawings without creative effort.

[0012] Figure 1 is a flowchart of an image generation method provided in an embodiment of the present invention;
Figure 2 is a schematic diagram of a reference image and a first mask image provided in an embodiment of the present invention;
Figure 3 is a schematic diagram of a background image and a second mask image provided in an embodiment of the present invention;
Figure 4 is a schematic diagram of a composite image and a third mask image provided in an embodiment of the present invention;
Figure 5 is a schematic diagram of a mask image corresponding to a transition region provided in an embodiment of the present invention;
Figure 6 is a schematic diagram illustrating the execution process of three denoising streams according to an embodiment of the present invention;
Figure 7 is a schematic diagram of a conditional guidance structure mechanism provided in an embodiment of the present invention;
Figure 8 is a schematic diagram illustrating a controllable energy determination process provided in an embodiment of the present invention;
Figure 9 is a schematic diagram of the structure of an image generation apparatus provided in an embodiment of the present invention;
Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of the present invention.

Detailed Implementation

[0013] To make the objectives, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0014] Image customization is a fundamental task in the field of visual content generation, aiming to seamlessly integrate a user-specified subject into a new scene. Significant progress has been made in this area with the development of text-to-image diffusion models. Besides synthesizing new scenes from scratch, a more practical and challenging approach is customized subject teleportation, which involves inserting a customized subject into a specified area of an existing image. This task requires inserting the subject naturally without altering the existing image background, while maintaining consistency in the subject's identity and faithfully adhering to user-defined prompts. This capability has wide applications in image editing, artistic creation, and visual effects rendering.

[0015] Training-based methods replace text embeddings with subject embeddings through training. While effective in preserving subject identity, this severely limits the ability to edit via text prompts. Personalized Content Synthesis (PCS) methods first capture subject identity by fine-tuning a model for specific instances, then insert the subject using an additional editing model. However, this workflow is prone to overfitting and reduces text editability. Training-free multi-stream frameworks have also been constructed to learn and insert subjects using techniques such as inversion and attention blending. However, because the objective of each stream is unclear and the blending strategies are overly simplistic, feature interactions between streams become blurred, leading to loss of subject identity and weaker adherence to text prompts.

[0016] The core challenge of the aforementioned methods lies in balancing fidelity and controllability. Fidelity requires the generated subject to conform to the user-specified identity features, without losing or distorting key attributes; while controllability emphasizes the ability of the generated subject to be flexibly adjusted based on text prompts. However, these two goals often conflict: enhancing controllability usually compromises identity preservation due to aggressive feature manipulation, while increasing fidelity limits the model's ability to perform meaningful edits. This trade-off makes it difficult for existing methods to simultaneously achieve high fidelity and strong controllability within the same framework.

[0017] Based on this, the present invention provides an image generation method, apparatus, and electronic device, which can be applied to image customization scenarios such as image editing, artistic creation, and visual effects rendering.

[0018] Referring to Figure 1, an image generation method provided by an embodiment of the present invention is introduced first. This method includes the following steps: Step S102: Obtain a reference image, a background image, and text description information; the reference image includes the target subject; the text description information describes the target display features corresponding to the target subject; the target display features are different from the initial display features of the target subject in the reference image.

[0019] The target subject can be a person, animal, building, etc., and usually has relatively clear boundaries. Image recognition algorithms can be used to segment the display area of the target subject from other areas in the reference image. For example, an automatic segmenter such as the Segment Anything Model (SAM) can be used to process the reference image and generate a mask image that distinguishes the display area of the target subject from other areas.

[0020] The target subject in the reference image has multiple initial display features, such as the target subject's color, pose, orientation, relative position between different body parts, and relative size in the reference image. In the target image to be generated, one or more display features of the target subject are usually different from its initial display features in the reference image. To distinguish them from the initial display features, these display features are called "target display features".

[0021] The aforementioned textual description information is typically used to describe the display features of the target subject in the target image that differ from its initial display features; these are known as target display features. Similar to the initial display features, the target display features described in the textual description information can be one or more of the following: color, pose, orientation, relative position between different body parts, relative size, etc.

[0022] A target subject usually has multiple initial display features. Apart from the initial display features that the text description indicates will change in the target image, the other initial display features are generally considered unchanged in the target image. For example, when the reference image shows a black puppy standing upright, the text description could be: "Show a white puppy." In this case, the target image needs to show a white puppy standing upright.
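
The rule in [0022] can be illustrated with a minimal sketch, assuming display features are held in a plain dictionary. The dictionary keys and the function name `resolve_display_features` are hypothetical and are not part of the described method; in the method itself the features are implicit in the images and the text prompt.

```python
def resolve_display_features(initial, target_overrides):
    """Return the display features expected in the target image.

    Features named in the text description replace the initial ones;
    every other initial feature is carried over unchanged.
    """
    resolved = dict(initial)           # copy so the input is not mutated
    resolved.update(target_overrides)  # text-described features take priority
    return resolved


# Reference image: a black puppy standing upright.
initial_features = {"color": "black", "pose": "standing upright"}
# Text description: "Show a white puppy."
text_features = {"color": "white"}

target_features = resolve_display_features(initial_features, text_features)
print(target_features)
```

Only the color changes; the pose is inherited from the reference image.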

[0023] The background image described above typically defines the display area of the target subject. This display area can be represented by a position and a size parameter in the background image. The position can be the center of the target subject within the background image, and the size parameter is the display size of the target subject within the background image. Alternatively, four positions in the background image that form a rectangle can be predefined; these four positions define the display area of the target subject within the background image. A mask image can also be used to define the display area of the target subject within the background image. Specific settings can be configured according to requirements and are not limited here.
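
The representations in [0023] are interchangeable. As a rough sketch, assuming a mask is a 2D list of 0/1 values and the display area is given as a center and a size, the conversion to a mask might look as follows (the function name `rect_to_mask` and the (row, column) convention are assumptions made for illustration):

```python
def rect_to_mask(height, width, center, size):
    """Build a binary mask (list of rows) that is 1 inside the display area.

    center: (row, col) of the area's center; size: (area_height, area_width).
    Parts of the rectangle that fall outside the image are simply dropped.
    """
    cy, cx = center
    h, w = size
    top, left = cy - h // 2, cx - w // 2
    bottom, right = top + h, left + w
    return [
        [1 if top <= r < bottom and left <= c < right else 0 for c in range(width)]
        for r in range(height)
    ]


mask = rect_to_mask(height=6, width=8, center=(3, 4), size=(2, 4))
for row in mask:
    print(row)
```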

[0024] Step S104: Generate a target noise image based on the reference image and the background image.

[0025] Typically, the target subject in the reference image needs to be placed within a predefined display area in the background image to create a composite image of the reference and background images.

[0026] After generating the synthetic image, the principle of image diffusion technology can be used to transform the synthetic image into a target image that matches the text description information. In image diffusion technology, the image is first gradually transformed into a noisy image through a forward diffusion process, and then the noisy image is denoised through a reverse diffusion process to finally generate the target image.

[0027] Therefore, it is necessary to generate a target noise image based on the synthesized image. Specifically, Gaussian noise can be added to the image according to a preset noise schedule in each diffusion step. The noise schedule determines the noise intensity at each step, and usually increases with the number of steps or remains constant. Finally, after adding a specified level of noise to the synthesized image or turning the synthesized image into a pure noise image, the target noise image is obtained. The specific settings can be configured according to requirements and are not limited here.
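
A minimal sketch of the step-wise noising described in [0027], assuming a linear schedule and the common update x_t = sqrt(1 - beta_t) * x_{t-1} + sqrt(beta_t) * eps. The schedule endpoints, the function names, and the flattened toy image are assumptions; the source does not fix a particular schedule, and a later embodiment obtains the noise image by inversion instead.

```python
import math
import random


def linear_beta_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Noise intensity per step; non-decreasing with the step index."""
    if num_steps == 1:
        return [beta_start]
    step = (beta_end - beta_start) / (num_steps - 1)
    return [beta_start + i * step for i in range(num_steps)]


def forward_diffuse(pixels, betas, rng):
    """Add Gaussian noise step by step according to the schedule."""
    x = list(pixels)
    for beta in betas:
        keep, noise = math.sqrt(1.0 - beta), math.sqrt(beta)
        x = [keep * v + noise * rng.gauss(0.0, 1.0) for v in x]
    return x


rng = random.Random(0)                            # seeded for repeatability
betas = linear_beta_schedule(num_steps=50)
composite_pixels = [0.0, 0.25, 0.5, 0.75, 1.0]    # flattened toy image
noisy_pixels = forward_diffuse(composite_pixels, betas, rng)
```

Stopping after fewer steps yields a partially noised image; running the full schedule with larger intensities approaches pure noise.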

[0028] Step S106: Determine the first image features based on the target noisy image and text description information; perform denoising processing on the target noisy image based on the first image features, the second image features, and the third image features to generate the target image; wherein, the second image features are determined based on the first noisy image corresponding to the reference image and the background image; the third image features are determined based on the second noisy image corresponding to the text description information; the target image includes the display content in the background image and the target subject; the target subject in the target image conforms to the target display features.

[0029] After obtaining the target noisy image, it needs to be denoised. In a typical denoising process, the encoder extracts multi-level features from the noisy image. These features include the global structure of the image, local details, and clues about the noise distribution. Then, the decoder gradually amplifies the feature map and combines it with low-level detail information from skip connections to predict the noise that should be removed.

[0030] In practical applications, it is usually necessary to map the text description information into a feature vector, extract the image features of the target noisy image through an encoder, and then generate the first image feature based on the extracted image features and the feature vector corresponding to the text description information.

[0031] Similarly, a first noisy image can be generated based on the reference image and the background image. This first noisy image can be the same as the target noisy image or different from it, for example by using a different number of noise-addition steps. Then, an encoder extracts multi-level features from the first noisy image as the second image features.

[0032] Similarly, the random noise image can be used as the second noise image corresponding to the text description information. Then, the encoder extracts the multi-level features of the second noise image and combines them with the feature vector obtained by mapping the text description information to generate the third image features.

[0033] Then, target image features can be generated based on the first image features, the second image features, and the third image features. In one specific embodiment, the weights of the three image features can be preset, and then the target image features are generated based on the weight parameters and the three image features. Then, prediction noise is determined based on the target image features, and denoising processing is performed on the target noisy image based on the prediction noise.
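
The preset-weight embodiment in [0033] can be sketched as an element-wise weighted sum, assuming the three features are equally sized vectors. The weight values below are placeholders chosen for illustration; the source does not specify them.

```python
def blend_features(first, second, third, weights=(0.5, 0.25, 0.25)):
    """Weighted element-wise sum of three equally sized feature vectors."""
    if not (len(first) == len(second) == len(third)):
        raise ValueError("feature vectors must have the same length")
    w1, w2, w3 = weights
    return [w1 * a + w2 * b + w3 * c for a, b, c in zip(first, second, third)]


target_feature = blend_features([1.0, 2.0], [3.0, 4.0], [5.0, 6.0])
print(target_feature)
```

The resulting target feature would then be passed to the noise predictor in place of the first image feature alone.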

[0034] Typically, the denoising process for the target noisy image is performed in multiple steps (also known as "time steps"). Each denoising step is based on the target noisy image after the previous denoising. It is necessary to determine the target image features corresponding to the current processing step and predict the noise, so as to perform denoising on the target noisy image after the previous denoising again.

[0035] In determining the target image features corresponding to the current processing step, it is necessary to determine the first image feature, the second image feature, and the third image feature corresponding to the current step. In a specific embodiment, the denoising steps for the target noisy image, the first noisy image, and the second noisy image can be performed simultaneously to obtain, respectively, the first image feature, the second image feature, and the third image feature for that step.

[0036] The first and second noisy images can be pre-denoised to obtain second and third image features corresponding to each denoising step. Then, in each step of denoising the target noisy image, after determining the first image feature corresponding to the current step based on the current target noisy image and text description information, it is necessary to obtain the second and third image features corresponding to the current step. For example, if the current step is used for the third denoising of the target noisy image, the second image feature obtained during the third denoising of the first noisy image can be used as the second image feature corresponding to the current step, and the third image feature can be obtained similarly.

[0037] In practical applications, target image features can be generated based on the first, second, and third image features only in certain denoising steps, and then used for denoising processing. In other denoising steps, only the first image features can be used for denoising processing. The specific settings can be configured according to requirements, and no restrictions are imposed here.
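
The scheduling described in [0034] to [0037] can be sketched as a loop that looks up the auxiliary features recorded for the current step and blends them in only on selected steps. All callables below are toy stand-ins (a mean as "feature", a halfway move as "denoising") so that the loop can run; a real system would call a diffusion model. The names `guided_steps`, `second_by_step`, and `third_by_step` are assumptions.

```python
def denoise_with_selective_guidance(image, num_steps, extract, blend, denoise,
                                    second_by_step, third_by_step, guided_steps):
    """Run the main denoising loop, blending auxiliary features on guided steps only.

    second_by_step / third_by_step map a time step value to the feature that the
    corresponding auxiliary stream produced at that same step.
    """
    for step in range(num_steps, 0, -1):      # time step counts down
        feature = extract(image)              # first image feature for this step
        if step in guided_steps:
            feature = blend(feature, second_by_step[step], third_by_step[step])
        image = denoise(image, feature)       # input to the next step
    return image


def extract(image):
    return sum(image) / len(image)            # toy feature: mean value


def blend(f1, f2, f3):
    return 0.5 * f1 + 0.25 * f2 + 0.25 * f3   # placeholder weights


def denoise(image, feature):
    return [0.5 * (v + feature) for v in image]  # toy update toward the feature


second = {3: 1.0, 2: 1.0, 1: 1.0}
third = {3: 0.0, 2: 0.0, 1: 0.0}
guided = denoise_with_selective_guidance(
    [4.0, 0.0], 3, extract, blend, denoise, second, third, guided_steps={3, 2})
unguided = denoise_with_selective_guidance(
    [4.0, 0.0], 3, extract, blend, denoise, second, third, guided_steps=set())
```

Keying the cached features by step value works both when the three streams run in lockstep and when the auxiliary streams are denoised in advance.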

[0038] The aforementioned image generation method involves acquiring a reference image, a background image, and text description information; generating a target noisy image based on the reference image and the background image; determining a first image feature based on the target noisy image and the text description information; and denoising the target noisy image based on the first image feature, a second image feature, and a third image feature to generate the target image. When the target subject from the reference image is integrated into the scene shown in the background image according to the text description, the denoising of the target noisy image simultaneously considers three image features: one determined from the target noisy image (generated from both images) together with the text description, one determined from the noise image corresponding to the text description information, and one determined from the noise image formed solely from the two images. The method thereby takes into account both the identity features of the target subject in the generated image and the display features indicated by the text description information, improving the display effect of the generated image and enhancing the user experience.

[0039] The following embodiments provide a specific method for generating a target noise image based on a reference image and a background image.

[0040] In practical applications, a composite image is usually generated first based on a reference image and a background image; the composite image includes the display content of the background image and the target subject.

[0041] The aforementioned reference image typically has a corresponding first mask image. The first mask image indicates the display area of the target subject within the reference image and can also be regarded as a means of locating the target subject. In practical applications, the reference image can also be called the subject image. When the first mask image is generated by an automatic segmenter, it can also be called a segmentation mask. For example, as shown in Figure 2, the left image is a schematic reference image, and the right image is the first mask image corresponding to the reference image.

[0042] The aforementioned background image typically has a corresponding second mask image, which indicates the target region in the background image that corresponds to the target subject. The second mask image is typically user-defined; for example, it can be specified as a binary mask. As shown in Figure 3, the left image is a schematic background image, and the right image is the second mask image corresponding to the background image.

[0043] When generating a composite image, the display content of the target subject is first determined based on the first mask image and the reference image; that is, the first mask image is used to remove the background of the reference image, leaving the foreground subject as the display content of the target subject. Then, based on the second mask image, the display parameters corresponding to the target subject are determined; these typically include the display position and display size of the target subject. Specifically, the display content of the target subject is repositioned and aligned with the target region indicated by the second mask image. After the display parameters are determined, the display content and the background image are composited based on these parameters to obtain a composite image, in which the target subject is located in the image region corresponding to the target region. The composite image can be denoted x_c. When the composite image is obtained, a corresponding third mask image is usually also obtained; the third mask image indicates the display area of the target subject within the composite image. For example, as shown in Figure 4, the left image is a schematic composite image, and the right image is the third mask image corresponding to the composite image.
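
A minimal sketch of the compositing in [0043], assuming images are small 2D lists of grayscale values and masks are 2D lists of 0/1. The subject's bounding box is stretched to the bounding box of the target region by nearest-neighbor sampling, which is a simplification; the source does not specify how the subject is resampled, and the function names are hypothetical.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right) of the nonzero region; bottom/right exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    if not rows:
        raise ValueError("mask is empty")
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def composite(reference, first_mask, background, second_mask):
    """Paste the masked subject into the target region of the background.

    Returns (composite_image, third_mask), where third_mask marks the pixels
    of the composite that show the subject.
    """
    st, sl, sb, sr = bounding_box(first_mask)    # subject box in the reference
    tt, tl, tb, tr = bounding_box(second_mask)   # target box in the background
    out = [row[:] for row in background]         # background stays untouched elsewhere
    third_mask = [[0] * len(row) for row in background]
    for r in range(tt, tb):
        for c in range(tl, tr):
            # Map each target pixel back to a source pixel in the subject box.
            src_r = st + (r - tt) * (sb - st) // (tb - tt)
            src_c = sl + (c - tl) * (sr - sl) // (tr - tl)
            if first_mask[src_r][src_c]:         # copy foreground pixels only
                out[r][c] = reference[src_r][src_c]
                third_mask[r][c] = 1
    return out, third_mask


reference = [[9, 9], [9, 0]]
first_mask = [[1, 1], [1, 0]]
background = [[1] * 4 for _ in range(4)]
second_mask = [[0, 0, 0, 0], [0, 1, 1, 0], [0, 1, 1, 0], [0, 0, 0, 0]]
x_c, third_mask = composite(reference, first_mask, background, second_mask)
```

In this toy case one pixel of the target region is not covered by the subject, so the second and third mask images differ at that pixel.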

[0044] After obtaining the synthesized image, it needs to be inverted to generate the target noise image. In one specific embodiment, given the effectiveness of high-order ordinary differential equation (ODE) solvers in image reconstruction, the inversion result of the synthesized image using DPM-Solver++ can be used as the target noise image and the first noise image.

[0045] The synthesized image is then inverted to obtain the processed synthesized image. Furthermore, to enhance the harmony of the inserted subject, the part of the image region in the composite image that does not display the target subject can be defined as a transition region. Figure 5 shows the mask image corresponding to the transition region determined from the composite image shown in Figure 4; the white areas in Figure 5 represent the transition region in the composite image. The transition region is obtained through a formula based on the XOR operation. The image region corresponding to the transition region in the processed synthesized image is then filled with preset random noise, specifically standard Gaussian noise, using pixel-by-pixel multiplication. The synthesized image after noise filling is taken as the target noise image.

[0046] In one specific embodiment, the second image features can be determined in the following manner.

[0047] First, a first noisy image needs to be generated based on the reference image and the background image. To preserve the subject's identity, the synthesized image after inversion processing can be used directly as the first noisy image.

[0048] Then, the first noisy image can be denoised. The denoising task of the first noisy image can be regarded as an auxiliary task in the process of generating the target image. It can be called a fidelity-focused reconstruction task, which aims to preserve the identity information of the subject while maintaining the consistency of the background.

[0049] During the processing of the first noisy image, the first noisy image can be identified as the first current image, and the first current time step can be determined. Typically, an initial value for the time step is preset, and after each time step is executed, the corresponding value is decremented by one.

[0050] At the first current time step, the second image features corresponding to that time step are determined based on the first current image. In a specific embodiment, image features in the first current image can be extracted using the self-attention module and cross-attention module in a Stable Diffusion Model (SDM). In this method, the second image features mainly refer to the image features of the first current image determined by the self-attention module. Then, the first current image is denoised based on the second image features to obtain the processed first current image. Further, the first current image is updated to the processed first current image, and the first current time step is updated.

[0051] The step of determining the second image features corresponding to the first current time step based on the first current image is then repeated until the first current time step meets a preset condition. The preset condition is typically that the value of the first current time step is 0.
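
The loop in [0049] to [0051] can be sketched as follows, recording the feature used at each time step so that it can be looked up later by step value. The `extract` and `denoise` callables are toy stand-ins; in the described embodiment the features come from the self-attention module of a diffusion model, which is not reproduced here. The function name is hypothetical.

```python
def run_reconstruction_stream(first_noisy_image, initial_step, extract, denoise):
    """Denoise the first noisy image and record the feature used at each time step."""
    if initial_step < 0:
        raise ValueError("initial_step must be non-negative")
    current_image = list(first_noisy_image)   # the "first current image"
    current_step = initial_step               # the "first current time step"
    features_by_step = {}
    while current_step != 0:                  # preset stop condition: step reaches 0
        feature = extract(current_image)
        features_by_step[current_step] = feature
        current_image = denoise(current_image, feature)
        current_step -= 1                     # decrement after each executed step
    return current_image, features_by_step


def extract(image):
    return sum(image) / len(image)            # toy feature: mean value


def denoise(image, feature):
    return [0.5 * (v + feature) for v in image]  # toy update toward the feature


final_image, second_features = run_reconstruction_stream([4.0, 0.0], 3, extract, denoise)
```

The same loop shape would apply to the text-driven stream, with the feature extraction additionally conditioned on the prompt and edge features.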

[0052] In one specific embodiment, the third image feature can be determined in the following manner.

[0053] First, a second noise image corresponding to the text description information needs to be generated based on random noise. In one specific embodiment, standard Gaussian noise can be used as the second noise image corresponding to the text description information.

[0054] The second noisy image can then be denoised. This denoising task can be viewed as an auxiliary task in the process of generating the target image; it can be termed a controllability-focused editing task designed to support modifications to the subject based on textual prompts.

[0055] During the processing of the second noisy image, it can be designated as the second current image, and a second current time step can be determined. Similar to the first current time step, the second current time step also has an initial value, which is usually the same as the initial value of the first current time step. After each time step is executed, the value corresponding to that time step is decremented by one.

[0056] At the second current time step, the third image features corresponding to that time step are determined based on the second current image and edge features. In a specific embodiment, image features in the second current image can be extracted using the self-attention module and cross-attention module in a Stable Diffusion Model (SDM). In this method, the third image features mainly refer to the image features of the second current image determined by the cross-attention module. To ensure that the image generated by denoising the second noisy image preserves the identity features of the subject while retaining sufficient editing control, a ControlNet can be added to the Stable Diffusion Model, which introduces control over the denoising process through edge features determined from the reference image. These edge features can be obtained by processing the reference image with the Canny edge detection algorithm.

[0057] Furthermore, the second current image is denoised based on the third image feature to obtain the processed second current image. After denoising, the second current image is updated to the processed second current image, and the second current time step is updated.

[0058] The step of determining the third image feature corresponding to the second current time step based on the second current image and edge features is then executed repeatedly until the second current time step meets a preset condition. The preset condition is typically that the value corresponding to the second current time step is 0.
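The iteration described in [0055] to [0058] can be sketched as a simple loop. This is a minimal illustration, not the patented implementation: `run_stream` and `step_fn` are hypothetical names, and `step_fn` stands in for one pass of the diffusion model that returns the image feature for the current time step together with the denoised image.

```python
from typing import Callable, Dict, Tuple

import numpy as np

# Hypothetical signature: (current image, time step) -> (feature, denoised image)
StepFn = Callable[[np.ndarray, int], Tuple[object, np.ndarray]]


def run_stream(noise_image: np.ndarray, initial_step: int,
               step_fn: StepFn) -> Tuple[np.ndarray, Dict[int, object]]:
    """Denoise from `initial_step` down to 0, recording one feature per step."""
    current = noise_image
    t = initial_step
    features: Dict[int, object] = {}
    while t > 0:  # preset condition: stop once the time step value reaches 0
        feature, current = step_fn(current, t)
        features[t] = feature  # keyed by time step value for later lookup
        t -= 1                 # the time step is decremented by one
    return current, features
```

Keying the recorded features by time step value makes it straightforward for another denoising process to look up the feature whose time step has the same value.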

[0059] The following embodiments provide a specific method for determining a first image feature based on a target noisy image and text description information, and performing denoising processing on the target noisy image based on the first image feature, a second image feature and a third image feature to generate a target image.

[0060] The process of denoising the target noisy image is similar to the denoising processes of the first and second noisy images described above. The denoising process of the target noisy image is the main task in the process of generating the target image, and can be referred to as a customized task.

[0061] First, the target noisy image is designated as the third current image, and the third current time step is determined. Similarly, the third current time step also has an initial value, which is usually the same as the initial values of the first and second current time steps. After each time step is executed, the value corresponding to that time step is decremented by one.

[0062] Obtain the second and third image features corresponding to the third current time step. When the first noisy image, the second noisy image, and the target noisy image are denoised simultaneously, they can usually share the same time step, so the second and third image features corresponding to the current time step can be directly obtained. When the first noisy image, the second noisy image, and the target noisy image are not denoised simultaneously, the second image feature corresponding to the first current time step with the same value as the third current time step can be determined as the second image feature corresponding to the third current time step. Similarly, the third image feature corresponding to the second current time step with the same value as the third current time step can be determined as the third image feature corresponding to the third current time step.

[0063] Then, based on the third current image and text description information, the first image features corresponding to the third current time step can be determined, and the third current image can be denoised based on the first image features, the second image features and the third image features to obtain the denoised third current image.

[0064] The aforementioned first image feature typically includes a first sub-feature and a second sub-feature. The first sub-feature is generated based on a self-attention mechanism; the second sub-feature is generated based on a cross-attention mechanism. For example, when determining the first image feature using SDM, the first sub-feature can be generated by the self-attention module in SDM, and the second sub-feature can be generated by the cross-attention module in SDM. The second image feature is generated based on a self-attention mechanism, specifically by the self-attention module of SDM; the third image feature is generated based on a cross-attention mechanism, specifically by the cross-attention module of SDM.

[0065] In one specific embodiment, a first target feature can be determined based on the first sub-feature, the second image feature, and a preset first weight. For example, the first weight can assign 0.5 to the first sub-feature and 0.5 to the second image feature. A second target feature is then determined based on the second sub-feature, the third image feature, and a preset second weight; for example, the second weight can assign 0.5 to the second sub-feature and 0.5 to the third image feature. Furthermore, denoising processing can be performed on the third current image based on the first target feature and the second target feature.

[0066] The early generation phase of the diffusion model focuses on building the global structure, while the later generation phase emphasizes refining the appearance details. To balance global structure and detailed features, the range of the first time step and the range of the second time step can be preset.

[0067] In practical applications, it is necessary to determine whether the third current time step is within the preset first time step range. If so, the first target feature is determined based on the first sub-feature, the second image feature, and the preset first weight. If not, the first sub-feature is determined as the first target feature. The first time step range usually corresponds to the early and intermediate stages of the denoising process.

[0068] In practical applications, it is necessary to determine whether the third current time step falls within the preset second time step range. If so, the second target feature is determined based on the second sub-feature, the third image feature, and the preset second weight; otherwise, the second sub-feature is determined as the second target feature. The second time step range typically corresponds to an intermediate stage in the denoising process and overlaps with the first time step range.
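The time-gated fusion in [0065] to [0068] can be sketched as follows. This is a sketch under stated assumptions: the blend is taken to be a convex combination (`weight` and `1 - weight`), the function names are hypothetical, and the default ranges assume 50 time steps counted down to 0, so that larger values correspond to earlier stages.

```python
import numpy as np


def fuse(sub_feature, injected_feature, weight, t, step_range):
    """Blend the two features inside `step_range` (inclusive), else pass through."""
    low, high = step_range
    if low <= t <= high:
        return weight * sub_feature + (1.0 - weight) * injected_feature
    return sub_feature


def target_features(t, f_self, f_cross, f_rec, f_edit,
                    w1=0.5, w2=0.5, range1=(20, 50), range2=(20, 35)):
    """Return (first target feature, second target feature) for time step t.

    f_self / f_cross: first and second sub-features of the third current image.
    f_rec: second image feature (self-attention, from the first noisy image).
    f_edit: third image feature (cross-attention, from the second noisy image).
    range1 covers the early and intermediate stages; range2 only the
    intermediate stage, overlapping with range1 (illustrative values).
    """
    first_target = fuse(f_self, f_rec, w1, t, range1)
    second_target = fuse(f_cross, f_edit, w2, t, range2)
    return first_target, second_target
```

With these illustrative ranges, only the self-attention feature is blended at a time step of 45, both features are blended at 30, and neither is blended at 10.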

[0069] After denoising is completed, an updated third current image is determined based on the denoised third current image, and the third current time step is updated.

[0070] The step of determining the first image features corresponding to the third current time step based on the third current image and text description information is then executed repeatedly until the third current time step meets the preset condition, and the denoised third current image is then determined as the target image. The preset condition is typically that the value corresponding to the third current time step is 0.

[0071] In practical applications, classifier-free guidance (CFG) can often be used to achieve a better balance between generation quality and diversity. Correspondingly, unconditional guidance, text guidance, and image guidance can be combined to achieve the fusion of different conditions and improve the quality of the generated target image. Different guidance processes correspond to different denoising processes. At each time step, multiple different denoised images are generated, and the current image to be used in the next time step is determined based on these multiple denoised images.

[0072] In one specific embodiment, the denoised third current image includes a first denoised image, a second denoised image, and a third denoised image. The first image features include a fourth image feature for generating the first denoised image, a fifth image feature for generating the second denoised image, and a sixth image feature for generating the third denoised image.

[0073] At each time step, a fourth image feature corresponding to the third current time step is determined based on the third current image. Based on the first, second, and fourth image features, the third current image is denoised to obtain a first denoised image; this process corresponds to unconditional guidance. A fifth image feature corresponding to the third current time step also needs to be determined based on the third current image and text description information. Based on the first, second, and fifth image features, the third current image is denoised to obtain a second denoised image; this process corresponds to text guidance. Simultaneously, a sixth image feature corresponding to the current time step needs to be determined based on the third current image and a reference image. Based on the first, second, and sixth image features, the third current image is denoised to obtain a third denoised image; this process corresponds to image guidance.

[0074] In one specific embodiment, edge features can be determined based on a reference image, and the third current image and edge features are input into a diffusion model. The diffusion model then outputs the sixth image feature corresponding to the third current time step based on the third current image and edge features. The diffusion model used in this process includes a ControlNet to introduce control over image features by the edge features.

[0075] In determining the updated third current image based on the denoised third current image, a target denoised image can be generated based on the first denoised image, the second denoised image, the third denoised image, and preset weight parameters, and this target denoised image is then determined as the updated third current image. The aforementioned weight parameters may include weight parameters set for the second and third denoised images, and the weight parameters for the first denoised image can be calculated from the weights corresponding to other denoised images. The weight parameters can be determined based on experience or model training.
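The weighted combination in [0075] can be illustrated with a short sketch. It assumes one plausible reading of the text, namely that the three weights sum to one, so that the weight of the first (unconditional) denoised image follows from the other two; the function and parameter names are hypothetical.

```python
import numpy as np


def combine_denoised(first, second, third, w_second, w_third):
    """Combine the three denoised images into the target denoised image.

    first:  result of unconditional guidance
    second: result of text guidance
    third:  result of image guidance
    The weight of `first` is derived from the other two (assumption:
    the three weights sum to one).
    """
    w_first = 1.0 - w_second - w_third
    return w_first * first + w_second * second + w_third * third
```

Under this assumption the result equals `first + w_second * (second - first) + w_third * (third - first)`, which is the familiar guidance form in which each condition pushes the result away from the unconditional output.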

[0076] To enhance the ability to manipulate finer details, during the process of determining the updated third current image at each time step, the similarity between the image features generated by the denoising process based on text description information and the image features generated by the denoising process of the target noisy image can be considered to further guide the image denoising process.

[0077] In one specific embodiment, at each third current time step, a second current image corresponding to the third current time step can be obtained; the second current image is generated during the denoising process of the second noisy image. Further, a first intermediate feature image corresponding to the second current image and a second intermediate feature image corresponding to the third current image are determined. The first intermediate feature image can be obtained by processing the second current image using a feature extraction network, which can be a U-net model. Specifically, the first intermediate feature image can be formed using the image features output from the second and third layers of a U-net model including four upsampling layers. The formation process of the second intermediate feature image is similar and will not be elaborated here.

[0078] Then, based on the first intermediate feature image and the second intermediate feature image, the update gradient parameters corresponding to the third current image can be determined. Specifically, the first intermediate feature image can be processed using the first mask image to determine the first target region in the first intermediate feature image, which corresponds to the display area of the target subject in the reference image. The second intermediate feature image can then be processed using the third mask image to determine the second target region in the second intermediate feature image; the second target region corresponds to the display area of the target subject in the target noise image. Further, the similarity parameters between the first and second target regions are determined, and based on the similarity parameters, the update gradient parameters corresponding to the third current image are determined.

[0079] In calculating the similarity parameter between two target regions, for each first feature point in the second target region, the feature similarity between that first feature point and each second feature point in the first target region is determined, and the second feature point with the highest feature similarity to the first feature point is identified as the corresponding second feature point. The feature similarity can be the cosine similarity between the feature vectors corresponding to the first and second feature points, or it can be vector distance, etc., without restriction. Then, the feature distance between the first feature point and the corresponding second feature point is calculated, and the sum of the feature distances corresponding to each first feature point in the second target region is determined as the similarity parameter between the first and second target regions.
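The matching procedure in [0079] can be sketched with NumPy. This sketch assumes cosine similarity for matching and Euclidean distance as the feature distance (the text leaves both open); `similarity_parameter` and its argument names are hypothetical.

```python
import numpy as np


def similarity_parameter(second_region_feats, first_region_feats):
    """Sum of feature distances between matched points of the two regions.

    second_region_feats: (N, C) feature vectors of the first feature points
                         (second target region).
    first_region_feats:  (M, C) feature vectors of the second feature points
                         (first target region).
    Returns (similarity parameter, index of the matched point for each row).
    """
    eps = 1e-8  # guards against division by zero for all-zero features
    a = second_region_feats / (
        np.linalg.norm(second_region_feats, axis=1, keepdims=True) + eps)
    b = first_region_feats / (
        np.linalg.norm(first_region_feats, axis=1, keepdims=True) + eps)
    cosine = a @ b.T                 # (N, M) pairwise cosine similarity
    match = cosine.argmax(axis=1)    # most similar point for each first feature point
    distances = np.linalg.norm(
        second_region_feats - first_region_feats[match], axis=1)
    return float(distances.sum()), match
```

Matching uses normalized features, while the distance is taken on the raw features, so two points that agree in direction but differ in magnitude still contribute to the similarity parameter.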

[0080] After determining the similarity parameters, for the second target region in the third current image, the first gradient value corresponding to the similarity parameters is calculated, and this first gradient value is used as the update gradient parameter corresponding to the second target region. To prevent artifacts from being generated around the display area of the target subject, for the transition region in the third current image, the product of the similarity parameters and the Gaussian smoothing kernel is calculated, and the second gradient value corresponding to the product is used as the update gradient parameter corresponding to the transition region in the third current image.
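One possible reading of the region-wise gradient in [0080] is sketched below. It is an assumption, not the confirmed method: the raw gradient is kept inside the subject region, and a Gaussian-smoothed copy of that gradient is used in the transition region so that the update fades out instead of stopping abruptly. All function names, the kernel size, and `sigma` are illustrative.

```python
import numpy as np


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1-D Gaussian kernel of length 2 * radius + 1."""
    x = np.arange(-radius, radius + 1, dtype=float)
    k = np.exp(-(x ** 2) / (2.0 * sigma ** 2))
    return k / k.sum()


def gaussian_blur(img, sigma=1.0, radius=2):
    """Separable Gaussian blur with edge padding; output has the input shape."""
    k = gaussian_kernel_1d(sigma, radius)
    padded = np.pad(img, radius, mode="edge")
    rows = np.apply_along_axis(
        lambda r: np.convolve(r, k, mode="valid"), 1, padded)
    return np.apply_along_axis(
        lambda c: np.convolve(c, k, mode="valid"), 0, rows)


def update_gradient(grad, subject_mask, transition_mask, sigma=1.0, radius=2):
    """Raw gradient in the subject region, smoothed gradient in the transition."""
    inside = grad * subject_mask
    smoothed = gaussian_blur(inside, sigma, radius)
    return inside + smoothed * transition_mask
```

Pixels outside both masks receive a zero update, which keeps the background untouched in this sketch.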

[0081] Further, a target denoised image is generated based on the first denoised image, the second denoised image, the third denoised image, the updated gradient parameters, and preset weight parameters. Here, the weight parameters can include the weights corresponding to the second denoised image, the third denoised image, and the updated gradient parameters. The weight of the first denoised image can be calculated based on the weights corresponding to the other denoised images and the updated gradient parameters.

[0082] In one specific embodiment, corresponding denoising streams can be set for reconstruction tasks, editing tasks, and customization tasks respectively, and the target image can be generated by the three denoising streams running simultaneously.

[0083] The three denoising streams are the reconstruction stream, the editing stream, and the customized stream. The reconstruction stream and the editing stream handle the reconstruction task and the editing task, respectively, while the customized stream fuses the auxiliary information from the other two streams to generate a final customized image that combines high fidelity and high controllability.

[0084] Specifically, the noise image (also known as the initial noise or latent representation) corresponding to each denoising stream can be determined using the method described above. Furthermore, these noise images and the corresponding text prompts can be fed as input into the framework shown in Figure 6.

[0085] In this framework, the customized stream aims to combine the fidelity advantages of the reconstruction stream with the controllability advantages of the editing stream. The attention features in the U-Net structure of the diffusion model are plug-and-play, allowing seamless integration into different parallel denoising streams to achieve information transfer. The attention features can be formally represented as:

$$f_{*} = \mathrm{softmax}\left(\frac{Q_{*} K_{*}^{\top}}{\sqrt{d}}\right) V_{*}, \quad * \in \{s, c\}$$

[0086] Here, $Q$, $K$, and $V$ represent the query, key, and value features, respectively; the subscripts $s$ and $c$ denote the self-attention layer and the cross-attention layer, respectively; and $d$ denotes the feature dimension.
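As a concrete illustration of the attention features referred to above, the following is a minimal NumPy sketch of standard scaled dot-product attention. It is not taken from the application; in a self-attention layer the query, key, and value would all come from image tokens, whereas in a cross-attention layer the key and value would come from text tokens.

```python
import numpy as np


def attention(q, k, v):
    """Scaled dot-product attention.

    q: (N, d) queries, k: (M, d) keys, v: (M, d_v) values.
    Returns an (N, d_v) array of attention features.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)        # row-wise softmax
    return weights @ v
```

Because the output is just an array with one row per query, features from one denoising stream can be mixed into another stream at the same layer without changing the model weights, which is what makes them convenient to share.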

[0087] Fusing the self-attention features of multiple parallel streams can integrate information from different streams. However, the information from the streams may conflict, leading to deviations in the controllability and fidelity of the generated results.

[0088] To address this information fusion conflict, an attention-decoupled fusion mechanism is proposed. Specifically, in the customized stream, different fusion modules are designed for the reconstruction stream and the editing stream: the reconstruction stream injects identity information through self-attention features, while the editing stream injects text editing information through cross-attention features. This decoupling of fusion positions effectively alleviates information conflicts. At each time step, the customized stream produces the output features of its self-attention block and its cross-attention block; in addition, the reconstruction stream provides the output of its self-attention block, and the editing stream provides the output of its cross-attention block. The blending mechanism is defined as follows:
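One form of the blending mechanism that is consistent with the description in [0065] to [0068] and [0090] could be written as below. All symbols are assumptions introduced for illustration: $f_s^{cus}$ and $f_c^{cus}$ are the self-attention and cross-attention outputs of the customized stream, $f_s^{rec}$ is the self-attention output of the reconstruction stream, $f_c^{edit}$ is the cross-attention output of the editing stream, $\lambda_s$ and $\lambda_c$ are blending weights, and $\tau_1 > \tau_2$ are time step thresholds, with the time step $t$ counting down to 0.

```latex
\hat{f}_s =
\begin{cases}
\lambda_s f_s^{cus} + (1 - \lambda_s)\, f_s^{rec}, & t > \tau_2 \\
f_s^{cus}, & t \le \tau_2
\end{cases}
\qquad
\hat{f}_c =
\begin{cases}
\lambda_c f_c^{cus} + (1 - \lambda_c)\, f_c^{edit}, & \tau_2 < t \le \tau_1 \\
f_c^{cus}, & \text{otherwise}
\end{cases}
```

Under this reading, self-attention injection covers the early and intermediate stages, cross-attention injection covers only the intermediate stage, and no injection takes place in the late stage.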

[0089] Here, the two time step thresholds are used to control the timing of injecting the self-attention output of the reconstruction stream and the cross-attention output of the editing stream.

[0090] The early generation phase of the diffusion model focuses on building the global structure, while the later generation phase emphasizes refining appearance details. A temporal control scheme across time steps is proposed to balance global structure and detailed features. In the early stage, only the self-attention output of the reconstruction stream is used, in order to establish the overall layout. In the intermediate stage, the outputs of the reconstruction stream and the editing stream are injected together for joint optimization, solidifying the layout while also enhancing text-guided foreground editing. In the later stage, the fusion operation is stopped, and the regular prompt is used instead to generate the final image by leveraging the model's prior knowledge.

[0091] Although the task decoupling architecture integrates auxiliary information into the image customization task through auxiliary tasks, it still faces problems such as inconsistent subject identity and text guidance failure in certain scenarios. These challenges stem from the indirectness of attention feature fusion. Therefore, a more direct and flexible approach is needed to simultaneously improve fidelity and controllability.

[0092] It is worth noting that classifier-free guidance (CFG) is often used to achieve a better balance between generation quality and diversity. This method uses a hyperparameter to decouple conditional guidance from unconditional guidance:

[0093] Inspired by this, Condition-Decoupled Guidance is proposed, which separates the conditional guidance in CFG into two parts: text guidance focused on controllability and image guidance focused on fidelity. Text guidance is responsible for text-driven subject editing, while image guidance ensures that the subject's identity remains consistent with the reference image. For example, when customizing the Corgi in the reference image based on the text prompt "a photo of a black Corgi," image guidance ensures that the Corgi's identity information matches the reference image, and text guidance ensures that the Corgi's fur is black. After this decoupling, text guidance can be predicted using the prior knowledge of the diffusion model, and image guidance can be flexibly implemented based on existing image-to-image generation techniques. The text condition and the image condition each have a guidance strength controlled by a user-defined hyperparameter. As shown in Figure 10, different U-Nets are used to predict each guidance term: unconditional, text-conditioned, and image-conditioned. At each time step, the guidance process is as follows:
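For orientation, the two guidance rules discussed in [0092] and [0093] might be written as follows. The first line is the commonly used form of classifier-free guidance; the second is an assumed extension in which the single condition is split into a text condition and an image condition. The symbols ($z_t$ for the current latent, $c$, $c_T$, $c_I$ for the conditions, $w$, $w_T$, $w_I$ for the guidance strengths, and the subscripts on $\epsilon$ for the three predictors) are illustrative and not taken from the application.

```latex
\hat{\epsilon}_{\mathrm{CFG}} = \epsilon_\theta(z_t, \varnothing)
  + w \left( \epsilon_\theta(z_t, c) - \epsilon_\theta(z_t, \varnothing) \right)

\hat{\epsilon} = \epsilon_{\varnothing}(z_t)
  + w_T \left( \epsilon_{T}(z_t, c_T) - \epsilon_{\varnothing}(z_t) \right)
  + w_I \left( \epsilon_{I}(z_t, c_I) - \epsilon_{\varnothing}(z_t) \right)
```

In the second form, setting $w_I = 0$ recovers ordinary text-only classifier-free guidance, which is one way to see that the two strengths act independently.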

[0094] The guidance strengths of the text and the image are controlled independently by two hyperparameters.

[0095] Within this framework, the unconditional prediction and the text-conditioned prediction are both derived from the pre-trained diffusion model, while the image-conditioned prediction is derived from a diffusion model that integrates a ControlNet with Canny edge map control, as shown in Figure 7. During the denoising process, each component independently guides the denoising direction to maximize the likelihood under its own condition, thus providing direct and flexible guidance for fidelity and controllability.

[0096] As mentioned earlier, fidelity and controllability are decoupled at both the task and condition levels, effectively mitigating their inherent conflict. However, in both cases, controllability depends on the text, and due to the modal differences between textual description and visual presentation, this dependence limits the ability to manipulate finer details. To address this issue, Controllability Energy is introduced, which enhances the ability to edit details in the later stages of generation, providing a more granular improvement in controllability.

[0097] It is noted that the editing stream performs excellently in terms of text control and detail rendering. Therefore, an optimization path is established between the editing stream and the customized stream through a controllability energy guidance mechanism, as shown in Figure 8. At each time step, the U-Net in the diffusion model is reused to extract intermediate features from the latent representation of the customized stream and, similarly, from the latent representation of the editing stream. These intermediate features provide high-level semantic alignment, enabling precise point-to-point correspondence. A point-to-point feature alignment method is established, which aligns the two sets of intermediate features by maximizing the similarity between corresponding points. As shown in Figure 8, two binary masks are used to constrain the two sets of intermediate features to the subject region and to eliminate background interference. For the index of each point in the masked features of the customized stream, the corresponding point with the highest cosine similarity is found in the masked features of the editing stream:

[0098] Here, the two feature vectors are those at the point in the customized stream and at the candidate point in the editing stream, and the similarity measure is the cosine similarity. After the corresponding point has been found for each point, the controllability energy is defined by the feature distance between corresponding points:
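A sketch of the correspondence and energy definitions, consistent with the description in [0079], could read as follows. The notation is assumed for illustration: $F^{cus}$ and $F^{edit}$ are the intermediate features of the customized and editing streams, $M^{cus}$ and $M^{edit}$ are the sets of points selected by the two binary masks, and the feature distance is taken to be the Euclidean norm.

```latex
q^{*}(p) = \arg\max_{q \in M^{edit}} \cos\!\left( F^{cus}_{p},\, F^{edit}_{q} \right),
\qquad p \in M^{cus}

E = \sum_{p \in M^{cus}} \left\lVert F^{cus}_{p} - F^{edit}_{q^{*}(p)} \right\rVert_2
```

A lower value of $E$ means the subject region of the customized stream is closer, point by point, to what the editing stream has rendered.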

[0099] Restricting gradient updates to specific regions can reduce editing flexibility and introduce artifacts in transition areas. To mitigate this issue, gradient smoothing operations are introduced in transition regions to achieve more natural boundary transitions. The modified gradient is represented as:

[0100] Here, the kernel is a Gaussian smoothing kernel. Finally, controllability energy guidance can be combined with the condition-decoupled guidance mechanism:

[0101] Here, the additional hyperparameter represents the strength of the controllability energy guidance.

[0102] This method has the following advantages: 1. Excellent balance between high fidelity and strong controllability: Through the dual decoupling design at the task level and the condition level, the inherent conflict between fidelity and controllability is fundamentally alleviated, so that the generated image can accurately maintain the identity characteristics of the reference subject and flexibly respond to the semantic changes of the text prompts.

[0103] 2. High efficiency without training: The entire solution does not require any fine-tuning or training of the pre-trained diffusion model. It directly utilizes the existing model, which significantly reduces computational costs and time overhead, and achieves efficient plug-and-play image customization.

[0104] 3. Flexible and controllable user experience: The attention decoupling and fusion mechanism, the conditional decoupling guidance mechanism, and the controllability energy guidance mechanism provide intuitive hyperparameters, allowing users to dynamically adjust the emphasis on fidelity and controllability, as well as the intensity of detail control, according to specific needs.

[0105] 4. Excellent generation quality: The attention decoupling fusion mechanism avoids feature conflicts, and the controllable energy guidance mechanism enhances detail alignment, resulting in natural foreground and background fusion, fewer visual artifacts, and high overall visual quality in the generated image.

[0106] 5. Wide applicability: This method can be widely applied to various scenarios that require high-precision image customization, such as subject transmission, virtual try-on, product display, artistic creation, and image editing.

[0107] 6. Generalizability and resource friendliness of the base model: It can achieve excellent results on different base models and can be deployed on consumer-grade GPUs, which has high practical value and commercial potential.

[0108] Corresponding to the above method embodiments, Figure 9 shows an image generation apparatus, the apparatus comprising: an image acquisition module 902, configured to acquire a reference image, a background image, and text description information, wherein the reference image includes the target subject, the text description information is used to describe the target display features corresponding to the target subject, and the target display features are different from the initial display features of the target subject in the reference image; a target noise image generation module 904, configured to generate a target noise image based on the reference image and the background image; and a denoising module 906, configured to determine a first image feature based on the target noise image and the text description information, and to perform denoising processing on the target noise image based on the first image feature, a second image feature, and a third image feature to generate a target image; wherein the second image feature is determined based on the first noise image corresponding to the reference image and the background image; the third image feature is determined based on the second noise image corresponding to the text description information; the target image includes the display content in the background image and the target subject; and the target subject in the target image conforms to the target display features.

[0109] The aforementioned image generation apparatus acquires a reference image, a background image, and text description information; generates a target noise image based on the reference image and the background image; determines first image features based on the target noise image and the text description information; and performs denoising processing on the target noise image based on the first image features, second image features, and third image features to generate the target image. This method, in the process of integrating the target subject from the reference image into the background image display based on the text description, simultaneously considers the image features determined by the target noise image generated from both images and the text description, the image features determined by the noise image corresponding to the text description information, and the image features determined by the noise image formed solely from the two images. By performing denoising processing on the target noise image, it takes into account both the identity features of the target subject in the generated image and the display features indicated by the text description information, improving the display effect of the generated image and enhancing the user experience.

[0110] The aforementioned target noise image generation module is also used to: generate a synthetic image based on a reference image and a background image; the synthetic image includes the display content of the background image and the target subject; and perform inversion processing on the synthetic image to generate a target noise image.

[0111] The aforementioned reference image has a corresponding first mask image; the first mask image is used to indicate the display area of the target subject in the reference image; the background image has a corresponding second mask image; the second mask image is used to indicate the target area in the background image corresponding to the target subject; the aforementioned target noise image generation module is further used to: determine the display content of the target subject based on the first mask image and the reference image; determine the display parameters corresponding to the target subject based on the second mask image; and perform composite processing on the display content and the background image based on the display parameters to obtain a composite image; in the composite image, the target subject is located in the image area corresponding to the target area.

[0112] The aforementioned apparatus further includes a first image feature determination module, configured to: generate a first noisy image based on a reference image and a background image; determine the first noisy image as a first current image and determine a first current time step; determine a second image feature corresponding to the first current time step based on the first current image; perform denoising processing on the current image based on the second image feature to obtain a processed current image; update the first current image to the processed first current image and update the first current time step; continue executing the step of determining the second image feature corresponding to the first current time step based on the first current image until the first current time step meets a preset condition.

[0113] The aforementioned target noise image is generated based on a composite image synthesized from a reference image and a background image; the background image has a corresponding second mask image; the second mask image is used to indicate: the target region in the background image corresponding to the target subject; in the composite image, the target subject is located in the image region corresponding to the target region; the aforementioned first image feature determination module is further used to: determine the area in the image region of the composite image that does not display the target subject as a transition region; fill the image region in the target noise image corresponding to the transition region in the composite image with preset random noise, and determine the noise-filled target noise image as the first noise image.

[0114] The aforementioned apparatus further includes a third image feature determination module, configured to: generate a second noise image corresponding to the text description information based on random noise; determine the second noise image as the second current image and determine the second current time step; determine the third image feature corresponding to the second current time step based on the second current image and edge features; perform denoising processing on the second current image based on the third image feature to obtain the processed current image; determine the edge features based on a reference image; update the second current image to the processed second current image and update the second current time step; continue to execute the step of determining the third image feature corresponding to the second current time step based on the second current image and edge features until the second current time step meets a preset condition.

[0115] The aforementioned denoising module is further configured to: determine the target noisy image as the third current image and determine the third current time step; acquire the second image feature and the third image feature corresponding to the third current time step; determine the first image feature corresponding to the third current time step based on the third current image and text description information; perform denoising processing on the third current image based on the first image feature, the second image feature and the third image feature to obtain the denoised third current image; determine the updated third current image based on the denoised third current image and update the third current time step; continue to execute the step of determining the first image feature corresponding to the third current time step based on the third current image and text description information until the third current time step meets the preset conditions, and determine the denoised third current image as the target image.

[0116] The aforementioned first image feature includes a first sub-feature and a second sub-feature; the first sub-feature is generated based on a self-attention mechanism; the second sub-feature is generated based on a cross-attention mechanism; the second image feature is generated based on a self-attention mechanism; the third image feature is generated based on a cross-attention mechanism; the aforementioned denoising module is further configured to: determine a first target feature based on the first sub-feature, the second image feature, and a preset first weight; determine a second target feature based on the second sub-feature, the third image feature, and a preset second weight; and perform denoising processing on the current image based on the first target feature and the second target feature.

[0117] The aforementioned denoising module is also used to: determine whether the third current time step is within the preset first time step range; if so, determine the first target feature based on the first sub-feature, the second image feature and the preset first weight; if not, determine the first sub-feature as the first target feature.

[0118] The aforementioned denoising module is also used to: determine whether the third current time step is within the preset second time step range; if so, determine the second target feature based on the second sub-feature, the third image feature and the preset second weight; if not, determine the second sub-feature as the second target feature.
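
The time-step gating of the first and second target features can be sketched as follows. This is a minimal illustration only: the text states that a target feature is determined from a sub-feature, a guiding image feature, and a preset weight, but does not give the formula, so the convex combination below, the function name `blend_feature`, and the inclusive range check are all assumptions.

```python
import numpy as np

def blend_feature(sub_feature, guide_feature, weight, t, t_range):
    """Return the target feature for time step t.

    Inside t_range (inclusive) the sub-feature is mixed with the guiding
    feature; outside it the sub-feature is passed through unchanged.
    """
    lo, hi = t_range
    if lo <= t <= hi:
        # Assumed blending rule: convex combination controlled by `weight`.
        return (1.0 - weight) * sub_feature + weight * guide_feature
    return sub_feature

# Toy stand-ins for the first sub-feature and the second image feature.
first_sub = np.ones((2, 4))
second_feat = np.zeros((2, 4))

inside = blend_feature(first_sub, second_feat, weight=0.25, t=30, t_range=(20, 50))
outside = blend_feature(first_sub, second_feat, weight=0.25, t=10, t_range=(20, 50))
```

The same helper would apply to the second target feature by passing the second sub-feature, the third image feature, the second weight, and the second time step range.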

[0119] The third current image after the above denoising process includes a first denoised image, a second denoised image, and a third denoised image; the first image feature includes a fourth image feature, a fifth image feature, and a sixth image feature; the above denoising processing module is further configured to: determine the fourth image feature corresponding to the third current time step based on the third current image; perform denoising processing on the third current image based on the first image feature, the second image feature, and the fourth image feature to obtain a first denoised image; determine the fifth image feature corresponding to the third current time step based on the third current image and text description information; perform denoising processing on the third current image based on the first image feature, the second image feature, and the fifth image feature to obtain a second denoised image; determine the sixth image feature corresponding to the current time step based on the third current image and a reference image; perform denoising processing on the third current image based on the first image feature, the second image feature, and the sixth image feature to obtain a third denoised image.

[0120] The aforementioned denoising module is also used to: input the third current image and edge features into the diffusion model, and output the sixth image features corresponding to the third current time step based on the third current image and edge features through the diffusion model; the diffusion model includes a control network; the edge features are determined based on the reference image.
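
The text states only that the edge features are determined based on the reference image and does not name an edge detector. As one possible illustration, the sketch below computes a Sobel gradient magnitude on a grayscale image; the choice of Sobel, the function name `sobel_edges`, and the edge-replicating padding are assumptions, and the control network that consumes the edge features is not modelled here.

```python
import numpy as np

def sobel_edges(gray):
    """Return the Sobel gradient magnitude of a 2-D grayscale image."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = gray.shape
    # Replicate border pixels so the output has the same size as the input.
    padded = np.pad(gray.astype(float), 1, mode="edge")
    gx = np.zeros((h, w), dtype=float)
    gy = np.zeros((h, w), dtype=float)
    for i in range(3):
        for j in range(3):
            window = padded[i:i + h, j:j + w]
            gx += kx[i, j] * window
            gy += ky[i, j] * window
    return np.hypot(gx, gy)

# Toy reference image: a vertical step from 0 to 1 between columns 2 and 3.
reference_gray = np.zeros((4, 6))
reference_gray[:, 3:] = 1.0
edges = sobel_edges(reference_gray)
```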

[0121] The aforementioned denoising module is also used to: generate a target denoised image based on the first denoised image, the second denoised image, the third denoised image and preset weight parameters, and determine the target denoised image as the updated third current image.
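
One way to combine the three denoised images with preset weight parameters is a guidance-style sum, sketched below. The text does not specify the combination rule, so this formula, the function name `combine_denoised`, and the weight names `w_text` and `w_ref` are assumptions made for illustration.

```python
import numpy as np

def combine_denoised(d_first, d_second, d_third, w_text, w_ref):
    """Combine three denoised images into one target denoised image.

    d_first:  denoised from the current image alone
    d_second: denoised with the text description
    d_third:  denoised with the reference-derived features
    """
    # Assumed rule: start from the unconditioned result and push it toward
    # the two conditioned results by their respective weights.
    return d_first + w_text * (d_second - d_first) + w_ref * (d_third - d_first)

d1 = np.zeros((2, 2))
d2 = np.ones((2, 2))
d3 = 2.0 * np.ones((2, 2))
target_denoised = combine_denoised(d1, d2, d3, w_text=2.0, w_ref=0.5)
```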

[0122] The aforementioned denoising module is further configured to: acquire the second current image corresponding to the third current time step; generate the second current image based on the second noisy image; determine the first intermediate feature image corresponding to the second current image and the second intermediate feature image corresponding to the third current image; determine the update gradient parameter corresponding to the third current image based on the first intermediate feature image and the second intermediate feature image; and generate the target denoised image based on the first denoised image, the second denoised image, the third denoised image, the update gradient parameter, and the preset weight parameter.

[0123] The aforementioned reference image has a corresponding first mask image; the first mask image is used to indicate the display area of the target subject in the reference image; the target noise image has a corresponding third mask image; the third mask image is used to indicate the display area of the target subject in the target noise image; the aforementioned denoising module is further used to: process the first intermediate feature image through the first mask image to determine the first target region in the first intermediate feature image; the first target region corresponds to the display area of the target subject in the reference image; process the second intermediate feature image through the third mask image to determine the second target region in the second intermediate feature image; the second target region corresponds to the display area of the target subject in the target noise image; determine the similarity parameter between the first target region and the second target region; and determine the update gradient parameter corresponding to the third current image based on the similarity parameter.

[0124] The aforementioned denoising module is further configured to: for each first feature point in the second target region, determine the feature similarity between the first feature point and each second feature point in the first target region; determine the second feature point with the highest feature similarity to the first feature point as the second feature point corresponding to the first feature point; calculate the feature distance between the first feature point and the corresponding second feature point; and determine the sum of the feature distances corresponding to each first feature point in the second target region as the similarity parameter between the first target region and the second target region.
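
The matching procedure above can be sketched with NumPy. The text does not name the similarity or distance measure, so cosine similarity and Euclidean distance are assumptions here, as are the function name `similarity_parameter` and the layout of each region as an array of feature vectors.

```python
import numpy as np

def similarity_parameter(second_region, first_region, eps=1e-8):
    """Sum of distances between each point and its best match.

    second_region: (N, C) first feature points, from the second target region
    first_region:  (M, C) second feature points, from the first target region
    """
    # Assumed similarity measure: cosine similarity between feature vectors.
    a = second_region / (np.linalg.norm(second_region, axis=1, keepdims=True) + eps)
    b = first_region / (np.linalg.norm(first_region, axis=1, keepdims=True) + eps)
    sim = a @ b.T                      # (N, M) pairwise similarities
    nearest = sim.argmax(axis=1)       # best match for every first feature point
    matched = first_region[nearest]
    # Assumed distance measure: Euclidean distance to the matched point.
    distances = np.linalg.norm(second_region - matched, axis=1)
    return distances.sum(), nearest

second_pts = np.array([[1.0, 0.0], [0.0, 2.0]])
first_pts = np.array([[0.0, 1.0], [3.0, 0.0]])
total, nearest = similarity_parameter(second_pts, first_pts)
```

In this toy case the first point matches `[3, 0]` (distance 2) and the second matches `[0, 1]` (distance 1), so the sum is 3.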

[0125] The aforementioned denoising module is further configured to: calculate the first gradient value corresponding to the similarity parameter for the second target region in the third current image, and determine the first gradient value as the update gradient parameter corresponding to the second target region; calculate the product of the similarity parameter and the Gaussian smoothing kernel for the transition region in the third current image, calculate the second gradient value corresponding to the product, and determine the second gradient value as the update gradient parameter corresponding to the transition region in the third current image; the transition region is adjacent to the second target region.

[0126] This embodiment also provides an electronic device, including a processor and a memory. The memory stores machine-executable instructions that can be executed by the processor. The processor executes the machine-executable instructions to implement the above-described image generation method, for example: The process involves acquiring a reference image, a background image, and text description information. The reference image includes the target subject. The text description information describes the target display features corresponding to the target subject. The target display features differ from the initial display features of the target subject in the reference image. Based on the reference image and the background image, a target noise image is generated. Based on the target noise image and the text description information, a first image feature is determined. Based on the first image feature, a second image feature, and a third image feature, the target noise image is denoised to generate the target image. The second image feature is determined based on the first noise image corresponding to the reference image and the background image. The third image feature is determined based on the second noise image corresponding to the text description information. The target image includes the display content in the background image and the target subject. The target subject in the target image conforms to the target display features.

[0127] The above method, in the process of integrating the target subject in the reference image into the background image display based on the text description, simultaneously considers the target noise image generated from the two images and the image features determined by the text description, the image features determined by the noise image corresponding to the text description information, and the image features determined by the noise image formed only by the two images. It performs noise reduction processing on the target noise image, taking into account both the identity features of the target subject in the generated image and the display features indicated by the text description information, thereby improving the display effect of the generated image and enhancing the user experience.

[0128] Optionally, the step of generating a target noise image based on a reference image and a background image includes: generating a composite image based on the reference image and the background image; the composite image includes the display content of the background image and the target subject; and performing inversion processing on the composite image to generate the target noise image.
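
The text does not say which inversion procedure is used. Assuming a deterministic DDIM-style inversion, the step from a composite image to a noisy image could be sketched as follows; `predict_noise` is a hypothetical stand-in for the diffusion model's noise predictor, and the schedule `alphas_cumprod` is an illustrative toy, not a value from the source.

```python
import numpy as np

def ddim_invert(x0, alphas_cumprod, predict_noise):
    """Run deterministic DDIM-style steps from a clean image toward noise.

    alphas_cumprod: decreasing cumulative alpha values, one per inversion step
    predict_noise:  callable (x, step_index) -> predicted noise, same shape as x
    """
    x = x0
    a_prev = 1.0  # a clean image corresponds to a cumulative alpha of 1
    for step, a_t in enumerate(alphas_cumprod):
        eps = predict_noise(x, step)
        # Estimate the clean image implied by the current state ...
        x0_pred = (x - np.sqrt(1.0 - a_prev) * eps) / np.sqrt(a_prev)
        # ... and re-noise it to the next, noisier level.
        x = np.sqrt(a_t) * x0_pred + np.sqrt(1.0 - a_t) * eps
        a_prev = a_t
    return x

composite = np.full((2, 2), 2.0)
schedule = [0.9, 0.5, 0.1]
# Trivial predictor used only to make the sketch runnable.
target_noise = ddim_invert(composite, schedule, lambda x, step: np.zeros_like(x))
```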

[0129] Optionally, the reference image has a corresponding first mask image; the first mask image is used to indicate the display area of the target subject in the reference image; the background image has a corresponding second mask image; the second mask image is used to indicate the target area in the background image corresponding to the target subject; the step of generating a composite image based on the reference image and the background image includes: determining the display content of the target subject based on the first mask image and the reference image; determining the display parameters corresponding to the target subject based on the second mask image; and performing composite processing on the display content and the background image based on the display parameters to obtain a composite image; in the composite image, the target subject is located in the image area corresponding to the target area.
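
A minimal sketch of the mask-based compositing follows. It assumes that the display parameters are the position and size of the bounding box of the second mask and that the subject is scaled by nearest-neighbour sampling; neither detail is stated in the text, and the helper names are hypothetical.

```python
import numpy as np

def bounding_box(mask):
    """Return (row_start, row_end, col_start, col_end) of the True area."""
    rows = np.flatnonzero(mask.any(axis=1))
    cols = np.flatnonzero(mask.any(axis=0))
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1

def composite_images(reference, first_mask, background, second_mask):
    # Display content: the subject cropped out of the reference image.
    r0, r1, c0, c1 = bounding_box(first_mask)
    subject = reference[r0:r1, c0:c1]
    subject_mask = first_mask[r0:r1, c0:c1]
    # Display parameters (assumed): position and size of the target area.
    t0, t1, u0, u1 = bounding_box(second_mask)
    h, w = t1 - t0, u1 - u0
    # Nearest-neighbour resize of the subject to the target size.
    ri = np.arange(h) * subject.shape[0] // h
    ci = np.arange(w) * subject.shape[1] // w
    resized = subject[ri][:, ci]
    resized_mask = subject_mask[ri][:, ci]
    out = background.copy()
    region = out[t0:t1, u0:u1]          # view into `out`
    region[resized_mask] = resized[resized_mask]
    return out

reference = np.zeros((4, 4))
reference[1:3, 1:3] = 9.0
first_mask = np.zeros((4, 4), dtype=bool)
first_mask[1:3, 1:3] = True
background = np.zeros((6, 6))
second_mask = np.zeros((6, 6), dtype=bool)
second_mask[2:6, 0:4] = True
composite = composite_images(reference, first_mask, background, second_mask)
```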

[0130] Optionally, the second image feature is determined as follows: a first noisy image is generated based on a reference image and a background image; the first noisy image is determined as the first current image, and a first current time step is determined; the second image feature corresponding to the first current time step is determined based on the first current image; the first current image is denoised based on the second image feature to obtain a processed first current image; the first current image is updated to the processed first current image, and the first current time step is updated; the step of determining the second image feature corresponding to the first current time step based on the first current image is continued until the first current time step meets a preset condition.
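
The loop described above, which records one image feature per time step while denoising, can be sketched generically. The callables `extract_feature` and `denoise_step` are hypothetical stand-ins for the diffusion model, and treating "meets a preset condition" as "the list of time steps is exhausted" is an assumption.

```python
def collect_features(noisy_image, timesteps, extract_feature, denoise_step):
    """Denoise step by step and keep the feature computed at each time step."""
    features = {}
    current = noisy_image
    for t in timesteps:                           # ends when the steps run out
        feature = extract_feature(current, t)     # feature for this time step
        features[t] = feature
        current = denoise_step(current, feature, t)   # update the current image
    return features, current

# Toy scalar "image" and toy model functions, for illustration only.
features, final = collect_features(
    8.0,
    timesteps=[3, 2, 1],
    extract_feature=lambda x, t: 0.5 * x,
    denoise_step=lambda x, f, t: x - f,
)
```

The stored dictionary is what a later denoising pass would look up to obtain the feature corresponding to its own current time step.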

[0131] Optionally, the aforementioned target noise image is generated based on a composite image synthesized from a reference image and a background image; the background image has a corresponding second mask image; the second mask image is used to indicate: the target region in the background image corresponding to the target subject; in the composite image, the target subject is located in the image region corresponding to the target region; the step of generating a first noise image based on the reference image and the background image includes: determining the area in the image region of the composite image that does not display the target subject as a transition region; filling the image region in the target noise image corresponding to the transition region in the composite image with preset random noise, and determining the noise-filled target noise image as the first noise image.
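
Filling the transition region with random noise can be sketched as below. The sketch assumes boolean masks aligned with the target noise image and uses standard normal noise; `region_mask` (the image region corresponding to the target area) and `subject_mask` (where the subject is displayed in the composite) are hypothetical names.

```python
import numpy as np

def first_noise_image(target_noise, region_mask, subject_mask, rng):
    """Replace the transition region of the target noise image with fresh noise."""
    # Transition region: inside the target area but not covered by the subject.
    transition = region_mask & ~subject_mask
    out = target_noise.copy()
    out[transition] = rng.standard_normal(target_noise.shape)[transition]
    return out, transition

target_noise = np.zeros((4, 4))
region_mask = np.zeros((4, 4), dtype=bool)
region_mask[0:3, 0:3] = True
subject_mask = np.zeros((4, 4), dtype=bool)
subject_mask[1, 1] = True

noise_image, transition = first_noise_image(
    target_noise, region_mask, subject_mask, np.random.default_rng(0)
)
```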

[0132] Optionally, the aforementioned third image feature is determined as follows: based on random noise, a second noisy image corresponding to the text description information is generated; the second noisy image is determined as the second current image, and a second current time step is determined; based on the second current image and edge features, the third image feature corresponding to the second current time step is determined; the second current image is denoised based on the third image feature to obtain the processed current image; the edge features are determined based on a reference image; the second current image is updated to the processed second current image, and the second current time step is updated; the step of determining the third image feature corresponding to the second current time step based on the second current image and edge features continues to be executed until the second current time step meets the preset conditions.

[0133] Optionally, the steps described above—determining the first image feature based on the target noisy image and text description information, and denoising the target noisy image based on the first image feature, the second image feature, and the third image feature to generate the target image—include: determining the target noisy image as the third current image and determining the third current time step; obtaining the second image feature and the third image feature corresponding to the third current time step; determining the first image feature corresponding to the third current time step based on the third current image and text description information; denoising the third current image based on the first image feature, the second image feature, and the third image feature to obtain the denoised third current image; determining the updated third current image based on the denoised third current image and updating the third current time step; continuing to execute the step of determining the first image feature corresponding to the third current time step based on the third current image and text description information until the third current time step meets a preset condition, and determining the denoised third current image as the target image.

[0134] Optionally, the first image feature includes a first sub-feature and a second sub-feature; the first sub-feature is generated based on a self-attention mechanism; the second sub-feature is generated based on a cross-attention mechanism; the second image feature is generated based on a self-attention mechanism; the third image feature is generated based on a cross-attention mechanism; the step of denoising the current image based on the first image feature, the second image feature, and the third image feature includes: determining a first target feature based on the first sub-feature, the second image feature, and a preset first weight; determining a second target feature based on the second sub-feature, the third image feature, and a preset second weight; and denoising the current image based on the first target feature and the second target feature.

[0135] Optionally, the step of determining the first target feature based on the first sub-feature, the second image feature, and the preset first weight includes: determining whether the third current time step is within the preset first time step range; if so, determining the first target feature based on the first sub-feature, the second image feature, and the preset first weight; if not, determining the first sub-feature as the first target feature.

[0136] Optionally, the step of determining the second target feature based on the second sub-feature, the third image feature, and the preset second weight includes: determining whether the third current time step is within the preset second time step range; if so, determining the second target feature based on the second sub-feature, the third image feature, and the preset second weight; if not, determining the second sub-feature as the second target feature.

[0137] Optionally, the third current image after denoising includes a first denoised image, a second denoised image, and a third denoised image; the first image features include a fourth image feature, a fifth image feature, and a sixth image feature; the step of determining the first image feature corresponding to the third current time step based on the third current image and text description information, and denoising the third current image based on the first image feature, the second image feature, and the third image feature to obtain the denoised third current image includes: determining the fourth image feature corresponding to the third current time step based on the third current image; denoising the third current image based on the first image feature, the second image feature, and the fourth image feature to obtain the first denoised image; determining the fifth image feature corresponding to the third current time step based on the third current image and text description information; denoising the third current image based on the first image feature, the second image feature, and the fifth image feature to obtain the second denoised image; and determining the sixth image feature corresponding to the current time step based on the third current image and a reference image; denoising the third current image based on the first image feature, the second image feature, and the sixth image feature to obtain the third denoised image.

[0138] Optionally, the step of determining the sixth image feature corresponding to the current time step based on the third current image and the reference image includes: inputting the third current image and edge features into the diffusion model, and outputting the sixth image feature corresponding to the third current time step based on the third current image and edge features through the diffusion model; the diffusion model includes a control network; and the edge features are determined based on the reference image.

[0139] Optionally, the above step of determining the updated third current image based on the denoised third current image includes: generating a target denoised image based on the first denoised image, the second denoised image, the third denoised image and preset weight parameters, and determining the target denoised image as the updated third current image.

[0140] Optionally, the step of generating a target denoised image based on the first denoised image, the second denoised image, the third denoised image, and preset weight parameters includes: obtaining the second current image corresponding to the third current time step; generating the second current image based on the second noisy image; determining the first intermediate feature image corresponding to the second current image and the second intermediate feature image corresponding to the third current image; determining the update gradient parameters corresponding to the third current image based on the first intermediate feature image and the second intermediate feature image; and generating the target denoised image based on the first denoised image, the second denoised image, the third denoised image, the update gradient parameters, and preset weight parameters.

[0141] Optionally, the reference image has a corresponding first mask image; the first mask image is used to indicate the display area of the target subject in the reference image; the target noise image has a corresponding third mask image; the third mask image is used to indicate the display area of the target subject in the target noise image; the step of determining the update gradient parameters corresponding to the third current image based on the first intermediate feature image and the second intermediate feature image includes: processing the first intermediate feature image through the first mask image to determine a first target region in the first intermediate feature image; the first target region corresponds to the display area of the target subject in the reference image; processing the second intermediate feature image through the third mask image to determine a second target region in the second intermediate feature image; the second target region corresponds to the display area of the target subject in the target noise image; determining the similarity parameters of the first target region and the second target region; and determining the update gradient parameters corresponding to the third current image based on the similarity parameters.

[0142] Optionally, the step of determining the similarity parameter of the first target region and the second target region includes: for each first feature point in the second target region, determining the feature similarity between the first feature point and each second feature point in the first target region; determining the second feature point with the highest feature similarity to the first feature point as the second feature point corresponding to the first feature point; calculating the feature distance between the first feature point and the corresponding second feature point; and determining the sum of the feature distances corresponding to each first feature point in the second target region as the similarity parameter of the first target region and the second target region.

[0143] Optionally, the step of determining the update gradient parameter corresponding to the third current image based on the similarity parameter includes: for the second target region in the third current image, calculating the first gradient value corresponding to the similarity parameter, and determining the first gradient value as the update gradient parameter corresponding to the second target region; for the transition region in the third current image, calculating the product of the similarity parameter and the Gaussian smoothing kernel, calculating the second gradient value corresponding to the product, and determining the second gradient value as the update gradient parameter corresponding to the transition region in the third current image; the transition region is adjacent to the second target region.

[0144] As shown in Figure 10, the electronic device includes a processor 100 and a memory 101. The memory 101 stores machine-executable instructions that can be executed by the processor 100. The processor 100 executes the machine-executable instructions to implement the above-described image generation method.

[0145] Furthermore, the electronic device shown in Figure 10 also includes a bus 102 and a communication interface 103; the processor 100, the communication interface 103, and the memory 101 are connected via the bus 102.

[0146] The memory 101 may include high-speed random access memory (RAM) and may also include non-volatile memory, such as at least one disk storage device. Communication between this system network element and at least one other network element is achieved through at least one communication interface 103 (which can be wired or wireless), over the Internet, a wide area network, a local area network, a metropolitan area network, etc. The bus 102 may be an ISA bus, a PCI bus, an EISA bus, etc. The bus can be divided into an address bus, a data bus, a control bus, etc. For ease of representation, the bus is shown in Figure 10 as a single double-headed arrow, but this does not mean that there is only one bus or one type of bus.

[0147] The processor 100 may be an integrated circuit chip with signal processing capabilities. In implementation, each step of the above method can be completed by integrated logic circuitry in the hardware of the processor 100 or by instructions in software form. The processor 100 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), etc.; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA), or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It can implement or execute the methods, steps, and logic block diagrams disclosed in the embodiments of this invention. The general-purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in the embodiments of this invention can be executed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module can reside in a storage medium well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. This storage medium is located in the memory 101; the processor 100 reads the information from the memory 101 and, in conjunction with its hardware, completes the steps of the method described in the foregoing embodiments.

[0148] This embodiment also provides a machine-readable storage medium storing machine-executable instructions. When the machine-executable instructions are called and executed by a processor, the machine-executable instructions cause the processor to implement the above-described image generation method.

[0149] The computer program product for the image generation method, apparatus, and electronic device provided in this invention includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to execute the methods described in the preceding method embodiments, for example: A reference image, a background image, and text description information are acquired. The reference image includes the target subject. The text description information describes the target display features corresponding to the target subject. The target display features differ from the initial display features of the target subject in the reference image. Based on the reference image and the background image, a target noise image is generated. Based on the target noise image and the text description information, a first image feature is determined. Based on the first image feature, a second image feature, and a third image feature, the target noise image is denoised to generate the target image. The second image feature is determined based on the first noise image corresponding to the reference image and the background image. The third image feature is determined based on the second noise image corresponding to the text description information. The target image includes the display content in the background image and the target subject. The target subject in the target image conforms to the target display features.

[0150] The above method, in the process of integrating the target subject in the reference image into the background image display based on the text description, simultaneously considers the target noise image generated from the two images and the image features determined by the text description, the image features determined by the noise image corresponding to the text description information, and the image features determined by the noise image formed only by the two images. It performs noise reduction processing on the target noise image, taking into account both the identity features of the target subject in the generated image and the display features indicated by the text description information, thereby improving the display effect of the generated image and enhancing the user experience.

[0151] Optionally, the step of generating a target noise image based on a reference image and a background image includes: generating a composite image based on the reference image and the background image; the composite image includes the display content of the background image and the target subject; and performing inversion processing on the composite image to generate the target noise image.

[0152] Optionally, the reference image has a corresponding first mask image; the first mask image is used to indicate the display area of the target subject in the reference image; the background image has a corresponding second mask image; the second mask image is used to indicate the target area in the background image corresponding to the target subject; the step of generating a composite image based on the reference image and the background image includes: determining the display content of the target subject based on the first mask image and the reference image; determining the display parameters corresponding to the target subject based on the second mask image; and performing composite processing on the display content and the background image based on the display parameters to obtain a composite image; in the composite image, the target subject is located in the image area corresponding to the target area.

[0153] Optionally, the second image feature is determined as follows: a first noisy image is generated based on a reference image and a background image; the first noisy image is determined as the first current image, and a first current time step is determined; the second image feature corresponding to the first current time step is determined based on the first current image; the current image is denoised based on the second image feature to obtain a processed current image; the first current image is updated to the processed first current image, and the first current time step is updated; the step of determining the second image feature corresponding to the first current time step based on the first current image is continued until the first current time step meets a preset condition.

[0154] Optionally, the aforementioned target noise image is generated based on a composite image synthesized from a reference image and a background image; the background image has a corresponding second mask image; the second mask image is used to indicate: the target region in the background image corresponding to the target subject; in the composite image, the target subject is located in the image region corresponding to the target region; the step of generating a first noise image based on the reference image and the background image includes: determining the area in the image region of the composite image that does not display the target subject as a transition region; filling the image region in the target noise image corresponding to the transition region in the composite image with preset random noise, and determining the noise-filled target noise image as the first noise image.

[0155] Optionally, the aforementioned third image feature is determined as follows: based on random noise, a second noisy image corresponding to the text description information is generated; the second noisy image is determined as the second current image, and a second current time step is determined; based on the second current image and edge features, the third image feature corresponding to the second current time step is determined; the second current image is denoised based on the third image feature to obtain the processed current image; the edge features are determined based on a reference image; the second current image is updated to the processed second current image, and the second current time step is updated; the step of determining the third image feature corresponding to the second current time step based on the second current image and edge features continues to be executed until the second current time step meets the preset conditions.

[0156] Optionally, the steps described above—determining the first image feature based on the target noisy image and text description information, and denoising the target noisy image based on the first image feature, the second image feature, and the third image feature to generate the target image—include: determining the target noisy image as the third current image and determining the third current time step; obtaining the second image feature and the third image feature corresponding to the third current time step; determining the first image feature corresponding to the third current time step based on the third current image and text description information; denoising the third current image based on the first image feature, the second image feature, and the third image feature to obtain the denoised third current image; determining the updated third current image based on the denoised third current image and updating the third current time step; continuing to execute the step of determining the first image feature corresponding to the third current time step based on the third current image and text description information until the third current time step meets a preset condition, and determining the denoised third current image as the target image.

[0157] Optionally, the first image feature includes a first sub-feature and a second sub-feature; the first sub-feature is generated based on a self-attention mechanism; the second sub-feature is generated based on a cross-attention mechanism; the second image feature is generated based on a self-attention mechanism; the third image feature is generated based on a cross-attention mechanism; the step of denoising the third current image based on the first image feature, the second image feature, and the third image feature includes: determining a first target feature based on the first sub-feature, the second image feature, and a preset first weight; determining a second target feature based on the second sub-feature, the third image feature, and a preset second weight; and denoising the third current image based on the first target feature and the second target feature.

[0158] Optionally, the step of determining the first target feature based on the first sub-feature, the second image feature, and the preset first weight includes: determining whether the third current time step is within the preset first time step range; if so, determining the first target feature based on the first sub-feature, the second image feature, and the preset first weight; if not, determining the first sub-feature as the first target feature.

[0159] Optionally, the step of determining the second target feature based on the second sub-feature, the third image feature, and the preset second weight includes: determining whether the third current time step is within the preset second time step range; if so, determining the second target feature based on the second sub-feature, the third image feature, and the preset second weight; if not, determining the second sub-feature as the second target feature.
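The time-step gating in [0158] and [0159] can be illustrated as follows. The text does not state how a sub-feature, an injected feature, and a weight are combined, so this sketch assumes a linear interpolation and an inclusive time-step range; the function name `blend_if_in_range` and the list-of-floats feature representation are hypothetical.

```python
from typing import List, Tuple

Feature = List[float]  # stand-in for an attention feature; an assumption


def blend_if_in_range(
    own_feature: Feature,
    injected_feature: Feature,
    weight: float,
    time_step: int,
    step_range: Tuple[int, int],
) -> Feature:
    """Blend the injected feature into the own feature only inside step_range.

    Assumed blend: (1 - weight) * own + weight * injected.
    Outside the range the own feature is returned unchanged.
    """
    low, high = step_range
    if low <= time_step <= high:
        return [
            (1.0 - weight) * own + weight * injected
            for own, injected in zip(own_feature, injected_feature)
        ]
    return list(own_feature)
```

Under this reading, the first target feature would be `blend_if_in_range(first_sub_feature, second_image_feature, first_weight, t, first_range)` and the second target feature would be `blend_if_in_range(second_sub_feature, third_image_feature, second_weight, t, second_range)`.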

[0160] Optionally, the third current image after denoising includes a first denoised image, a second denoised image, and a third denoised image; the first image feature includes a fourth image feature, a fifth image feature, and a sixth image feature; the step of determining the first image feature corresponding to the third current time step based on the third current image and text description information, and denoising the third current image based on the first image feature, the second image feature, and the third image feature to obtain the denoised third current image includes: determining the fourth image feature corresponding to the third current time step based on the third current image; denoising the third current image based on the first image feature, the second image feature, and the fourth image feature to obtain the first denoised image; determining the fifth image feature corresponding to the third current time step based on the third current image and text description information; denoising the third current image based on the first image feature, the second image feature, and the fifth image feature to obtain the second denoised image; and determining the sixth image feature corresponding to the third current time step based on the third current image and the reference image; denoising the third current image based on the first image feature, the second image feature, and the sixth image feature to obtain the third denoised image.

[0161] Optionally, the step of determining the sixth image feature corresponding to the third current time step based on the third current image and the reference image includes: inputting the third current image and edge features into the diffusion model, and outputting the sixth image feature corresponding to the third current time step based on the third current image and edge features through the diffusion model; the diffusion model includes a control network; and the edge features are determined based on the reference image.

[0162] Optionally, the above step of determining the updated third current image based on the denoised third current image includes: generating a target denoised image based on the first denoised image, the second denoised image, the third denoised image and preset weight parameters, and determining the target denoised image as the updated third current image.
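One possible reading of the weighted combination in [0162] is sketched below. The text does not give the formula, so this assumes a classifier-free-guidance-style combination in which the first denoised image (no condition) is the base and the second (text-conditioned) and third (reference-conditioned) denoised images contribute weighted offsets. The names `w_text` and `w_ref` are hypothetical stand-ins for the preset weight parameters.

```python
from typing import List

Image = List[float]  # stand-in for a (latent) image; an assumption


def combine_denoised(
    first_denoised: Image,
    second_denoised: Image,
    third_denoised: Image,
    w_text: float,
    w_ref: float,
) -> Image:
    """Combine three denoised images with preset weights.

    Assumed formula, per element:
        base + w_text * (text - base) + w_ref * (ref - base)
    """
    return [
        base + w_text * (text - base) + w_ref * (ref - base)
        for base, text, ref in zip(first_denoised, second_denoised, third_denoised)
    ]
```

With both weights set to zero the result is the first denoised image unchanged, which makes the contribution of each condition easy to inspect in isolation.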

[0163] Optionally, the step of generating a target denoised image based on the first denoised image, the second denoised image, the third denoised image, and preset weight parameters includes: obtaining the second current image corresponding to the third current time step, the second current image being generated based on the second noisy image; determining the first intermediate feature image corresponding to the second current image and the second intermediate feature image corresponding to the third current image; determining the update gradient parameters corresponding to the third current image based on the first intermediate feature image and the second intermediate feature image; and generating the target denoised image based on the first denoised image, the second denoised image, the third denoised image, the update gradient parameters, and preset weight parameters.

[0164] Optionally, the reference image has a corresponding first mask image; the first mask image is used to indicate the display area of the target subject in the reference image; the target noise image has a corresponding third mask image; the third mask image is used to indicate the display area of the target subject in the target noise image; the step of determining the update gradient parameters corresponding to the third current image based on the first intermediate feature image and the second intermediate feature image includes: processing the first intermediate feature image through the first mask image to determine a first target region in the first intermediate feature image; the first target region corresponds to the display area of the target subject in the reference image; processing the second intermediate feature image through the third mask image to determine a second target region in the second intermediate feature image; the second target region corresponds to the display area of the target subject in the target noise image; determining the similarity parameters of the first target region and the second target region; and determining the update gradient parameters corresponding to the third current image based on the similarity parameters.
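Selecting a target region from an intermediate feature image with a mask, as in [0164], can be sketched as follows. This is an illustration only: it assumes the mask has already been resized to the spatial size of the feature image, represents the feature image as nested lists (rows, columns, channels), and uses the hypothetical name `select_target_region`.

```python
from typing import List

FeatureMap = List[List[List[float]]]  # rows x cols x channels; an assumption
Mask = List[List[int]]                # rows x cols, non-zero marks the target subject


def select_target_region(feature_map: FeatureMap, mask: Mask) -> List[List[float]]:
    """Return the feature vectors at positions where the mask is non-zero."""
    same_rows = len(feature_map) == len(mask)
    same_cols = all(len(f_row) == len(m_row) for f_row, m_row in zip(feature_map, mask))
    if not (same_rows and same_cols):
        raise ValueError("mask must match the spatial size of the feature map")
    return [
        feature
        for feature_row, mask_row in zip(feature_map, mask)
        for feature, keep in zip(feature_row, mask_row)
        if keep
    ]
```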

[0165] Optionally, the steps for determining the similarity parameters of the first target region and the second target region include: for each first feature point in the second target region, determining the feature similarity between the first feature point and each second feature point in the first target region; determining the second feature point with the highest feature similarity to the first feature point as the second feature point corresponding to the first feature point; calculating the feature distance between the first feature point and the corresponding second feature point; and determining the sum of the feature distances corresponding to each first feature point in the second target region as the similarity parameter of the first target region and the second target region.
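The matching procedure in [0165] can be sketched as a nearest-neighbor search followed by a sum of distances. The text does not name the similarity or distance measures, so this sketch assumes cosine similarity for matching and Euclidean distance for the summed term; all function names are hypothetical.

```python
import math
from typing import List

Point = List[float]  # one feature vector; an assumption


def cosine_similarity(a: Point, b: Point) -> float:
    """Cosine similarity, defined as 0.0 when either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def euclidean_distance(a: Point, b: Point) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def similarity_parameter(
    second_region_points: List[Point],
    first_region_points: List[Point],
) -> float:
    """Sum, over points in the second target region, of the distance to the
    most similar point in the first target region."""
    if not first_region_points:
        raise ValueError("first target region must contain at least one point")
    total = 0.0
    for point in second_region_points:
        # Match by highest similarity, then accumulate the feature distance.
        best = max(first_region_points, key=lambda q: cosine_similarity(point, q))
        total += euclidean_distance(point, best)
    return total
```

A smaller value means the subject region of the image being generated is closer, point by point, to its best matches in the reference-derived region, which is why the value can serve as a loss whose gradient guides the update.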

[0166] Optionally, the step of determining the update gradient parameter corresponding to the third current image based on the similarity parameter includes: for the second target region in the third current image, calculating the first gradient value corresponding to the similarity parameter, and determining the first gradient value as the update gradient parameter corresponding to the second target region; for the transition region in the third current image, calculating the product of the similarity parameter and the Gaussian smoothing kernel, calculating the second gradient value corresponding to the product, and determining the second gradient value as the update gradient parameter corresponding to the transition region in the third current image; the transition region is adjacent to the second target region.

[0167] Those skilled in the art will clearly understand that, for convenience and brevity, the specific working processes of the system and apparatus described above may be understood by reference to the corresponding processes in the foregoing method embodiments, and are not repeated here.

[0168] Furthermore, in the description of the embodiments of the present invention, unless otherwise explicitly specified and limited, the terms "installation," "connection," and "linking" should be interpreted broadly. For example, they can refer to a fixed connection, a detachable connection, or an integral connection; they can refer to a mechanical connection or an electrical connection; they can refer to a direct connection or an indirect connection through an intermediate medium; and they can refer to the internal connection of two components. Those skilled in the art can understand the specific meaning of the above terms in the present invention based on the specific circumstances.

[0169] If the aforementioned functions are implemented as software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. This computer software product is stored in a storage medium and includes several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute all or part of the steps of the methods described in the various embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as USB flash drives, portable hard drives, read-only memory (ROM), random access memory (RAM), magnetic disks, or optical disks.

[0170] In the description of this invention, it should be noted that the terms "center," "upper," "lower," "left," "right," "vertical," "horizontal," "inner," and "outer," etc., indicate the orientation or positional relationship based on the orientation or positional relationship shown in the accompanying drawings. They are used only for the convenience of describing the invention and for simplifying the description, and do not indicate or imply that the device or element referred to must have a specific orientation, or be constructed and operated in a specific orientation. Therefore, they should not be construed as limitations on the invention. Furthermore, the terms "first," "second," and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

[0171] Finally, it should be noted that the above embodiments are merely specific implementations of the present invention, used to illustrate the technical solutions of the present invention and not to limit them. The scope of protection of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art should understand that, within the technical scope disclosed in the present invention, they can still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and should all be covered by the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. An image generation method, characterized in that the method includes: acquiring a reference image, a background image, and text description information, wherein the reference image includes a target subject, the text description information is used to describe target display features corresponding to the target subject, and the target display features are different from the initial display features of the target subject in the reference image; generating a target noise image based on the reference image and the background image; and determining a first image feature based on the target noise image and the text description information, and denoising the target noise image based on the first image feature, a second image feature, and a third image feature to generate a target image; wherein the second image feature is determined based on a first noise image corresponding to the reference image and the background image; the third image feature is determined based on a second noise image corresponding to the text description information; the target image includes the display content in the background image and the target subject; and the target subject in the target image conforms to the target display features.

2. The method according to claim 1, characterized in that the step of generating a target noise image based on the reference image and the background image includes: generating a composite image based on the reference image and the background image, the composite image including the display content of the background image and the target subject; and inverting the composite image to generate the target noise image.

3. The method according to claim 2, characterized in that the reference image has a corresponding first mask image, the first mask image being used to indicate the display area of the target subject in the reference image; the background image has a corresponding second mask image, the second mask image being used to indicate the target region in the background image that corresponds to the target subject; and the step of generating a composite image based on the reference image and the background image includes: determining the display content of the target subject based on the first mask image and the reference image; determining display parameters corresponding to the target subject based on the second mask image; and combining the display content and the background image based on the display parameters to obtain the composite image, wherein, in the composite image, the target subject is located in the image region corresponding to the target region.

4. The method according to claim 3, characterized in that the step of inverting the composite image to generate a target noise image includes: inverting the composite image to obtain a processed composite image; determining, as a transition region, the part of the image region corresponding to the target region in the composite image in which the target subject is not displayed; and filling the image region corresponding to the transition region in the processed composite image with preset random noise, and determining the noise-filled composite image as the target noise image.

5. The method according to claim 1, characterized in that the second image feature is determined in the following way: generating a first noise image based on the reference image and the background image; determining the first noise image as a first current image, and determining a first current time step; determining the second image feature corresponding to the first current time step based on the first current image; denoising the first current image based on the second image feature to obtain a processed first current image; updating the first current image to the processed first current image, and updating the first current time step; and continuing to execute the step of determining the second image feature corresponding to the first current time step based on the first current image until the first current time step meets a preset condition.

6. The method according to claim 1, characterized in that the third image feature is determined in the following way: generating, based on random noise, a second noise image corresponding to the text description information; determining the second noise image as a second current image, and determining a second current time step; determining the third image feature corresponding to the second current time step based on the second current image and edge features, the edge features being determined based on the reference image; denoising the second current image based on the third image feature to obtain a processed second current image; updating the second current image to the processed second current image, and updating the second current time step; and continuing to execute the step of determining the third image feature corresponding to the second current time step based on the second current image and edge features until the second current time step meets a preset condition.

7. The method according to claim 1, characterized in that the steps of determining a first image feature based on the target noise image and the text description information, and denoising the target noise image based on the first image feature, the second image feature, and the third image feature to generate a target image include: determining the target noise image as a third current image, and determining a third current time step; obtaining the second image feature and the third image feature corresponding to the third current time step; determining the first image feature corresponding to the third current time step based on the third current image and the text description information; denoising the third current image based on the first image feature, the second image feature, and the third image feature to obtain a denoised third current image; determining an updated third current image based on the denoised third current image, and updating the third current time step; and continuing to execute the step of determining the first image feature corresponding to the third current time step based on the third current image and the text description information until the third current time step meets a preset condition, and then determining the denoised third current image as the target image.

8. An image generation apparatus, characterized in that the apparatus includes: an image acquisition module, used to acquire a reference image, a background image, and text description information, wherein the reference image includes a target subject, the text description information is used to describe target display features corresponding to the target subject, and the target display features are different from the initial display features of the target subject in the reference image; a target noise image generation module, used to generate a target noise image based on the reference image and the background image; and a noise reduction module, used to determine a first image feature based on the target noise image and the text description information, and to denoise the target noise image based on the first image feature, a second image feature, and a third image feature to generate a target image; wherein the second image feature is determined based on a first noise image corresponding to the reference image and the background image; the third image feature is determined based on a second noise image corresponding to the text description information; the target image includes the display content in the background image and the target subject; and the target subject in the target image conforms to the target display features.

9. An electronic device, characterized in that the electronic device includes a processor and a memory, the memory storing machine-executable instructions that can be executed by the processor, and the processor executing the machine-executable instructions to implement the image generation method according to any one of claims 1-7.

10. A machine-readable storage medium, characterized in that the machine-readable storage medium stores machine-executable instructions which, when invoked and executed by a processor, cause the processor to implement the image generation method according to any one of claims 1-7.