Ultra-high resolution image editing method based on latent diffusion model

By adopting a multi-stage progressive editing method based on a latent diffusion model, the problems of computational resource consumption and detail loss in ultra-high resolution image editing are solved, enabling high-quality image editing on common hardware devices while preserving image details and avoiding memory overflow.

CN122265031A · Pending · Publication Date: 2026-06-23 · WENZHOU UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Application (China)
Current Assignee / Owner
WENZHOU UNIV
Filing Date
2026-02-06
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

Existing methods for ultra-high resolution image editing suffer from high computational resource requirements, loss of image details, and poor results, making it difficult to achieve high-quality ultra-high resolution image editing on common hardware devices.

Method used

A multi-stage progressive editing method based on a latent diffusion model is adopted. Through initialization and multi-scale construction, block encoding, global-local consistency denoising and hybrid sampling, image editing is performed step by step from low resolution to high resolution. The pre-trained model does not require fine-tuning.

Benefits of technology

High-quality ultra-high resolution image editing is achieved on common hardware devices, preserving image details and avoiding memory overflow, thus improving the consistency and quality of editing results.

✦ Generated by Eureka AI based on patent content.


Abstract

The application discloses an ultra-high-resolution image editing method based on a latent diffusion model, which comprises the following steps: acquiring a high-resolution image and a given editing region; down-sampling the image to obtain a multi-scale input image; and performing multi-stage progressive editing, starting from a low resolution, in which the image of each stage is edited, the obtained image is up-sampled, and the result is used as the input of the next stage for further optimization. In each stage, a denoised feature map is obtained by performing block encoding, a global-local consistent denoising process and block-based hybrid sampling. The denoised feature map is decoded to obtain the editing result of the current stage. The obtained ultra-high-resolution editing result can be controlled by a text prompt.

Description

Technical Field

[0001] This invention belongs to the field of image editing technology and relates to an ultra-high resolution image editing method based on a latent diffusion model.

Background Technology

[0002] In recent years, diffusion-based generative methods have demonstrated outstanding performance in image generation and editing tasks. However, in the field of ultra-high-resolution image editing, existing methods are typically limited to images with resolutions below 1K due to memory requirements and the high cost of training with high-resolution data, while images captured by modern devices have resolutions as high as 8K. Simply enlarging low-resolution edits often results in blurry images and loss of detail.

[0003] To address this issue, existing methods mainly fall into two categories. The first category directly retrains a new model for high-resolution image input. However, this consumes a significant amount of computational resources. For example, training the Stable Diffusion 1.5 model to support 512 × 512 resolution requires continuous training on 256 A100 GPUs for 20 days, a resource requirement that is prohibitive for most institutions. Furthermore, collecting high-quality, high-resolution images is itself a challenge, since such images are far more difficult to obtain than ordinary images.

[0004] The second category appends a high-resolution upsampling model after the low-resolution editing algorithm. However, the effectiveness of this approach relies heavily on the coordination between the two stages: poor results from either the initial low-resolution editing or the subsequent upsampling model significantly degrade the final result. Furthermore, because high-resolution images must be downsampled before processing, a significant amount of detail is lost. Low-resolution editing algorithms can only reason from this incomplete information and therefore show clear limitations in detail quality.

[0005] Therefore, how to efficiently utilize pre-trained large latent variable models to achieve high-quality ultra-high resolution image editing on common hardware devices remains a significant challenge in the field of image editing.

Summary of the Invention

[0006] The technical problem to be solved by this invention is to provide a fine-tuning-free ultra-high resolution image editing method based on a latent variable diffusion model, which can achieve high-quality ultra-high resolution image editing on common hardware devices using a pre-trained latent variable diffusion model. The technical solution of this invention is an ultra-high resolution image editing method based on a latent diffusion model, comprising the following steps:
S10, initialization and multi-scale construction;
S20, multi-stage progressive editing;
S30, result output and post-processing.

[0007] Preferably, step S10 includes the following steps:
S11, model initialization: initialize the encoder, denoiser and decoder of the pre-trained model and load the pre-trained model;
S12, image pyramid generation: given a high-resolution image X and its corresponding binary edit region mask M, S stages of downsampling are performed to obtain S images at different scales and S masks at the corresponding scales.

[0008] Preferably, in S11, the encoder encodes image blocks and maps them to the latent variable space. Given an image block $x$, the encoding is expressed as $z_0 = \mathcal{E}(x)$, where $\mathcal{E}$ is the encoder and $z_0$ is the latent variable feature block. Noise is added to the latent variable feature block by the diffusion process to obtain a noise feature block that follows a Gaussian distribution: $q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right)$, $t = 1, \dots, T$. After $T$ steps of adding noise, the noise feature block $z_T$ is obtained; $\beta_t$ is the variance coefficient of each time step and $I$ is the identity matrix.

[0009] Preferably, in S11, the denoiser denoises the noise feature block and gradually obtains the corresponding denoised latent variable feature block. The denoising process is the reverse of the noise addition process: starting from the noise feature block $z_T$, the noise is gradually removed to obtain the final denoised feature block $z_0$. The corresponding formula is expressed as $p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\left(z_{t-1};\ \mu_\theta(z_t, t),\ \Sigma_\theta(z_t, t)\right)$, where $\mu_\theta$ and $\Sigma_\theta$ are the mean and variance predicted at each step. Finally, the decoder $\mathcal{D}$ decodes the feature block $z_0$ to obtain the corresponding editing result.

[0010] Preferably, in step S20, for each scale s of the image pyramid, the following steps are performed cyclically, starting from the minimum resolution: S21, block encoding; S22, iterative denoising optimization; S23, block decoding and fusion; S24, progressive input update.

[0011] Preferably, step S21 comprises the following steps:
S211, using a sliding window to crop the image into several image blocks;
S212, encoding each block independently into the latent variable space to obtain a set of feature blocks;
S213, averaging the overlapping parts of the feature blocks to synthesize a latent variable feature map of the corresponding resolution.
Given an image to be edited, multi-image block encoding is defined in terms of a block-based cropping function based on local window capture, which crops the image in the form of a sliding window into a number of image blocks. The number of image blocks is calculated from the height and width of the cropping window, the step sizes in the height and width directions, and the scaling factor of the current scale. By traversing all image blocks and encoding each of them, a set of corresponding latent variable feature blocks is obtained. By merging and fusing all the encoding results, the final latent variable feature map at the current scale is obtained.

[0012] Preferably, in step S22, at the current scale s, the following iterative steps are performed from time step t=T to 1:
S221, global-local consistency denoising: the pre-trained denoiser computes the denoised feature map at the next time step from the denoised feature map at the current time step; the noise feature map obtained by adding noise, which retains the background information in the latent variable space, and the latent variable feature map after the denoising step are combined according to the corresponding weighting factors;
S222, block-based hybrid sampling, which includes:
S2221, block-based local sampling: the input feature map at scale s and time step t is cropped by a block-based cropping function based on local window capture, which crops in a sliding window manner and yields a corresponding number of feature blocks; the batch denoising function denoises each block and stores the denoised data in memory; the reconstruction function fuses all denoised blocks to obtain the corresponding denoised feature map;
S2222, upsampling-guided sampling: the feature map is cropped by a block-based cropping function based on local window capture, which crops in a sliding window manner with a given window size and step sizes in the height and width directions; the upsampling-guided batch denoising function denoises each block and stores the denoised data in memory; for each feature block, the upsampling-based denoising function uses the corresponding upsampling and downsampling functions with a given scaling factor, and the noise scheduling parameters change with the time step t during denoising; all denoised blocks are fused to obtain the corresponding denoised feature map;
S2223, global sampling based on the dilated kernel: the feature map is sparsely cropped based on a dilated kernel, using a sliding window that extracts one tensor point at each interval; the number of sampling windows is [size missing]; the dilation rate is controlled by the current scale factor; the corresponding batch denoising function denoises each sparse block and stores the denoised data in memory; the reconstruction function fuses all sparse denoised blocks to obtain the corresponding denoised feature map;
S223, multi-sampling path fusion, which fuses the results of the sampling paths.

[0013] The fusion combines the result of block-based local sampling, the result of intermediate sampling guided by upsampling, and the result of global sampling based on dilated kernels, each with its corresponding weight parameter.

[0014] Preferably, step S23 uses a decoder to restore the denoised feature map into an image, and then fuses it with the original background according to the mask to obtain the editing result of the current stage.

[0015] Preferably, in step S24, the current stage editing result is upsampled and used as the initial input for the next stage at a higher resolution scale, thus entering the next scale cycle.

[0016] Preferably, step S30 includes the following steps: S31, boundary smoothing optimization; S32, final result output.

The present invention has at least the following beneficial effects:

1. This invention proposes a novel fine-tuning-free ultra-high resolution image editing method based on a large latent variable diffusion model, enabling high-quality ultra-high resolution image editing on common hardware devices.

[0017] 2. This invention introduces a globally consistent denoising method that can effectively fuse the edited content and the generated content in the latent variable space to remove the boundary artifacts of the edited content.

[0018] 3. This invention introduces a block-based hybrid sampling method that can capture feature information at local, intermediate, and global scales, maintain global consistency, and optimize the generation of details.

[0019] 4. The method of the present invention can be applied to existing image editing models built on large latent variable diffusion models, and can be extended to multimodal editing methods, such as editing guided by text, human key points, sketches, depth maps, etc.

Attached Figure Description

[0020] Figure 1 is a flowchart of the steps of the ultra-high resolution image editing method based on a latent diffusion model according to an embodiment of the present invention.
Figure 2 is a schematic diagram of the specific principle of the ultra-high resolution image editing method based on the latent diffusion model according to an embodiment of the present invention.
Figure 3 is a comparison of the visual effects of the method according to an embodiment of the present invention and the best existing editing algorithm.
Figure 4 is a comparison of the visual effects of the method according to an embodiment of the present invention and the best existing ultra-high resolution image editing algorithm.
Figure 5 shows a high-resolution editing result based on sketch information, produced by the fine-tuning-free ultra-high resolution image editing method according to an embodiment of the present invention.
Figure 6 shows a high-resolution editing result based on depth map information, produced by the fine-tuning-free ultra-high resolution image editing method according to an embodiment of the present invention.

Detailed Implementation

[0021] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and not intended to limit the invention.

[0022] On the contrary, this invention encompasses any substitutions, modifications, equivalent methods, and solutions made within the spirit and scope of the invention as defined in the claims. Furthermore, to provide a better understanding of the invention, certain specific details are described below. Those skilled in the art can fully understand the invention even without these details.

[0023] Referring to Figure 1 and Figure 2, a method embodiment of the present invention includes the following steps: S10, initialization and multi-scale construction, specifically including S11, model initialization, which initializes the encoder, denoiser, and decoder of the pre-trained model and loads the pre-trained model. The encoder encodes image patches, mapping each given image patch to the latent variable space. A noise feature block following a Gaussian distribution is obtained by adding noise to the latent variable feature block using the diffusion process. The denoiser then denoises the noise feature block, gradually obtaining the corresponding denoised latent variable feature block. Specifically, the denoising process is the reverse of the noise addition process: starting from the noise feature block, the noise is gradually removed to obtain the final denoised feature block. The decoder decodes this feature block to obtain the corresponding editing result.

[0024] S12, image pyramid generation: given a high-resolution image X and its corresponding binary edit region mask M, S stages of downsampling are performed to obtain S images at different scales and S masks at the corresponding scales.
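The pyramid construction of S12 can be illustrated with a minimal NumPy sketch. It is not the patent's implementation: the factor-of-2 box-filter downsampling, the inclusion of the original image as the highest scale, and the names `downsample2x` and `build_pyramid` are all assumptions made for illustration.

```python
import numpy as np

def downsample2x(a: np.ndarray) -> np.ndarray:
    """Halve height and width by averaging 2x2 blocks (assumed box filter)."""
    h, w = a.shape[0] // 2 * 2, a.shape[1] // 2 * 2
    a = a[:h, :w]
    return a.reshape(h // 2, 2, w // 2, 2, *a.shape[2:]).mean(axis=(1, 3))

def build_pyramid(image, mask, num_stages):
    """Return images and binary masks ordered from lowest to highest resolution."""
    images = [image.astype(np.float32)]
    masks = [mask.astype(np.float32)]
    for _ in range(num_stages - 1):
        images.append(downsample2x(images[-1]))
        # Re-binarize: a coarse pixel counts as "edited" if any fine pixel was.
        masks.append((downsample2x(masks[-1]) > 0).astype(np.float32))
    return images[::-1], masks[::-1]

# Hypothetical input: a random image X and a rectangular edit region M.
X = np.random.rand(512, 768, 3)
M = np.zeros((512, 768))
M[100:200, 300:500] = 1
images, masks = build_pyramid(X, M, num_stages=3)
```

Editing then starts from `images[0]` (the coarsest scale) and proceeds toward `images[-1]`.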

[0025] S20, multi-stage progressive editing: for each scale s of the image pyramid, the following steps are performed iteratively, starting from the minimum resolution. S21, block encoding, includes the following steps: S211, using a sliding window to crop the image into several image blocks; S212, encoding each block independently into the latent variable space to obtain a set of feature blocks; S213, averaging the overlapping portions of the feature blocks to synthesize a latent variable feature map at the corresponding resolution.

[0026] S22, iterative denoising optimization: at the current scale s, the following iterative steps are performed from time step t=T to 1:
S221, global-local consistency denoising;
S222, block-based hybrid sampling, including S2221, block-based local sampling, S2222, upsampling-guided sampling, and S2223, dilated kernel global sampling;
S223, multi-sampling path fusion.
S23, block decoding and fusion: the decoder restores the denoised feature map into an image, and the image is fused with the original background according to the mask to obtain the editing result of the current stage.
S24, progressive input update: the editing result of the current stage is upsampled and used as the initial input for the next stage at a higher resolution scale, entering the next scale cycle.
S30, result output and post-processing, including the following steps: S31, boundary smoothing optimization, which performs Poisson fusion at the final resolution stage to eliminate artifacts at the edges of the edited region; S32, final result output, which returns the processed ultra-high resolution image.

[0027] In a specific embodiment, this invention uses a pre-trained latent variable diffusion model and extends the inference image resolution, which is otherwise limited by the model itself, so that images can be edited at ultra-high resolution without training or fine-tuning the model parameters. The pre-trained model includes an encoder, a denoiser and a decoder. First, the encoder, denoiser, and decoder of the pre-trained model are initialized and the pre-trained model is loaded.

[0028] S10 of this invention first utilizes a pre-trained large latent variable model at low resolution. In this invention, a low-resolution image is defined as an image patch, and its corresponding mapping in the latent variables is a latent variable feature block; an ultra-high resolution input is defined as an image, and its corresponding mapping in the latent variables is a latent variable feature map.

[0029] In low-resolution scenarios, the encoder primarily encodes image patches, mapping them into the latent variable space. Given an image patch $x$, the encoding can be expressed as $z_0 = \mathcal{E}(x)$, where $\mathcal{E}$ is the encoder and $z_0$ is the latent variable feature block. Noise is then added to the latent variable feature block through diffusion steps to obtain a noise feature block that follows a Gaussian distribution. The corresponding formula can be expressed as $q(z_t \mid z_{t-1}) = \mathcal{N}\left(z_t;\ \sqrt{1-\beta_t}\, z_{t-1},\ \beta_t I\right)$, $t = 1, \dots, T$. After $T$ steps of adding noise, the noise feature block $z_T$ is obtained; $\beta_t$ is the variance coefficient of each time step and $I$ is the identity matrix.
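The step-by-step noise addition described above can be sketched as follows, assuming the standard DDPM forward process with a per-step variance coefficient. The linear variance schedule, the latent shape, and the name `forward_noising` are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def forward_noising(z0, betas, rng):
    """Apply z_t = sqrt(1 - beta_t) * z_{t-1} + sqrt(beta_t) * eps for each step.

    Returns the whole trajectory [z_0, z_1, ..., z_T] so that the noised
    latents of every time step remain available later.
    """
    z = z0.copy()
    trajectory = [z]
    for beta in betas:
        eps = rng.standard_normal(z.shape)  # eps ~ N(0, I)
        z = np.sqrt(1.0 - beta) * z + np.sqrt(beta) * eps
        trajectory.append(z)
    return trajectory

rng = np.random.default_rng(0)
betas = np.linspace(1e-4, 0.02, 1000)   # assumed linear schedule
z0 = rng.standard_normal((4, 16, 16))   # stand-in latent feature block
traj = forward_noising(z0, betas, rng)

# Cumulative product of (1 - beta_t); a value near 0 means z_T is
# almost pure Gaussian noise.
alpha_bar = np.prod(1.0 - betas)
```

Keeping the full trajectory is a deliberate choice in this sketch: a masked editing procedure can later reuse the stored noised latent of each time step for the unedited background.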

[0030] The denoiser denoises the noise feature block, gradually obtaining the corresponding denoised latent variable feature block. The denoising process is the reverse of the noise addition process: starting from the noise feature block $z_T$, the noise is gradually removed to obtain the final denoised feature block $z_0$. The corresponding formula can be expressed as $p_\theta(z_{t-1} \mid z_t) = \mathcal{N}\left(z_{t-1};\ \mu_\theta(z_t, t),\ \Sigma_\theta(z_t, t)\right)$, where $\mu_\theta$ and $\Sigma_\theta$ are the mean and variance predicted at each step. Finally, the decoder $\mathcal{D}$ decodes the feature block $z_0$ to obtain the corresponding editing result.

[0031] The method of this invention does not require retraining and fine-tuning of the model; it is an inference optimization method. It can extend a large latent variable model pre-trained at a specific resolution to adapt to ultra-high resolution image input, maintaining high-quality image editing results without causing memory overflow.

[0032] S20 employs a multi-stage progressive editing approach. It first generates a multi-scale image pyramid from the input high-resolution image, with each stage corresponding to a specific scale. The framework uses a coarse-to-fine granular editing strategy, starting with the image of lowest resolution as the initial stage. Within each stage, block encoding, globally and locally consistent denoising, and block-based hybrid sampling are performed, followed by decoding to generate the corresponding edit at the current scale. This result is then upsampled and fed into the next stage for further optimization.

[0033] Given an image X and a binary mask M, image pyramid generation is performed first: S stages of downsampling yield S images at different scales and S masks at the corresponding scales.

[0034] In multi-stage progressive editing, S21 block encoding addresses the problem of encoding ultra-high-resolution images into the latent variable space without degrading encoding quality or causing memory overflow. The main idea is to treat the encoder as a processor (like a convolutional kernel) with a given window size that handles only one window at a time; the window size can be the scale at which the model was trained, such as a resolution of 1024×1024. The image is block-encoded in the form of a sliding window to obtain the corresponding mapped feature blocks. All image blocks are encoded while the relative two-dimensional spatial relationship of the resulting feature blocks is maintained, and the overlapping parts of the feature blocks are averaged to obtain the encoded feature map of the ultra-high-resolution image. Given an image to be edited, multi-image block encoding is defined in terms of a block-based cropping function based on local window capture, which crops the image in the form of a sliding window into a number of image blocks.

[0035] The number of image blocks is calculated from the height and width of the cropping window, the step sizes in the height and width directions, and the scaling factor of the current scale. By traversing all image patches and encoding each pre-cut image patch, a set of corresponding latent variable feature blocks is obtained.

[0036] By merging and fusing all the encoding results, the final latent variable feature map at the current scale is obtained.
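The crop, encode and average procedure of S211 to S213 can be sketched in NumPy as below. The real method uses the pre-trained encoder; here `toy_encoder` is a stand-in (plain average pooling by a factor of 8), and the window size, stride, and border-clamping rule in `window_starts` are assumptions chosen for illustration.

```python
import numpy as np

def window_starts(size, window, stride):
    """Start offsets of sliding windows; the last window is clamped to the border."""
    if size <= window:
        return [0]
    starts = list(range(0, size - window, stride))
    starts.append(size - window)
    return starts

def toy_encoder(patch, factor=8):
    """Stand-in for the pre-trained encoder: average-pool by `factor` (assumption)."""
    h, w, c = patch.shape
    return patch.reshape(h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

def encode_tiled(image, window=64, stride=48, factor=8):
    """Encode each window independently, then average overlapping latent regions."""
    H, W, C = image.shape
    latent = np.zeros((H // factor, W // factor, C))
    weight = np.zeros((H // factor, W // factor, 1))
    for y in window_starts(H, window, stride):
        for x in window_starts(W, window, stride):
            z = toy_encoder(image[y:y + window, x:x + window], factor)
            ys, xs = y // factor, x // factor
            latent[ys:ys + z.shape[0], xs:xs + z.shape[1]] += z
            weight[ys:ys + z.shape[0], xs:xs + z.shape[1]] += 1.0
    # Each latent position is the mean of all feature blocks that cover it.
    return latent / weight
```

Only one window passes through the encoder at a time, so peak memory depends on the window size rather than on the full image size. With the toy pooling encoder the tiled result matches encoding the whole image at once; a real convolutional encoder would differ slightly near window borders, which is what the overlap averaging is meant to smooth.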

[0037] Using the block encoding method described above, encoded feature maps are obtained for the image at each input scale. The background information in these encoded feature maps is reused in multi-stage progressive editing, preserving the background while guiding the model to generate globally consistent results. The computation runs over a set of scale stages, a hyperparameter that lets users control which scales the model performs inference on, as needed. In use, the invention first performs the calculation at one scale and then transfers the result directly to the next scale. The main idea is to start from the encoded feature map at the lowest scale: by gradually adding and then removing noise, a denoised feature map at the current scale is obtained. This serves as the encoding input for the next stage, resulting in a new encoded feature map that is rescaled using the scaling factors of the current scale and the previous scale. The editing result at each scale is computed with a block-based decoding function that decodes the denoised feature map to obtain the corresponding generated result. To preserve information in the unedited regions, the background information is reused when forming the editing result at each scale. The final editing result is obtained by traversing the entire set of stages up to the final maximum scale factor. In the final stage, Poisson fusion is applied to further improve the smoothness of the edit boundaries.
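The coarse-to-fine loop described above can be outlined with a short sketch. It operates directly on images rather than latents and replaces the whole encode, denoise and decode stage by a caller-supplied `edit_stage` callback, so it only illustrates the control flow and the background reuse. The nearest-neighbour `upsample2x`, the factor of 2 between stages, and all names are assumptions.

```python
import numpy as np

def upsample2x(a):
    """Nearest-neighbour 2x upsampling along height and width (assumed)."""
    return a.repeat(2, axis=0).repeat(2, axis=1)

def progressive_edit(images, masks, edit_stage):
    """Run stages from lowest to highest resolution.

    images, masks: lists ordered coarse to fine; masks are 1 inside the edit region.
    edit_stage(init, mask): hypothetical per-stage editor returning an image.
    """
    result = None
    for image, mask in zip(images, masks):
        # First stage starts from the image itself, later stages from the
        # upsampled result of the previous stage.
        init = image if result is None else upsample2x(result)
        edited = edit_stage(init, mask)
        m = mask[..., None]
        # Reuse the original background outside the edit region.
        result = m * edited + (1.0 - m) * image
    return result

rng = np.random.default_rng(0)
X = rng.random((64, 64, 3))
M = np.zeros((64, 64))
M[16:48, 16:48] = 1.0
images = [X[::4, ::4], X[::2, ::2], X]
masks = [M[::4, ::4], M[::2, ::2], M]
paint_white = lambda init, mask: np.ones_like(init)  # placeholder editor
out = progressive_edit(images, masks, paint_white)
```

Because the background is composited back at every stage, pixels outside the mask in `out` are exactly those of the full-resolution input.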

[0038] Then, at each scale, the global-local consistent denoising process of S221 is guided by the given binary mask. In this process, the pre-trained denoiser computes the denoised feature map at the next time step from the denoised feature map at the current time step. The noise feature map obtained by adding noise retains the background information in the latent variable space, and it is combined with the latent variable feature map after denoising according to weighting factors derived from preset weight information. The process can be described as follows: the noise feature maps of each step of the original diffusion process are first stored, and the unmodified areas are then retained from them, so that only the edit region is edited while the unedited areas are preserved. The role of the global-local consistent denoising process is thus to generate and fill the content of the edit region while preserving the information of the unedited areas, and to guide the generated content in the edit region to be consistent with, and related to, the background information.
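One plausible reading of this mask-guided step is sketched below: at each reverse step the denoiser output is kept inside the edit region, while the stored noised background latent of the matching time step is written back elsewhere. The exact weighting used in the patent is not recoverable from the text, so the scalar weight `w`, the toy denoiser, and the function name are assumptions.

```python
import numpy as np

def consistent_denoise_step(z_t, t, denoiser, bg_noised, mask, w=1.0):
    """One reverse step that regenerates the masked region only.

    bg_noised[k] is the stored noised background latent at time step k;
    mask is 1 inside the edit region; w is an assumed background weight.
    """
    z_prev = denoiser(z_t, t)      # prediction for time step t-1
    keep = w * (1.0 - mask)        # weight of the stored background latent
    return keep * bg_noised[t - 1] + (1.0 - keep) * z_prev

rng = np.random.default_rng(0)
T = 10
z0 = rng.standard_normal((8, 8, 4))   # stand-in latent of the original image
mask = np.zeros((8, 8, 1))
mask[2:6, 2:6] = 1.0

# Store the noised latent of every forward step (bg_noised[0] is z0).
bg_noised = [z0]
for beta in np.linspace(1e-4, 0.02, T):
    noise = rng.standard_normal(z0.shape)
    bg_noised.append(np.sqrt(1 - beta) * bg_noised[-1] + np.sqrt(beta) * noise)

toy_denoiser = lambda z, t: 0.9 * z    # placeholder for the pre-trained denoiser
z = rng.standard_normal(z0.shape)
for t in range(T, 0, -1):
    z = consistent_denoise_step(z, t, toy_denoiser, bg_noised, mask)
```

After the last step the latent outside the mask equals the original latent `z0`, while the masked region holds newly generated content.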

[0039] Each step of the global-local denoising process requires the use of S222's block-based hybrid sampling method to capture global and local information. This avoids memory overflow issues caused by direct inference and achieves high-quality denoising results.

[0040] The block-based hybrid sampling method integrates three sampling methods to enhance local and global consistency, achieving efficient denoising on ultra-high resolution feature maps. Its basic principle is to divide the denoising process into blocks, then perform sampling denoising, and finally fuse the results based on the corresponding coordinate information. The corresponding fusion formula is as follows:

[0041] Here, the fusion combines the result of block-based local sampling, the result of intermediate sampling guided by upsampling, and the result of global sampling based on a dilated kernel. The corresponding weight parameters are adjusted automatically according to the time step, starting from preset weight information.

[0042] The block-based hybrid sampling method integrates S2221 block-based local sampling, S2222 sampling guided by intermediate-scale upsampling, and S2223 global sampling based on dilated kernels. This captures local, intermediate-scale, and global features, achieving high-quality denoising while avoiding memory overflow. The essence of S2221 block-based local sampling is to turn the sampling process into sampling over independent windows, running independent inference on each block and collecting the corresponding inference result. Because only one independent block is computed at a time, this avoids the memory overflow caused by tensor expansion during intermediate inference. The input feature map at scale s and time step t is cropped by a block-based cropping function that uses a sliding window, which yields a corresponding number of feature blocks.

[0043] Each of the given feature blocks is denoised by the batch denoising function, and the denoised data is stored in memory.

[0044] The corresponding reconstruction function fuses all denoised blocks to obtain the corresponding denoised feature map.

[0045] Although block-based local sampling makes inference possible, its receptive field remains limited, so a further intermediate-scale upsampling-guided sampling process is required. Because the model depends on its training resolution, applying it directly to high-resolution computation still leads to significant mode collapse caused by noise. An upsampling-guided sampling method is therefore needed to capture intermediate-scale features. In S2222 upsampling-guided block sampling, the feature map is cropped by a block-based cropping function based on local window capture, which crops in a sliding window manner.

[0046] The size of the sliding window is [size missing], and the step sizes in the two directions are [size missing] and [size missing] respectively.

[0047] The corresponding upsampling-guided batch denoising function denoises each block and stores the denoised data in memory.

[0048] For each feature block, the upsampling-based denoising function applies the corresponding upsampling and downsampling functions with a given scaling factor. During denoising, the noise scheduling parameters change with the time step t. All denoised blocks are then fused to obtain the corresponding denoised feature map.

[0049] To capture global information, the S2223 dilated kernel-based global sampling method is further introduced into the denoising process. The basic idea is to apply a dilation rate to the sampled points, that is, to sample at intervals of a certain number of points and obtain sparse sampling points. This significantly expands the sampling range while keeping the total number of sampled points within the scale seen during training. The feature map is sparsely cropped based on the dilated kernel: a sliding window is used for cropping, extracting one tensor point at each interval.

[0050] The number of sampling windows is [size missing]. The dilation rate is controlled by the current scale factor. The corresponding batch denoising function denoises each sparse block and stores the denoised data in memory.

[0051] The corresponding reconstruction function fuses all sparse denoised blocks to obtain the corresponding denoised feature map.
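The dilated (interval) sampling and reconstruction can be sketched with strided slicing, as below. Each sparse block takes every `rate`-th point and therefore spans the whole feature map while containing only a fraction of its points. Tying `rate` to the current scale factor is left to the caller, and `denoise_block` is a placeholder for the pre-trained denoiser; both are assumptions of this sketch.

```python
import numpy as np

def dilated_global_denoise(z, rate, denoise_block):
    """Split z (H, W, C) into rate*rate interleaved sparse blocks,
    denoise each block independently, and scatter the results back."""
    out = np.empty_like(z)
    for i in range(rate):
        for j in range(rate):
            sparse = z[i::rate, j::rate]               # every rate-th point
            out[i::rate, j::rate] = denoise_block(sparse)
    return out

# Toy usage: record block shapes and apply a placeholder "denoiser".
z = np.arange(8 * 8, dtype=float).reshape(8, 8, 1)
shapes = []

def toy_block_denoiser(block):
    shapes.append(block.shape)
    return block * 0.5

out = dilated_global_denoise(z, rate=2, denoise_block=toy_block_denoiser)
```

With `rate=2` on an 8×8 map, four 4×4 blocks are processed, each covering the full spatial extent, and the reconstruction writes every point back to its original position.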

[0052] Figure 3 compares the visual results of the ultra-high resolution image editing method based on the latent diffusion model according to an embodiment of the present invention with those of the best existing editing algorithm. Figure 4 compares the visual results of the same method with those of the best existing ultra-high resolution image editing algorithm. Figure 5 shows high-resolution editing results guided by sketch information, obtained with the fine-tuning-free ultra-high resolution image editing method based on a large latent variable diffusion model according to an embodiment of the present invention. Figure 6 shows high-resolution editing results guided by depth map information, obtained with the same method.

[0053] In summary, this invention addresses the limitations of existing pre-trained large models in scaling resolution by designing a progressive editing framework that moves from coarse to fine granularity. This framework gradually generates and optimizes the input high-resolution image from global information to detailed levels. During each denoising step, a denoising method that ensures consistency between global and local information is introduced, effectively preserving the surrounding information of unedited regions while progressively integrating the consistency between the background and the edited region. Furthermore, to address the inference memory overflow problem of high-resolution images on common devices, this invention innovatively proposes a block-based hybrid sampling method. This method comprehensively utilizes block-based local sampling, upsampling-guided sampling, and dilated convolution kernel-based global sampling to fully capture both global and local information, achieving high-quality editing of high-resolution images.
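
As a rough illustration of how the three sampling paths of the block-based hybrid sampling could be combined, the sketch below assumes a weighted sum of the local, intermediate, and global results. Step S223 in the claims refers to weight parameters for the fusion, but the fusion formula itself is not shown in this text, so the weighted-sum form, the normalization of the weights, and the name `fuse_paths` are assumptions.

```python
import numpy as np

def fuse_paths(z_local, z_mid, z_global, weights=(1.0, 1.0, 1.0)):
    """Weighted combination of three denoised feature maps of equal shape.

    Assumption: weights are non-negative and are normalized to sum to 1 so
    that the fused map stays on the same scale as its inputs.
    """
    w = np.asarray(weights, dtype=np.float64)
    if w.shape != (3,) or np.any(w < 0) or w.sum() == 0:
        raise ValueError("weights must be three non-negative values, not all zero")
    w = w / w.sum()
    return w[0] * z_local + w[1] * z_mid + w[2] * z_global
```

With equal weights this reduces to a plain average of the three paths; raising one weight shifts the result toward local detail, intermediate structure, or global layout.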

[0054] Implementing the embodiments of the present invention has the following beneficial effects: 1. This invention proposes a novel fine-tuning-free ultra-high resolution image editing method based on a large latent variable diffusion model, enabling high-quality ultra-high resolution image editing on common hardware devices.

[0055] 2. This invention introduces a globally consistent denoising method that can effectively fuse the edited content and the generated content in the latent variable space to remove the boundary artifacts of the edited content.

[0056] 3. This invention introduces a block-based hybrid sampling method that can capture feature information at local, intermediate, and global scales, maintain global consistency, and optimize the generation of details.

[0057] 4. The method of the present invention can be applied to existing image editing models built on large latent variable diffusion models, and can be extended to multimodal editing, such as editing guided by text, human keypoints, sketches, or depth maps.

[0058] Those skilled in the art will understand that all or part of the steps in the methods of the above embodiments can be implemented by a program instructing related hardware. The program can be stored in a computer-readable storage medium, such as ROM / RAM, disk, optical disk, etc.

[0059] The above description is merely a preferred embodiment of the present invention and is not intended to limit the present invention. Any modifications, equivalent substitutions, and improvements made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A method for editing ultra-high resolution images based on a latent diffusion model, characterized in that it includes the following steps: S10, initialization and multi-scale construction; S20, multi-stage progressive editing; S30, result output and post-processing.

2. The method according to claim 1, characterized in that S10 includes the following steps: S11, model initialization, which initializes the encoder, denoiser and decoder of the pre-trained model and loads the pre-trained model; S12, image pyramid generation: given a high-resolution image X and its corresponding binary editing region mask M, S stages of downsampling are performed to obtain S images at different scales [symbols missing] and masks at different scales [symbols missing], where [conditions missing].

3. The method according to claim 2, characterized in that, in S11, the encoder encodes the image blocks and maps them to the latent variable space. Given an image block [symbol missing], the encoding formula is expressed as: [formula missing]. Noise is added to the latent variable feature blocks by a diffusion process to obtain Gaussian-distributed noise feature blocks, and the corresponding formulas are expressed as: [formulas missing]. After [symbol missing] steps of progressively adding noise, the noise feature block [symbol missing] is obtained, where [symbol missing] is the variance coefficient at each time step and [symbol missing] is an identity matrix.

4. The method according to claim 2, characterized in that, in step S11, the denoiser denoises the noise feature blocks, progressively obtaining the corresponding denoised latent variable feature blocks. Specifically, the denoising process is the reverse of the noise addition process: starting from the noise feature block [symbol missing], the noise is progressively removed to obtain the final denoised feature block [symbol missing]. The corresponding formulas are expressed as: [formulas missing], where [symbols missing] are the mean and variance predicted at each step. Finally, the decoder decodes the feature block [symbol missing] to obtain the corresponding editing result.

5. The method according to claim 1, characterized in that, in step S20, for each scale s of the image pyramid, the following steps are performed iteratively, starting from the lowest resolution: S21, block encoding; S22, iterative denoising optimization; S23, block decoding and fusion; S24, progressive input update.

6. The method according to claim 5, characterized in that S21 includes the following steps: S211, using a sliding window to crop the image into several image blocks; S212, encoding each block independently into the latent variable space to obtain a set of feature blocks; S213, averaging the overlapping parts of the feature blocks to synthesize a latent variable feature map of the corresponding resolution. Given an image to be edited [symbol missing], multi-block encoding is defined by the formula: [formula missing], where [symbol missing] is a block-based cropping function that captures local windows and crops in a sliding-window manner, yielding [symbol missing] image blocks. The calculation formula is: [formula missing], where [symbols missing] are the height and width of the cropping window, [symbols missing] are the corresponding step sizes in the height and width directions, and [symbol missing] is the scaling factor of the current scale. By traversing and encoding all image blocks, a set of corresponding latent variable feature blocks is obtained. The corresponding calculation formula is: [formula missing]. By merging and fusing all encoding results, the final latent variable feature map is obtained. The corresponding calculation formula is: [formula missing], where fusing all latent variable feature blocks yields the latent variable feature map [symbol missing] at scale [symbol missing].

7. The method according to claim 5, characterized in that, in step S22, at the current scale s, the following steps are performed iteratively from time step t=T to 1:
S221, global-local consistency denoising, with the corresponding formulas: [formulas missing]; the denoising formula of the pre-trained denoiser is: [formula missing], where [symbol missing] is the denoised feature map at time step [symbol missing], [symbol missing] is the denoised feature map at time step [symbol missing], [symbol missing] is the corresponding pre-trained denoiser model, [symbol missing] is the corresponding noised feature map, which retains the background information in the latent variable space, [symbol missing] is the latent variable feature map after the [symbol missing]-th denoising step, and [symbols missing] are the corresponding weighting factors;
S222, block-based hybrid sampling, including:
S2221, block-based local sampling, computed as: [formula missing], where [symbol missing] is the input feature map at scale s and time step t, and [symbol missing] is a block-based cropping function that captures local windows and crops in a sliding-window manner, with the corresponding formula: [formula missing], where the total number of feature blocks is given by: [formula missing]; [symbol missing] is the corresponding batch denoising function, which denoises each block and stores the denoised data in memory, with the corresponding formula: [formula missing]; [symbol missing] is the corresponding reconstruction function, which fuses all denoised blocks to obtain the corresponding denoised feature map, with the corresponding formula: [formula missing];
S2222, upsampling-guided sampling, in which the upsampling-guided block sampling is computed as: [formula missing], where [symbol missing] is a block-based cropping function that captures local windows and crops in a sliding-window manner, with the corresponding formula: [formula missing], where, for [symbol missing], the corresponding formula is: [formula missing]; the size of the sliding window is [size missing], and the step sizes are [symbol missing] and [symbol missing], respectively; [symbol missing] is the corresponding upsampling-guided batch denoising function, which denoises each block and stores the denoised data in memory, with the corresponding formula: [formula missing]; for each feature block, the corresponding upsampling-based denoising function is given by: [formula missing], where [symbols missing] are the corresponding upsampling and downsampling functions, the scaling factor is [symbol missing], and [symbols missing] are the noise scheduling parameters of the denoising process, which change with the time step t; all denoised blocks are fused to obtain the corresponding denoised feature map, with the corresponding formula: [formula missing];
S2223, dilated kernel-based global sampling, with the corresponding formula: [formula missing], where [symbol missing] is the sparse cropping function based on the dilated kernel, which crops in a sliding-window manner and extracts one tensor point at each interval, with the corresponding formula: [formula missing]; the number of sampling windows is [size missing], and the dilation rate is controlled by the current scale factor; [symbol missing] is the corresponding batch denoising function, which denoises each sparse block and stores the denoised data in memory, with the corresponding formula: [formula missing]; [symbol missing] is the corresponding reconstruction function, which fuses all sparse denoised blocks to obtain the corresponding denoised feature map, with the corresponding formula: [formula missing];
S223, multi-sampling-path fusion, with the corresponding fusion formula: [formula missing], where [symbol missing] is the result of block-based local sampling, [symbol missing] is the result of upsampling-guided intermediate sampling, [symbol missing] is the result of dilated kernel-based global sampling, and [symbols missing] are the corresponding weight parameters.

8. The method according to claim 5, characterized in that S23 uses a decoder to restore the denoised feature map into an image, and then fuses it with the original background according to the mask to obtain the editing result of the current stage.

9. The method according to claim 5, characterized in that S24 upsamples the editing result of the current stage and uses it as the initial input for the next stage at a higher resolution scale, thereby entering the next scale cycle.

10. The method according to claim 1, characterized in that S30 includes the following steps: S31, boundary smoothing optimization; S32, final result output.