A multi-round image editing method and system based on high-frequency detail information injection
By constructing a reference velocity field and performing Laplacian pyramid decomposition and fusion, the problems of subject identity distortion and texture detail loss in multi-round image editing are solved, improving the accuracy and quality of image editing.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- SUN YAT SEN UNIV
- Filing Date
- 2026-04-09
- Publication Date
- 2026-06-12
Figure CN122199642A_ABST
Abstract
Description
Technical Field
[0001] This application relates to the field of image editing technology, and in particular to a multi-round image editing method and system based on high-frequency detail information injection.
Background Technology
[0002] In recent years, generative models, represented by diffusion models and flow matching models, have achieved revolutionary breakthroughs, demonstrating unprecedented capabilities in the field of image synthesis. Building upon this foundation, instruction-based image editing has rapidly developed as a particularly important branch, allowing users to intuitively modify images using natural language instructions. However, real-world creative tasks are inherently complex and progressive, and a single instruction often fails to fully express the user's final intent. Therefore, multi-turn instruction-based image editing has become an inevitable trend. It supports users in making refined and global iterative adjustments to images through a series of consecutive instructions—for example, changing a virtual character's entire outfit, or implementing complex scene transitions step-by-step in film post-production. This interactive mode greatly enhances the flexibility and precise control of creation, and is key to driving the large-scale application of generative editing technology.
[0003] Leading existing models include Flux.1 Kontext and Qwen-Image-Edit. While these models perform well in single-round editing, their stability and fidelity face significant challenges in multi-round continuous editing. When an image is edited iteratively, information loss that is barely perceptible in one round accumulates and is amplified in subsequent rounds. This ultimately leads to two common core problems: identity distortion and texture degradation. After multiple rounds of editing, the identity features of the subject gradually drift, resulting in a significant difference from the original image; at the same time, surface textures (such as skin texture, hair details, or fabric patterns) gradually blur and lose realism. These degradation problems, caused by error accumulation, severely limit the potential of existing models in fine-grained, long-workflow creative tasks and highlight the urgent need for a general solution for flow matching models.
Summary of the Invention
[0004] Based on this, the objective of this invention is to provide a general, training-free framework for flow matching models that alleviates degradation in multi-round image editing and protects high-frequency texture.
[0005] To achieve the above objective, the first aspect of this application provides a multi-round image editing method based on high-frequency detail information injection, comprising:
obtaining the image edited in the previous round from a flow-matching-based multi-round instruction image editing model as the reference image;
during the current round of image editing, following the generation sequence of the model, constructing a reference velocity field within a preset time window from the reference image and the noise image at the current moment;
performing Laplacian pyramid decomposition on the model's own predicted velocity field and on the reference velocity field to obtain their respective multi-scale Laplacian pyramid representations, each comprising high-frequency detail layers and a low-frequency base layer;
performing weighted fusion of each high-frequency detail layer of the reference velocity field with the corresponding high-frequency detail layer of the predicted velocity field, and using the low-frequency base layer of the predicted velocity field as the fused low-frequency base layer, to construct a fused Laplacian pyramid;
reverse-reconstructing the fused Laplacian pyramid to obtain the edit velocity field, and denoising the model generation process with the edit velocity field and the predicted velocity field to generate the image after the current round of editing.
[0006] Preferably, the preset time window corresponds to the early stage of the flow matching model generation process, which is the first 30% of the total denoising steps.
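[0007] Preferably, the reference velocity field is constructed using linear interpolation, specifically as follows:
$$v_{\mathrm{ref}} = \frac{x_{\mathrm{ref}} - x_t}{t_{\mathrm{end}} - t}$$
[0008] where $v_{\mathrm{ref}}$ is the reference velocity field, $x_{\mathrm{ref}}$ is the reference image, $x_t$ is the noise image at the current time $t$, and $t_{\mathrm{end}}$ is the final time step.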
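The linear-interpolation construction can be illustrated with a minimal NumPy sketch. It assumes the reference velocity is the constant velocity that carries the current noisy image to the reference image by the final time step, i.e. (x_ref - x_t) / (t_end - t). The function name, the array shapes, and the time convention (noise at t = 1, image at t = 0) are illustrative assumptions, not taken from the source.

```python
import numpy as np

def reference_velocity(x_ref, x_t, t, t_end, eps=1e-8):
    """Constant velocity that moves x_t (at time t) onto x_ref (at time t_end)
    along a straight line. Hypothetical helper, for illustration only."""
    dt = t_end - t
    if abs(dt) < eps:
        raise ValueError("t must differ from t_end")
    return (x_ref - x_t) / dt

rng = np.random.default_rng(0)
x_ref = rng.standard_normal((4, 8, 8))  # stand-in for the previous round's image (or latent)
x_t = rng.standard_normal((4, 8, 8))    # stand-in for the current noisy state
t, t_end = 1.0, 0.0                     # ASSUMED convention: noise at t=1, image at t=0

v_ref = reference_velocity(x_ref, x_t, t, t_end)

# One Euler step across the remaining time lands on the reference image.
x_end = x_t + (t_end - t) * v_ref
```

The sign of the result follows whichever time direction the sampler uses, because the numerator and the denominator flip together.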
[0009] Preferably, the specific steps for performing Laplacian pyramid decomposition on a velocity field include: for the input velocity field $v$, first construct the corresponding Gaussian pyramid, and let the Gaussian feature of the 0th layer be:
$$G_0 = v$$
[0010] Subsequent layers of features are generated recursively through Gaussian smoothing and downsampling:
$$G_{i+1} = \mathrm{Down}\big(\mathrm{Gauss}(G_i)\big), \quad i = 0, \dots, N-2$$
[0011] Based on this, a Laplacian pyramid is constructed. For the i-th layer, the Gaussian features of the (i+1)-th layer are first upsampled to the same resolution as the i-th layer:
$$\tilde{G}_i = \mathrm{Up}(G_{i+1})$$
[0012] Then, the upsampled approximation is subtracted from the current layer's Gaussian features to obtain the i-th layer's high-frequency detail features:
$$L_i = G_i - \tilde{G}_i$$
[0013] where $L_i$ represents the high-frequency residual information at the current scale, and $\tilde{G}_i$ represents the upsampled approximation of the current layer's Gaussian features. For the topmost layer, i.e., the (N-1)-th layer, no further difference operation is performed; instead, this coarsest-scale Gaussian feature is used directly as the low-frequency base layer, $L_{N-1} = G_{N-1}$.
[0014] Finally, the multi-scale Laplacian pyramid representations of the predicted velocity field and the reference velocity field are obtained respectively:
[0015] $$\mathcal{P}_{\mathrm{pred}} = \{L_i^{\mathrm{pred}}\}_{i=0}^{N-1}, \qquad \mathcal{P}_{\mathrm{ref}} = \{L_i^{\mathrm{ref}}\}_{i=0}^{N-1}$$
[0016] where $\mathcal{P}_{\mathrm{pred}}$ and $\mathcal{P}_{\mathrm{ref}}$ are the multi-scale Laplacian pyramid representations of the predicted velocity field and the reference velocity field, respectively; layers 0 to N-2 are high-frequency detail layers, layer N-1 is the low-frequency base layer, and N is the total number of layers in the pyramid decomposition.
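The decomposition steps can be sketched in NumPy as follows. This is an illustration under stated assumptions rather than the applicant's implementation: the 5-tap binomial kernel (a common Gaussian approximation), reflect padding, stride-2 downsampling, and nearest-neighbour upsampling are all choices made here, and the function names are hypothetical.

```python
import numpy as np

# ASSUMPTION: 5-tap binomial kernel as the Gaussian smoothing filter.
KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0

def blur(x):
    """Separable blur over the last two axes with reflect padding."""
    for axis in (-2, -1):
        pad = [(0, 0)] * x.ndim
        pad[axis] = (2, 2)
        xp = np.pad(x, pad, mode="reflect")
        n = x.shape[axis]
        x = sum(w * np.take(xp, np.arange(k, k + n), axis=axis)
                for k, w in enumerate(KERNEL))
    return x

def downsample(x):
    """Keep every second sample along height and width."""
    return x[..., ::2, ::2]

def upsample(x, shape):
    """Nearest-neighbour upsampling, cropped to the target (H, W)."""
    up = np.repeat(np.repeat(x, 2, axis=-2), 2, axis=-1)
    return up[..., :shape[0], :shape[1]]

def laplacian_pyramid(v, num_levels):
    """Return [L_0, ..., L_{N-2}, base]: detail layers fine to coarse, then the base layer."""
    g = np.asarray(v, dtype=np.float64)               # G_0 = v
    layers = []
    for _ in range(num_levels - 1):
        g_next = downsample(blur(g))                  # G_{i+1}
        layers.append(g - upsample(g_next, g.shape[-2:]))  # L_i = G_i - Up(G_{i+1})
        g = g_next
    layers.append(g)                                  # coarsest Gaussian level is the base layer
    return layers

rng = np.random.default_rng(0)
v = rng.standard_normal((4, 16, 16))                  # stand-in velocity field (C, H, W)
pyr = laplacian_pyramid(v, num_levels=3)
print([layer.shape for layer in pyr])
```

Because each detail layer is defined as the residual against the same upsampling operator, adding the layers back from coarse to fine recovers the input exactly, whichever smoothing kernel is chosen.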
[0017] Preferably, in the weighted fusion, a weighted sum is performed for each high-frequency detail layer according to a preset guidance coefficient, and the fusion formula is as follows:
[formula not reproduced in the source text]
[0018] where $\hat{L}_i$ is the fused i-th high-frequency detail layer; $L_i^{\mathrm{pred}}$ and $L_i^{\mathrm{ref}}$ are the high-frequency components of the predicted velocity field and the reference velocity field at the i-th layer, respectively; and $\lambda_i$ is the preset guidance coefficient for the i-th layer, where i ranges from 0 to N-2.
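The layer-wise fusion can be sketched as below. The exact weighting formula is not recoverable from this text, so the sketch assumes a convex blend, (1 - lam) * pred + lam * ref, for each detail layer. That form, the function name, and the toy values are assumptions made for illustration. The base layer is taken from the predicted field unchanged, as the method describes.

```python
import numpy as np

def fuse_pyramids(pyr_pred, pyr_ref, guidance):
    """Blend detail layers of two pyramids and keep the predicted base layer.

    pyr_pred, pyr_ref: lists [L_0, ..., L_{N-2}, base] with matching shapes.
    guidance: N-1 coefficients, one per detail layer.
    ASSUMPTION: convex blend (1 - lam) * pred + lam * ref; the source does not
    show the exact formula.
    """
    if len(pyr_pred) != len(pyr_ref):
        raise ValueError("pyramids must have the same number of layers")
    if len(guidance) != len(pyr_pred) - 1:
        raise ValueError("need one guidance coefficient per detail layer")
    fused = [(1.0 - lam) * lp + lam * lr
             for lp, lr, lam in zip(pyr_pred[:-1], pyr_ref[:-1], guidance)]
    fused.append(pyr_pred[-1])  # low-frequency base comes from the prediction only
    return fused

# Toy pyramids with constant layers so the blend is easy to read off.
pyr_pred = [np.zeros((2, 8, 8)), np.zeros((2, 4, 4)), np.full((2, 2, 2), 5.0)]
pyr_ref = [np.ones((2, 8, 8)), np.ones((2, 4, 4)), np.full((2, 2, 2), -3.0)]
fused = fuse_pyramids(pyr_pred, pyr_ref, guidance=[0.5, 0.25])
```

With all coefficients set to zero the fused pyramid equals the predicted pyramid, which gives a simple way to switch the injection off.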
[0019] Preferably, the reverse reconstruction of the fused Laplacian pyramid specifically includes: starting from the fused low-frequency base layer and proceeding from coarse to fine (layers N-2 down to 0), the current reconstruction result is upsampled and added pixel-wise to the fused high-frequency detail layer of that level. After the traversal, the edit velocity field is obtained.
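The coarse-to-fine reconstruction can be sketched as follows. Nearest-neighbour upsampling and the function names are assumptions made here. In practice the upsampling operator should be the same one used during decomposition, so that an unmodified pyramid reconstructs its input exactly.

```python
import numpy as np

def upsample(x, shape):
    """Nearest-neighbour upsampling, cropped to the target (H, W). ASSUMED operator."""
    up = np.repeat(np.repeat(x, 2, axis=-2), 2, axis=-1)
    return up[..., :shape[0], :shape[1]]

def reconstruct(pyramid):
    """Collapse [L_0, ..., L_{N-2}, base] into a single field, coarse to fine."""
    out = pyramid[-1]                       # start from the low-frequency base layer
    for detail in reversed(pyramid[:-1]):   # layers N-2 down to 0
        out = detail + upsample(out, detail.shape[-2:])
    return out

# Toy pyramid with constant layers: base 3, then details 2 and 1.
pyramid = [np.ones((1, 4, 4)), 2.0 * np.ones((1, 2, 2)), 3.0 * np.ones((1, 1, 1))]
v_edit = reconstruct(pyramid)
```

In the toy case every pixel of the result is the sum of the three constants, which makes the additive nature of the traversal easy to verify by hand.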
[0020] Preferably, denoising with the edit velocity field and the predicted velocity field specifically includes: within the preset time window, the edit velocity field replaces the predicted velocity field for denoising; after the preset time window ends, the predicted velocity field is used for denoising.
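The switching rule can be sketched with a plain Euler sampler. The sampler form, the callables `predict_velocity` and `edit_velocity`, and the toy values are hypothetical stand-ins. Only the rule itself (edit velocity inside the early window, predicted velocity afterwards, with a 30% window) comes from the text.

```python
import numpy as np

def sample_with_window(x, timesteps, predict_velocity, edit_velocity, window_fraction=0.3):
    """Euler integration that swaps in the edit velocity during the early window.

    timesteps: 1-D sequence of length num_steps + 1, in sampling order.
    predict_velocity(x, t): stand-in for the model's predicted velocity field.
    edit_velocity(x, t, v_pred): stand-in for the fused (edit) velocity field.
    """
    num_steps = len(timesteps) - 1
    window_steps = int(round(window_fraction * num_steps))
    used = []
    for k in range(num_steps):
        t, t_next = timesteps[k], timesteps[k + 1]
        v_pred = predict_velocity(x, t)
        if k < window_steps:
            v = edit_velocity(x, t, v_pred)   # inside the window: replace the prediction
            used.append("edit")
        else:
            v = v_pred                        # after the window: plain prediction
            used.append("pred")
        x = x + (t_next - t) * v              # Euler step
    return x, used

# Toy run: the "model" predicts zero velocity, the edit adds a constant offset.
timesteps = np.linspace(1.0, 0.0, 11)         # 10 steps, ASSUMED to run from t=1 to t=0
x_out, used = sample_with_window(
    np.ones((2, 2)), timesteps,
    predict_velocity=lambda x, t: np.zeros_like(x),
    edit_velocity=lambda x, t, v_pred: v_pred + 1.0,
)
```

With 10 steps and a 30% window, only the first three steps use the edit velocity, so later steps are free to follow the editing instruction without the reference constraint.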
[0021] To achieve the purpose of the invention, a second aspect of this application provides a multi-round image editing system based on high-frequency detail information injection, applying the multi-round image editing method based on high-frequency detail information injection described above. The system includes:
a reference image acquisition module, used to acquire the image after the previous round of editing in the multi-round instruction image editing model as the reference image, and to acquire the noise image at the current moment;
a reference flow construction module, used to construct a reference velocity field within a preset time window during the current round of image editing, following the model's generation sequence, from the reference image and the noise image at the current moment;
a frequency domain decomposition module, used to perform Laplacian pyramid decomposition on the predicted velocity field and the reference velocity field respectively to obtain their respective multi-scale Laplacian pyramid representations, each comprising high-frequency detail layers and a low-frequency base layer;
a fusion module, used to perform weighted fusion of each high-frequency detail layer of the reference velocity field with the corresponding high-frequency detail layer of the predicted velocity field, and to use the low-frequency base layer of the predicted velocity field as the fused low-frequency base layer, to construct the fused Laplacian pyramid;
a reconstruction and generation module, used to reverse-reconstruct the fused Laplacian pyramid to obtain the edit velocity field, and to use the edit velocity field and the predicted velocity field to denoise the model generation process and generate the image after the current round of editing.
[0022] Preferably, the reference flow construction module constructs the reference velocity field using linear interpolation, specifically as follows:
$$v_{\mathrm{ref}} = \frac{x_{\mathrm{ref}} - x_t}{t_{\mathrm{end}} - t}$$
[0023] where $v_{\mathrm{ref}}$ is the reference velocity field, $x_{\mathrm{ref}}$ is the reference image, $x_t$ is the noise image at the current time $t$, and $t_{\mathrm{end}}$ is the final time step.
[0024] Preferably, when the fusion module performs weighted fusion, a weighted sum is performed for each high-frequency detail layer according to a preset guidance coefficient, and the fusion formula is:
[formula not reproduced in the source text]
[0025] where $\hat{L}_i$ is the fused i-th high-frequency detail layer; $L_i^{\mathrm{pred}}$ and $L_i^{\mathrm{ref}}$ are the high-frequency components of the predicted velocity field and the reference velocity field at the i-th layer, respectively; and $\lambda_i$ is the preset guidance coefficient for the i-th layer, where i ranges from 0 to N-2 and N is the total number of layers in the pyramid decomposition.
[0026] Compared with the prior art, the beneficial effects of this invention are as follows. By injecting reference information within a time window and releasing the constraint afterwards, the invention resolves the tension between texture preservation and semantic editing flexibility. It treats the previous round's image as the ideal target and, combined with the current noise state, constructs a reference flow field that conforms to the linear trajectory characteristics of flow matching. The high-frequency components of this field are then superimposed as correction terms on the current model's prediction, achieving plug-and-play enhancement without training. Furthermore, the invention performs frequency domain decomposition directly on the velocity field using a Laplacian pyramid. By separating the high-frequency components of the velocity field, it extracts and controls texture details at the level of the generation dynamics rather than at the pixel level. This reduces subject identity distortion and texture detail loss in multi-round image editing tasks and thereby improves the accuracy of image editing.
Brief Description of the Drawings
[0027] Figure 1 is a flowchart of the steps of a multi-round image editing method based on high-frequency detail information injection;
Figure 2 is a schematic diagram of a multi-round image editing system based on high-frequency detail information injection;
Figure 3 is a comparative schematic diagram of animal-related images generated using the method of this invention and existing technologies;
Figure 4 is a comparative schematic diagram of item-related images generated using the method of this invention and existing technologies.
Detailed Implementation
[0028] To make the objectives, technical solutions, and advantages of this application clearer, the following detailed description is provided in conjunction with the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative and are not intended to limit the scope of the invention.
[0029] Example 1 Embodiment 1 of this application provides a multi-round image editing method based on high-frequency detail information injection which, as shown in Figure 1, includes:
S1: Obtain the image edited in the previous round from the flow-matching-based multi-round instruction image editing model as the reference image;
S2: During the current round of image editing, following the generation sequence of the model, construct a reference velocity field within a preset time window from the reference image and the noise image at the current moment;
S3: Perform Laplacian pyramid decomposition on the model's own predicted velocity field and on the reference velocity field to obtain their respective multi-scale Laplacian pyramid representations, each comprising high-frequency detail layers and a low-frequency base layer;
S4: Perform weighted fusion of each high-frequency detail layer of the reference velocity field with the corresponding high-frequency detail layer of the predicted velocity field, and use the low-frequency base layer of the predicted velocity field as the fused low-frequency base layer, to construct the fused Laplacian pyramid;
S5: Reverse-reconstruct the fused Laplacian pyramid to obtain the edit velocity field, and use the edit velocity field and the predicted velocity field to denoise the model generation process and generate the image after the current round of editing.
[0030] The specific contents of steps S1-S5 are as follows:
1. Construction of the reference flow field based on a time window: Rather than applying constraints throughout the entire generation process, this invention exploits the temporal characteristics of the generation process of a flow-matching-based multi-round image editing model and constructs the reference flow field only within a time window.
[0031] Temporal window mechanism: The early stages of the image generation process (e.g., the first 30% of denoising steps) are a critical window for establishing high-frequency texture features. The construction and injection of the reference flow field are only activated within this temporal window.
[0032] Flow field construction: At any time $t$ within this time window, using the image from the previous round of editing $x_{\mathrm{ref}}$ (corresponding to time $t_{\mathrm{end}}$) and the current noise state $x_t$ (corresponding to time $t$), a reference velocity field $v_{\mathrm{ref}}$ pointing to the ideal texture target is constructed based on the linear interpolation assumption:
$$v_{\mathrm{ref}} = \frac{x_{\mathrm{ref}} - x_t}{t_{\mathrm{end}} - t}$$
[0033] This construction ensures that the flow field receives clear high-frequency guidance in the critical early stages.
[0034] Here, $x_{\mathrm{ref}}$ is the image after the previous round of editing, $x_t$ is the noisy image at the current step, $t_{\mathrm{end}}$ is the final time step, and $t$ is the current time step.
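As a consistency check on the linear interpolation assumption, suppose the reference velocity is written as $v_{\mathrm{ref}} = (x_{\mathrm{ref}} - x_t)/(t_{\mathrm{end}} - t)$ (notation assumed here). A single Euler step over the remaining time then lands exactly on the reference image:

$$x_t + (t_{\mathrm{end}} - t)\, v_{\mathrm{ref}} = x_t + (x_{\mathrm{ref}} - x_t) = x_{\mathrm{ref}}.$$

Under the common flow-matching convention in which pure noise sits at $t = 1$ and the clean image at $t_{\mathrm{end}} = 0$ (an assumption; the text does not state which convention is used), the expression reduces to $v_{\mathrm{ref}} = (x_t - x_{\mathrm{ref}})/t$.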
[0035] 2. Frequency domain decomposition of the flow field based on the Laplacian pyramid: Frequency domain decomposition and reconstruction are used to fuse the features of the predicted velocity field ($v_{\mathrm{pred}}$) and the reference velocity field ($v_{\mathrm{ref}}$). Specifically, this includes the following steps:
(1) Constructing multi-scale Laplacian pyramids: Laplacian pyramid decomposition is performed on both $v_{\mathrm{pred}}$ and $v_{\mathrm{ref}}$, decomposing each input velocity field into an $N$-level feature pyramid, $\mathcal{P}_{\mathrm{pred}}$ and $\mathcal{P}_{\mathrm{ref}}$ respectively. Layers $0$ to $N-2$ are high-frequency detail layers (containing texture and edge residual information), and layer $N-1$ is the low-frequency base layer (containing the overall image structure and semantic color information).
[0036] Detailed construction process: For the input velocity field $v$ (i.e., $v_{\mathrm{pred}}$ or $v_{\mathrm{ref}}$), first construct the corresponding Gaussian pyramid. Let the Gaussian feature of the 0th layer be:
$$G_0 = v$$
[0037] Subsequently, Gaussian smoothing and downsampling are applied recursively to generate the features of each subsequent layer:
$$G_{i+1} = \mathrm{Down}\big(\mathrm{Gauss}(G_i)\big), \quad i = 0, \dots, N-2$$
[0038] Based on this, a Laplacian pyramid is constructed. For the i-th layer, the Gaussian features of the (i+1)-th layer are first upsampled to the same resolution as the i-th layer:
$$\tilde{G}_i = \mathrm{Up}(G_{i+1})$$
[0039] Then, the upsampled approximation is subtracted from the Gaussian features of the current layer to obtain the high-frequency detail features of the i-th layer:
$$L_i = G_i - \tilde{G}_i$$
[0040] where $L_i$ represents the high-frequency residual information at the current scale, mainly including local texture, edge variations, and fine-grained structural features. For the top, (N-1)-th layer, no further differencing is performed; instead, the coarsest-scale Gaussian features are used directly as the low-frequency base layer, $L_{N-1} = G_{N-1}$.
[0041] Finally, the multi-scale Laplacian pyramid representations of the predicted velocity field and the reference velocity field are obtained respectively:
[0042] $$\mathcal{P}_{\mathrm{pred}} = \{L_i^{\mathrm{pred}}\}_{i=0}^{N-1}, \qquad \mathcal{P}_{\mathrm{ref}} = \{L_i^{\mathrm{ref}}\}_{i=0}^{N-1}$$
[0043] In this decomposition, layers 0 to N-2 are high-frequency detail layers, and layer N-1 is the low-frequency base layer. This decomposition explicitly decouples multi-scale local detail information from global structural information in the velocity field, providing a foundation for the subsequent layer-wise fusion.
[0044] (2) Differentiated feature fusion: The layers $\hat{L}_i$ of the fused Laplacian pyramid $\hat{\mathcal{P}}$ are constructed with a processing strategy that depends on the attributes of each layer.
For the high-frequency detail layers ($i = 0, \dots, N-2$): a weighted blending strategy is employed to incorporate texture details from the reference velocity field, based on the guidance coefficient $\lambda_i$ of the corresponding layer. The calculation formula is:
[formula not reproduced in the source text]
[0045] where $L_i^{\mathrm{pred}}$ and $L_i^{\mathrm{ref}}$ are the high-frequency components of the predicted velocity field and the reference velocity field at the i-th layer, respectively.
[0046] For the low-frequency base layer ($i = N-1$): a structure-locking strategy is employed. To ensure that the semantic structure of the final edit velocity field remains consistent with the predicted velocity field, the low-frequency component of the predicted velocity field is used directly, without weighted mixing: $\hat{L}_{N-1} = L_{N-1}^{\mathrm{pred}}$.
[0047] (3) Reverse reconstruction of the pyramid: The final edit velocity field $v_{\mathrm{edit}}$ is reconstructed from the fused pyramid $\hat{\mathcal{P}}$. Reconstruction proceeds from coarse to fine: first, the current result is initialized to the fused low-frequency base layer $\hat{L}_{N-1}$. Then, traversing in reverse from index $N-2$ to $0$, the current result is upsampled to double its resolution and added pixel-wise to the fused detail component $\hat{L}_i$ of that level. After the traversal is complete, the output is the final edit velocity field, which incorporates the details of the reference velocity field while preserving the structure of the predicted velocity field. The edit velocity field is used for denoising within the preset time window (the first 30% of the time steps), after which the predicted velocity field is used for denoising as usual, finally producing the edited image.
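Steps (1) to (3) can be combined into one per-step computation of the edit velocity field, sketched below in NumPy. Several choices are assumptions made for illustration and are not stated in the text: the binomial blur kernel, nearest-neighbour upsampling, the convex blend used for the detail layers, the time convention (noise at t = 1, image at t = 0), and all function names.

```python
import numpy as np

KERNEL = np.array([1.0, 4.0, 6.0, 4.0, 1.0]) / 16.0  # ASSUMED Gaussian approximation

def blur(x):
    """Separable blur over the last two axes with reflect padding."""
    for axis in (-2, -1):
        pad = [(0, 0)] * x.ndim
        pad[axis] = (2, 2)
        xp = np.pad(x, pad, mode="reflect")
        n = x.shape[axis]
        x = sum(w * np.take(xp, np.arange(k, k + n), axis=axis)
                for k, w in enumerate(KERNEL))
    return x

def upsample(x, shape):
    """Nearest-neighbour upsampling, cropped to the target (H, W)."""
    up = np.repeat(np.repeat(x, 2, axis=-2), 2, axis=-1)
    return up[..., :shape[0], :shape[1]]

def decompose(v, n):
    """Laplacian pyramid [L_0, ..., L_{n-2}, base] of a field v."""
    g, layers = np.asarray(v, dtype=np.float64), []
    for _ in range(n - 1):
        g_next = blur(g)[..., ::2, ::2]
        layers.append(g - upsample(g_next, g.shape[-2:]))
        g = g_next
    return layers + [g]

def reconstruct(layers):
    """Coarse-to-fine collapse of a pyramid."""
    out = layers[-1]
    for d in reversed(layers[:-1]):
        out = d + upsample(out, d.shape[-2:])
    return out

def edit_velocity(v_pred, x_ref, x_t, t, t_end, guidance):
    """Edit velocity for one step: reference details blended in, predicted base kept."""
    v_ref = (x_ref - x_t) / (t_end - t)          # linear-interpolation reference field
    n = len(guidance) + 1
    p_pred, p_ref = decompose(v_pred, n), decompose(v_ref, n)
    # ASSUMPTION: convex blend per detail layer; the exact formula is not in the text.
    fused = [(1.0 - lam) * a + lam * b
             for a, b, lam in zip(p_pred[:-1], p_ref[:-1], guidance)]
    fused.append(p_pred[-1])                     # structure locking: predicted base layer
    return reconstruct(fused)

rng = np.random.default_rng(0)
x_ref, x_t, v_pred = (rng.standard_normal((4, 32, 32)) for _ in range(3))
v_off = edit_velocity(v_pred, x_ref, x_t, t=0.9, t_end=0.0, guidance=[0.0, 0.0])
v_on = edit_velocity(v_pred, x_ref, x_t, t=0.9, t_end=0.0, guidance=[0.5, 0.3])
```

With all guidance coefficients at zero the function returns the predicted velocity unchanged, so the same code path can serve both inside and outside the time window. Note that under this assumed blend the change `v_on - v_pred` is carried entirely by the detail layers.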
[0048] Example 2 Embodiment 2 of this application, based on Embodiment 1, provides a multi-round image editing system based on high-frequency detail information injection which, as shown in Figure 2, includes:
a reference image acquisition module 201, used to acquire the image after the previous round of editing in the multi-round instruction image editing model as the reference image, and to acquire the noise image at the current moment;
a reference flow construction module 202, used to construct a reference velocity field within a preset time window during the current round of image editing, following the model's generation sequence, from the reference image and the noise image at the current moment;
a frequency domain decomposition module 203, used to perform Laplacian pyramid decomposition on the predicted velocity field and the reference velocity field respectively to obtain their respective multi-scale Laplacian pyramid representations, each comprising high-frequency detail layers and a low-frequency base layer;
a fusion module 204, used to perform weighted fusion of each high-frequency detail layer of the reference velocity field with the corresponding high-frequency detail layer of the predicted velocity field, and to use the low-frequency base layer of the predicted velocity field as the fused low-frequency base layer, to construct the fused Laplacian pyramid;
a reconstruction and generation module 205, used to reverse-reconstruct the fused Laplacian pyramid to obtain the edit velocity field, and to use the edit velocity field and the predicted velocity field to denoise the model generation process and generate the image after the current round of editing.
[0049] Example 3 In Embodiment 3, image editing models that apply the multi-round image editing method based on high-frequency detail information injection of the present invention are compared with image editing models that do not apply it (the original models) in multi-round editing of the same initial image, as shown in Figure 3 and Figure 4. The leftmost image is the original image. In every column, the images correspond to the 1st to 10th iterations from top to bottom. The first and second columns show the method of this invention (FreqEdit) applied to two different models (Qwen-Image-Edit and Flux.1 Kontext). The third and fourth columns show the original models (Qwen-Image-Edit and Flux.1 Kontext). Columns 5 to 9 show other existing models (Nano Banana, VINCIE, MTC, Seedream 4.0, Bagel) that do not use the method of this invention. As Figure 3 and Figure 4 show, after multiple rounds of editing and scene/atmosphere changes, images produced with the method of this invention retain more texture detail and have higher image quality than those of the original models and the other existing models. The main subject in images generated by the original models and the other existing models is significantly distorted (e.g., the animal's face, feet, and posture in Figure 3; the bottle's size and texture in Figure 4), whereas images edited with the method of this invention show less deformation of subject identity and higher image quality.
[0050] In summary, this invention provides a multi-round image editing method and system based on high-frequency detail information injection, which reduces subject identity deformation and texture detail loss in multi-round image editing tasks and improves the accuracy of image editing.
[0051] The above description is only a preferred embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any changes or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in the present invention should be included within the scope of protection of the present invention.
Claims
1. A multi-round image editing method based on high-frequency detail information injection, characterized in that it includes the following steps: in a flow-matching-based multi-round instruction image editing process, the image edited in the previous round is used as a reference image; during the current round of image editing, following the generation sequence, a reference velocity field is constructed within a preset time window from the reference image and the noise image at the current moment; the predicted velocity field of the current round of image editing and the reference velocity field are each decomposed into a Laplacian pyramid to obtain their respective multi-scale Laplacian pyramid representations, each comprising high-frequency detail layers and a low-frequency base layer; each high-frequency detail layer of the reference velocity field is weighted and fused with the corresponding high-frequency detail layer of the predicted velocity field, and the low-frequency base layer of the predicted velocity field is used as the fused low-frequency base layer, to construct the fused Laplacian pyramid; the fused Laplacian pyramid is reverse-reconstructed to obtain the edit velocity field, and the edit velocity field and the predicted velocity field are used to denoise the model generation process, generating the image after the current round of editing.
2. The method according to claim 1, characterized in that, The preset time window corresponds to the early stage of the flow matching model generation process, which is the first 30% of the total denoising steps.
3. The method according to claim 1, characterized in that the reference velocity field is constructed using linear interpolation, specifically as follows:
$$v_{\mathrm{ref}} = \frac{x_{\mathrm{ref}} - x_t}{t_{\mathrm{end}} - t}$$
where $v_{\mathrm{ref}}$ is the reference velocity field, $x_{\mathrm{ref}}$ is the reference image, $x_t$ is the noise image at the current time $t$, and $t_{\mathrm{end}}$ is the final time step.
4. The method according to claim 3, characterized in that the specific steps for performing Laplacian pyramid decomposition on a velocity field include: for the input velocity field $v$, first construct the corresponding Gaussian pyramid, and let the Gaussian feature of the 0th layer be:
$$G_0 = v$$
Subsequent layers of features are generated recursively through Gaussian smoothing and downsampling:
$$G_{i+1} = \mathrm{Down}\big(\mathrm{Gauss}(G_i)\big), \quad i = 0, \dots, N-2$$
Based on this, a Laplacian pyramid is constructed. For the i-th layer, the Gaussian features of the (i+1)-th layer are first upsampled to the same resolution as the i-th layer:
$$\tilde{G}_i = \mathrm{Up}(G_{i+1})$$
Then, the upsampled approximation is subtracted from the current layer's Gaussian features to obtain the i-th layer's high-frequency detail features:
$$L_i = G_i - \tilde{G}_i$$
where $L_i$ represents the high-frequency residual information at the current scale, and $\tilde{G}_i$ represents the upsampled approximation of the current layer's Gaussian features. For the topmost layer, i.e., the (N-1)-th layer, no further difference operation is performed; instead, this coarsest-scale Gaussian feature is used directly as the low-frequency base layer, $L_{N-1} = G_{N-1}$. Finally, the multi-scale Laplacian pyramid representations of the predicted velocity field and the reference velocity field are obtained respectively:
$$\mathcal{P}_{\mathrm{pred}} = \{L_i^{\mathrm{pred}}\}_{i=0}^{N-1}, \qquad \mathcal{P}_{\mathrm{ref}} = \{L_i^{\mathrm{ref}}\}_{i=0}^{N-1}$$
where $\mathcal{P}_{\mathrm{pred}}$ and $\mathcal{P}_{\mathrm{ref}}$ are the multi-scale Laplacian pyramid representations of the predicted velocity field and the reference velocity field, respectively; layers 0 to N-2 are high-frequency detail layers, layer N-1 is the low-frequency base layer, and N is the total number of layers in the pyramid decomposition.
5. The method according to claim 4, characterized in that, in the weighted fusion, a weighted sum is performed for each high-frequency detail layer according to a preset guidance coefficient, and the fusion formula is as follows:
[formula not reproduced in the source text]
where $\hat{L}_i$ is the fused i-th high-frequency detail layer; $L_i^{\mathrm{pred}}$ and $L_i^{\mathrm{ref}}$ are the high-frequency components of the predicted velocity field and the reference velocity field at the i-th layer, respectively; and $\lambda_i$ is the preset guidance coefficient for the i-th layer, where i ranges from 0 to N-2.
6. The method according to claim 4, characterized in that the reverse reconstruction of the fused Laplacian pyramid includes: starting from the fused low-frequency base layer and proceeding from coarse to fine (layers N-2 down to 0), the current reconstruction result is upsampled and added pixel-wise to the fused high-frequency detail layer of that level; after the traversal, the edit velocity field is obtained.
7. The method according to claim 1, characterized in that denoising the model generation process using the edit velocity field and the predicted velocity field specifically includes: within the preset time window, the edit velocity field replaces the predicted velocity field for denoising; after the preset time window ends, the predicted velocity field is used for denoising.
8. A multi-round image editing system based on high-frequency detail information injection, employing the multi-round image editing method based on high-frequency detail information injection according to any one of claims 1-7, characterized in that the system includes: a reference image acquisition module, used to acquire the image after the previous round of editing in the multi-round instruction image editing model as the reference image, and to acquire the noise image at the current moment; a reference flow construction module, used to construct a reference velocity field within a preset time window during the current round of image editing, following the model's generation sequence, from the reference image and the noise image at the current moment; a frequency domain decomposition module, used to perform Laplacian pyramid decomposition on the predicted velocity field and the reference velocity field respectively to obtain their respective multi-scale Laplacian pyramid representations, each comprising high-frequency detail layers and a low-frequency base layer; a fusion module, used to perform weighted fusion of each high-frequency detail layer of the reference velocity field with the corresponding high-frequency detail layer of the predicted velocity field, and to use the low-frequency base layer of the predicted velocity field as the fused low-frequency base layer, to construct the fused Laplacian pyramid; a reconstruction and generation module, used to reverse-reconstruct the fused Laplacian pyramid to obtain the edit velocity field, and to use the edit velocity field and the predicted velocity field to denoise the model generation process and generate the image after the current round of editing.
9. The system according to claim 8, characterized in that the reference flow construction module constructs the reference velocity field using linear interpolation, specifically as follows:
$$v_{\mathrm{ref}} = \frac{x_{\mathrm{ref}} - x_t}{t_{\mathrm{end}} - t}$$
where $v_{\mathrm{ref}}$ is the reference velocity field, $x_{\mathrm{ref}}$ is the reference image, $x_t$ is the noise image at the current time $t$, and $t_{\mathrm{end}}$ is the final time step.
10. The system according to claim 8, characterized in that, when the fusion module performs weighted fusion, a weighted sum is performed for each high-frequency detail layer according to a preset guidance coefficient, and the fusion formula is as follows:
[formula not reproduced in the source text]
where $\hat{L}_i$ is the fused i-th high-frequency detail layer; $L_i^{\mathrm{pred}}$ and $L_i^{\mathrm{ref}}$ are the high-frequency components of the predicted velocity field and the reference velocity field at the i-th layer, respectively; and $\lambda_i$ is the preset guidance coefficient for the i-th layer, where i ranges from 0 to N-2 and N is the total number of layers in the pyramid decomposition.