Image style transfer method based on distribution calibration competitive attention mechanism

CN122243725A · Pending · Publication Date: 2026-06-19 · CHONGQING NORMAL UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
CHONGQING NORMAL UNIVERSITY
Filing Date
2026-03-19
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing diffusion-based image style transfer methods ignore the inherent statistical distribution differences between the initial content latent variables and style latent variables in the latent space, resulting in global tone shift, local texture sparsity, and contrast imbalance in the generated images. They lack spatial awareness and feature competition mechanisms, are prone to polarization, and lack effective posterior feature correction methods, causing the generated stylized images to lose the key contours and visual recognizability of the original content.

Method used

By employing a competitive attention mechanism based on distribution calibration, latent variables of content and style images are obtained. High-order distribution alignment of quantiles and low-order statistical matching of mean and variance are performed. Semantic anchors are constructed by combining a visual language model and text prompts. Semantic guided loss is iteratively updated to achieve adaptive scaling and posterior statistical shaping of style features, generating attention aggregation features. Finally, the stylized image is reconstructed through a decoder.

Benefits of technology

It effectively eliminates the statistical mismatch between content features and style features, achieves natural and coherent image style transfer, improves the accuracy of fine-grained brushstroke expression and the main semantic consistency and visual recognizability of the transfer results, alleviates the excessive stacking or blurring degradation of local textures, and suppresses the semantic drift phenomenon caused by denoising iteration.

✦ Generated by Eureka AI based on patent content.


Abstract

This invention relates to the fields of artificial intelligence and computer vision, and particularly to an image style transfer method based on a competitive attention mechanism using distribution calibration. The method includes: S1, acquiring a content image, a style image, and a mask for a specified region; extracting initial content latent variables and initial style latent variables of the content image and the style image at preset time steps through an encoder and a diffusion model inverse process, respectively; simultaneously extracting and caching attention key-value features of the style image at each time step of the inverse process, and constructing a style key-value feature set. This invention effectively eliminates the statistical mismatch between content features and style features in the initial stage by sequentially performing high-order distribution alignment based on quantiles and low-order statistical matching based on mean and variance on the initial content latent variables and initial style latent variables. This constructs a consistent feature basis for the reverse denoising process, thereby achieving natural and coherent image style transfer while preserving the original image structure.

Description

Technical Field

[0001] This invention relates to the fields of artificial intelligence and computer vision, and in particular to an image style transfer method based on a competitive attention mechanism using distribution calibration.

Background Technology

[0002] In recent years, with the development of deep learning and generative artificial intelligence, image generation technology based on pre-trained latent diffusion models (such as Stable Diffusion) has achieved significant breakthroughs. In the field of image style transfer, existing mainstream training-free methods typically use the deterministic inverse process of the diffusion model to invert the content image and the style image to the initial noisy state of the latent space. In the subsequent reverse denoising iterations, the target artistic style is transferred to the content image by manipulating the feature representations within the diffusion model network (e.g., replacing or injecting key-value features of the style image into the self-attention layers). These methods do not need large amounts of computational resources to retune the massive model weights, and can give the original image specific color palettes, brushstrokes, and texture features in scenarios such as digital art creation and image editing, which makes them highly flexible in application.

[0003] However, existing diffusion-based image style transfer methods still have several significant shortcomings in practical applications. First, most existing methods ignore the inherent statistical distribution differences between the initial content latent variables and style latent variables in the latent space. This mismatch in feature starting points easily leads to global tone shift in the generated image, which accumulates and is amplified during reverse denoising, resulting in local texture sparsity and contrast imbalance. Second, in the feature fusion stage, existing methods often use simple replacement at the global scale or constant-intensity attention injection, lacking spatial awareness and feature competition mechanisms. This makes the model prone to polarization in complex scenes: either insufficient style injection leads to over-smoothing and detail collapse, or excessive style injection causes severe perturbation of the content structure and stacking of block artifacts. Finally, in the long, multi-step iterative denoising process, existing models lack effective posterior feature correction and semantic constraints, which easily leads to uncontrolled semantic drift during feature modification, ultimately causing the generated stylized image to lose the key contours and visual recognizability of the original content.

Summary of the Invention

[0004] To overcome the above shortcomings, this invention provides an image style transfer method based on a competitive attention mechanism using distribution calibration, which aims to solve the problem that the prior art ignores the inherent statistical distribution differences between the initial content latent variables and the style latent variables in the latent space.

[0005] This invention provides the following technical solution: an image style transfer method based on a competitive attention mechanism using distribution calibration, comprising:
S1. Obtain a content image, a style image, and a mask for a specified region. Extract the initial content latent variables and initial style latent variables of the content image and the style image at a preset time step through an encoder and the inverse process of a diffusion model, respectively. At the same time, extract and cache the attention key-value features of the style image at each time step of the inverse process to construct a style key-value feature set.
S2. Sequentially perform high-order distribution alignment based on quantiles and low-order statistical matching based on mean and variance on the initial content latent variables and the initial style latent variables to generate initial fusion latent variables.
S3. Using the initial fusion latent variables as the initial input for reverse denoising, generate query features and current key features from the current latent variable features in the predefined set of self-attention layers at each denoising time step. Combine them with the style key-value features of the corresponding time step retrieved from the style key-value feature set, and perform joint normalization in the shared space with the mask serving as a weighting condition for spatial and time-step mixing. After query-by-query adaptive scaling and posterior statistical shaping, attention aggregation features are generated.
S4. Based on the attention aggregation features, calculate the current preliminary denoised latent variable, extract the image embedding of the current preliminary denoised latent variable through a visual language model, and calculate the semantic guidance loss using the semantic anchors constructed from text prompts. Iteratively update the preliminary denoised latent variable along the loss gradient to obtain the semantic correction latent variable, which is the input of the next time step.
S5. Repeat steps S3 and S4 until the preset denoising time steps are completed. Input the denoised latent variables into the decoder to reconstruct the image and obtain a stylized image.

[0006] Preferably, in S1, constructing the style key-value feature set specifically includes the following steps: the content image and the style image are each input into the encoder and mapped to latent space features, wherein the encoder is the encoder of a pre-trained variational autoencoder; the inverse process of the diffusion model is performed on the latent space features to obtain the initial content latent variables and initial style latent variables at a preset time step, wherein the inverse process is the unidirectional inverse process of a denoising diffusion implicit model (DDIM); at each discrete time step of the inverse process, attention key-value features are extracted from the predefined set of self-attention layers of the diffusion model and cached to form a style key-value feature set indexed by time step.

[0007] Preferably, in step S2, the generation of the initial fusion latent variables specifically includes the following steps: The initial content latent variables and the initial style latent variables are flattened in the spatial dimension, and the quantile correspondence between the initial content latent variables and the initial style latent variables is established based on the rank structure of the flattened features. A linear interpolation update is performed between the marginal distributions of the initial content latent variables and the initial style latent variables using a first intensity coefficient; Calculate the mean and standard deviation of the linearly interpolated updated features in the spatial dimension, and perform scaling and translation on the linearly interpolated updated features based on the mean and standard deviation of the initial style latent variables; The second intensity coefficient is used to perform quantile approximation on the scaled and translated features again, and the initial fusion latent variables are output.

[0008] Preferably, in S3, the generation of attention aggregation features specifically includes the following steps: the content query features corresponding to the current time step are obtained from an independently executed reverse denoising content branch, and the content query features and the query features are linearly fused according to a preset fusion coefficient to obtain the fused query features; the first unnormalized attention score is calculated from the fused query features and the style key features, and the second unnormalized attention score is calculated from the fused query features and the current key features, wherein a preset style temperature adjustment coefficient is used to numerically scale the first unnormalized attention score; a continuous gate is constructed from the statistical difference between the log-sum-exp values of the first unnormalized attention score and the second unnormalized attention score, and a query-by-query adaptive scaling coefficient is generated from it; using the query-by-query adaptive scaling coefficient, position-adaptive scaling is performed only on the first unnormalized attention score; a Top-K sparsity constraint is applied to the scaled first unnormalized attention score, which is then concatenated with the second unnormalized attention score along the key dimension, and softmax joint normalization is performed in the same probability space to obtain the joint attention distribution and the preliminary aggregated output; local statistical anchors are generated from the joint attention distribution, affine offsets are constructed by combining them with the standardized fused query features, and gated residual offset injection is performed on the preliminary aggregated output to generate the attention aggregation features.

[0009] Preferably, in S3, the joint normalization with the mask serving as a weighting condition for spatial and time-step mixing specifically includes the following steps: dilation and smoothing operations are performed on the mask based on a preset dilation radius and smoothing kernel size; based on the dilated and smoothed mask, masking is applied to the probability space of the joint normalization.

[0010] Preferably, in S4, obtaining the semantic correction latent variable as the input for the next time step specifically includes the following steps: based on the attention aggregation features, the conditional noise prediction and unconditional noise prediction for the current time step are obtained, and the noise estimate is synthesized through a classifier-free guidance mechanism to obtain the preliminary denoised latent variable; the preliminary denoised latent variable is decoded and input into the image encoder of the visual language model to obtain the corresponding normalized image embedding; using the text encoder of the visual language model, the preset content prompts and style prompts in the text prompts are encoded to form a set of positive semantic anchors, and the preset negative prompts in the text prompts are encoded to form a set of negative semantic anchors; the inner product similarity between the normalized image embedding and each semantic anchor in the positive and negative sets is calculated to construct the semantic guidance loss; the preliminary denoised latent variable is updated for a fixed number of steps along the negative gradient direction of the semantic guidance loss, and the semantic correction latent variable is output.
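The classifier-free guidance and semantic correction steps described in the paragraph above can be illustrated with a minimal sketch. It rests on several assumptions that are not in the patent text: the loss is taken as the mean negative-anchor similarity minus the mean positive-anchor similarity, `embed` is a hypothetical stand-in for "decode, then apply the visual language model's image encoder", and a finite-difference gradient replaces the automatic differentiation a real implementation would use. The names `cfg_noise`, `semantic_loss`, and `refine` are illustrative only.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def normalize(v):
    norm = math.sqrt(dot(v, v)) or 1.0  # avoid division by zero
    return [x / norm for x in v]

def cfg_noise(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: move from the unconditional towards the conditional prediction."""
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def semantic_loss(image_emb, pos_anchors, neg_anchors):
    """Assumed loss form: mean similarity to negative anchors minus mean similarity to positive ones."""
    e = normalize(image_emb)
    pos = sum(dot(e, normalize(a)) for a in pos_anchors) / len(pos_anchors)
    neg = sum(dot(e, normalize(a)) for a in neg_anchors) / len(neg_anchors)
    return neg - pos

def refine(z, embed, pos_anchors, neg_anchors, steps=3, lr=0.1, h=1e-4):
    """Update the latent for a fixed number of steps along the negative loss gradient.

    embed(z) is a hypothetical callable mapping a latent to an image embedding.
    The gradient is approximated by forward finite differences for illustration.
    """
    z = list(z)
    for _ in range(steps):
        base = semantic_loss(embed(z), pos_anchors, neg_anchors)
        grad = []
        for i in range(len(z)):
            shifted = z[:]
            shifted[i] += h
            grad.append((semantic_loss(embed(shifted), pos_anchors, neg_anchors) - base) / h)
        z = [v - lr * g for v, g in zip(z, grad)]
    return z
```

With an identity `embed`, a positive anchor `[1.0, 0.0]`, and a negative anchor `[0.0, 1.0]`, a latent starting at `[1.0, 1.0]` is pulled towards the positive anchor and the loss decreases.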

[0011] Preferably, in S5, the process of repeatedly executing steps S3 and S4 until the preset denoising time step is completed, and then inputting the latent variables after denoising into the decoder for image reconstruction, specifically includes the following steps: According to the reverse denoising discrete time step sequence consisting of preset denoising time steps, the semantic correction latent variable output at the current time step is used as the input of the next time step, and the process is iterated until the time step is zero, and the latent variable after denoising is output. The latent variables after denoising are input into the decoder; The decoder maps latent variables back to pixel space, outputting a stylized image.

[0012] The present invention has the following beneficial effects: 1. In this invention, by sequentially performing high-order distribution alignment based on quantiles and low-order statistical matching based on mean and variance on the initial content latent variables and initial style latent variables, the statistical mismatch between content features and style features in the initial stage is effectively eliminated, and a consistent feature base is constructed for the reverse denoising process, thereby achieving natural and coherent image style transfer while preserving the original image structure.

[0013] 2. In this invention, a competitive mechanism is introduced into the self-attention layer. Style key features and content features are used for joint normalization calculation. Combined with query-by-query adaptive scaling and posterior statistical shaping operations, the positional adaptive fusion of the two within the target area is achieved, which effectively alleviates the excessive stacking or blurring degradation of local textures and improves the accuracy of fine-grained brush stroke expression.

[0014] 3. In this invention, image embeddings of the latent variables are extracted using a visual language model, and the semantic guidance loss is calculated using semantic anchors constructed from text prompts. The latent variables are iteratively updated and corrected along the negative gradient direction of the loss, providing effective posterior semantic constraints for the generation process. This significantly suppresses the semantic drift caused by the denoising iterations and ensures the main semantic consistency and visual recognizability of the transfer results.

Attached Figure Description

[0015] The following figures are provided in embodiments of the present invention:
Figure 1 is a schematic diagram of the overall process of the image style transfer method based on a competitive attention mechanism using distribution calibration.
Figure 2 is a schematic diagram of the network processing mechanism for competitive attention fusion and semantic correction.
Figure 3 shows qualitative comparison results of various image style transfer methods under different artistic styles.
Figure 4 shows qualitative comparison results of the ablation experiments for the core processing mechanisms.
Figure 5 shows qualitative comparison results of the ablation of internal computational terms in the competitive attention mechanism.
Figure 6 shows qualitative comparison results of the ablation of key hyperparameters of the competitive attention mechanism.
Figure 7 shows qualitative comparison results of the ablation of key hyperparameters of the semantic guidance correction mechanism.
Figure 8 is a comparison chart of quantitative indicators among different image style transfer methods.
Figure 9 is a comparison chart of quantitative indicators for the ablation experiments of the core processing mechanisms.
Figure 10 is a comparison chart of quantitative indicators for the ablation of parameters in the distribution calibration mechanism.
Figure 11 is a comparison chart of quantitative indicators for the ablation of the components and hyperparameters of the competitive attention mechanism.

Detailed Implementation

[0016] The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the present invention.

[0017] This invention provides an image style transfer method based on a competitive attention mechanism using distribution calibration. As shown in Figures 1-11, it includes the following steps: S1. Obtain the content image, the style image, and a mask for a specified region. Extract the initial content latent variables and initial style latent variables of the content image and the style image at a preset time step through an encoder and the inverse process of a diffusion model, respectively. At the same time, extract and cache the attention key-value features of the style image at each time step of the inverse process to construct a style key-value feature set. Furthermore, in S1, constructing the style key-value feature set specifically includes the following steps: the content image and the style image are each input into the encoder and mapped to latent space features, where the encoder is the encoder of a pre-trained variational autoencoder; the inverse process of the diffusion model is performed on the latent space features to obtain the initial content latent variables and the initial style latent variables at a preset time step, where the inverse process is the unidirectional inverse process of a denoising diffusion implicit model (DDIM); at each discrete time step of the inverse process, attention key-value features are extracted from the predefined set of self-attention layers of the diffusion model and cached to form a style key-value feature set indexed by time step.

[0018] Specifically, the content image, the style image, and a mask for a specified region are first acquired. The mask is obtained through user-interactive drawing combined with a semantic segmentation algorithm. The content image and the style image are then input into the encoder of a pre-trained variational autoencoder, which maps the high-dimensional image data in pixel space to a low-dimensional latent space to obtain the corresponding latent space features. Let the input content image be $I_c$, the style image be $I_s$, and the pre-trained encoder be $\mathcal{E}$. The encoding and mapping operation yields the base latent variable of the content image in the initial state, $z_0^c = \mathcal{E}(I_c)$, and the base latent variable of the style image in the initial state, $z_0^s = \mathcal{E}(I_s)$. The mask serves as a spatial constraint in subsequent calculations, defining the boundary of the region into which style features are injected.

After the latent space features are obtained, the unidirectional inverse process of the denoising diffusion implicit model is performed on the base latent variables. This inverse process follows the deterministic ordinary differential equation trajectory of the pre-trained diffusion model under unconditional guidance, progressively adding noise to the latent variables from time step zero up to the preset maximum time step $T$. The step from the latent variable at the current time step $t$ to the next inverse time step $t+1$ is expressed as:

$$z_{t+1} = \sqrt{\bar{\alpha}_{t+1}}\,\frac{z_t - \sqrt{1-\bar{\alpha}_t}\,\epsilon_\theta(z_t, t)}{\sqrt{\bar{\alpha}_t}} + \sqrt{1-\bar{\alpha}_{t+1}}\,\epsilon_\theta(z_t, t)$$

In this formula, $z_t$ and $z_{t+1}$ are the latent variable states at time steps $t$ and $t+1$, $\bar{\alpha}_t$ and $\bar{\alpha}_{t+1}$ are the preset cumulative noise schedule parameters, $\theta$ denotes the weight parameters of the denoising network, and $\epsilon_\theta(z_t, t)$ is the noise tensor predicted by the denoising network for the current latent variable at time step $t$. Iterating this formula yields the noise tensors at the preset maximum time step, namely the initial content latent variable $z_T^c$ and the initial style latent variable $z_T^s$.

When the inverse process is performed on the latent space features of the style image, the feature representations inside the denoising network must also be extracted and retained at each discrete time step. At each discrete time step $t$ of the inverse process, the attention key features $K_t^s$ and attention value features $V_t^s$ generated from the style image at that time step are extracted from the predefined set of self-attention layers of the diffusion model. The attention key-value features extracted at all time steps are cached to form a style key-value feature set indexed by time step. This step extracts the latent-space structural basis of the images and constructs a style key-value feature set of fine-grained feature tensors, providing the underlying data for the subsequent feature alignment and attention calculation.
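The inversion and key-value caching described above can be sketched as follows. This is a minimal illustration on flat lists of floats, assuming the standard deterministic DDIM inversion update; `predict_noise` and `extract_kv` are hypothetical stand-ins for the denoising network and its self-attention hooks, and `alpha_bars` is an assumed list of cumulative schedule values ordered from time step zero upward.

```python
import math

def ddim_inversion_step(z_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step from time step t to t+1.

    z_t and eps are flat lists of floats; eps is the noise predicted by the
    denoising network for (z_t, t).
    """
    sqrt_t, sqrt_next = math.sqrt(alpha_bar_t), math.sqrt(alpha_bar_next)
    noise_t, noise_next = math.sqrt(1.0 - alpha_bar_t), math.sqrt(1.0 - alpha_bar_next)
    out = []
    for z, e in zip(z_t, eps):
        z0_pred = (z - noise_t * e) / sqrt_t          # predicted clean latent
        out.append(sqrt_next * z0_pred + noise_next * e)
    return out

def invert(z0, alpha_bars, predict_noise, extract_kv=None):
    """Run the inversion over the schedule, optionally caching key-value features.

    predict_noise(z, t) and extract_kv(z, t) are hypothetical callables.
    Returns the final noisy latent and a cache indexed by time step.
    """
    z, kv_cache = list(z0), {}
    for t in range(len(alpha_bars) - 1):
        if extract_kv is not None:
            kv_cache[t] = extract_kv(z, t)            # style key-value set, keyed by time step
        eps = predict_noise(z, t)
        z = ddim_inversion_step(z, eps, alpha_bars[t], alpha_bars[t + 1])
    return z, kv_cache
```

In this sketch only the style image would be inverted with `extract_kv` supplied; the content image would use the same loop without caching.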

[0019] S2. Perform high-order distribution alignment based on quantiles and low-order statistical matching based on mean and variance on the initial content latent variables and initial style latent variables in sequence to generate initial fusion latent variables; Furthermore, in S2, generating the initial fusion latent variables specifically includes the following steps: The initial content latent variables and initial style latent variables are flattened in the spatial dimension, and the quantile correspondence between the initial content latent variables and the initial style latent variables is established based on the rank structure of the flattened features. A linear interpolation update is performed between the marginal distributions of the initial content latent variables and the initial style latent variables using the first intensity coefficient; Calculate the mean and standard deviation of the features updated by linear interpolation in the spatial dimension, and perform scaling and translation on the features updated by linear interpolation based on the mean and standard deviation of the initial style latent variables; The second intensity coefficient is used to perform quantile approximation on the scaled and translated features again, and the initial fusion latent variables are output.

[0020] Specifically, the initial content latent variable $z_T^c$ and the initial style latent variable $z_T^s$ at the preset maximum time step $T$ are obtained from the preceding step. They are then sequentially subjected to high-order distribution alignment based on quantiles and low-order statistical matching based on mean and variance to generate the initial fusion latent variable.

First, the initial content latent variable and the initial style latent variable are flattened in the spatial dimension, and the quantile correspondence between them is established from the rank structure of the flattened features. Let the flattened initial content latent variable be $x$, the flattened initial style latent variable be $y$, the sort-index operation be $\operatorname{argsort}(\cdot)$, and the ascending sorting function be $\operatorname{sort}(\cdot)$. The relative position rank of the initial content latent variable is expressed as:

$$r = \operatorname{argsort}(\operatorname{argsort}(x))$$

where $r_i$ is the position index of the tensor element $x_i$ within its own distribution. The mapping that looks up the corresponding feature values in the marginal distribution of the initial style latent variable from these position indices, and thereby constructs the quantile correspondence, is expressed as:

$$x^{q}_i = \operatorname{sort}(y)_{r_i}$$

where $x^{q}$ is the high-order aligned feature whose marginal distribution matches that of the style exactly. A linear interpolation update is then performed between the marginal distributions of the initial content latent variable and the initial style latent variable using the first intensity coefficient $\lambda_1$:

$$x' = (1 - \lambda_1)\,x + \lambda_1\,x^{q}$$

where $x'$ is the feature updated by linear interpolation.

After the linear interpolation, low-order statistical matching based on mean and variance is performed. The mean $\mu'$ and standard deviation $\sigma'$ of $x'$ are calculated in the spatial dimension, together with the mean $\mu_s$ and standard deviation $\sigma_s$ of the initial style latent variable. The interpolated feature is scaled and translated based on the mean and standard deviation of the initial style latent variable, with a small numerical stability constant $\varepsilon$ to prevent division by zero:

$$x'' = \sigma_s\,\frac{x' - \mu'}{\sigma' + \varepsilon} + \mu_s$$

where $x''$ is the feature after scaling and translation. Finally, quantile approximation is performed again on the scaled and translated feature using the second intensity coefficient $\lambda_2$. The second quantile-matched feature $x''^{q}$ is obtained from the rank structure of the current feature and the initial style latent variable, and the second approximation is expressed as:

$$\hat{z}_T = (1 - \lambda_2)\,x'' + \lambda_2\,x''^{q}$$

where $\hat{z}_T$ is the initial fusion latent variable that is finally output. This step calibrates both the high-order and low-order statistics of the content features towards the style features.
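The three calibration stages described above (quantile alignment, mean and standard deviation matching, and a second quantile approximation) can be sketched for a single flattened channel. This is a minimal illustration under stated assumptions: both inputs are flat lists of equal length, the population standard deviation is used, and the names `quantile_match`, `calibrate`, `lam1`, and `lam2` are illustrative rather than taken from the patent.

```python
import math

def quantile_match(content, style):
    """Map each content value to the style value of equal rank (equal-length flat lists)."""
    order = sorted(range(len(content)), key=lambda i: content[i])
    style_sorted = sorted(style)
    matched = [0.0] * len(content)
    for rank, idx in enumerate(order):
        matched[idx] = style_sorted[rank]
    return matched

def lerp(a, b, w):
    """Element-wise linear interpolation from a (w=0) to b (w=1)."""
    return [(1.0 - w) * x + w * y for x, y in zip(a, b)]

def mean_std(x):
    m = sum(x) / len(x)
    return m, math.sqrt(sum((v - m) ** 2 for v in x) / len(x))

def calibrate(content, style, lam1=0.5, lam2=0.5, eps=1e-6):
    """Quantile alignment, then mean/std matching, then a second quantile approximation."""
    z1 = lerp(content, quantile_match(content, style), lam1)    # high-order alignment
    mu, sd = mean_std(z1)
    mu_s, sd_s = mean_std(style)
    z2 = [sd_s * (v - mu) / (sd + eps) + mu_s for v in z1]      # low-order matching
    return lerp(z2, quantile_match(z2, style), lam2)            # second quantile approximation
```

Setting both coefficients to 1.0 reproduces the style's marginal distribution exactly while keeping the content's rank order; setting both to 0.0 leaves only the mean and standard deviation matching.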

[0021] S3. Using the initial fusion latent variable as the initial input for reverse denoising, generate query features and current key features from the current latent variable features in the predefined set of self-attention layers at each denoising time step. Combine them with the style key-value features of the corresponding time step retrieved from the style key-value feature set, and perform joint normalization in the shared space with the mask serving as a weighting condition for spatial and time-step mixing. After query-by-query adaptive scaling and posterior statistical shaping, attention aggregation features are generated. Furthermore, in S3, generating the attention aggregation features specifically includes the following steps: the content query features corresponding to the current time step are obtained from an independently executed reverse denoising content branch, and the content query features and the query features are linearly fused according to a preset fusion coefficient to obtain the fused query features; the first unnormalized attention score is calculated from the fused query features and the style key features, and the second unnormalized attention score is calculated from the fused query features and the current key features, where a preset style temperature adjustment coefficient is used to numerically scale the first unnormalized attention score; a continuous gate is constructed from the statistical difference between the log-sum-exp values of the first unnormalized attention score and the second unnormalized attention score, and a query-by-query adaptive scaling coefficient is generated from it; using the query-by-query adaptive scaling coefficient, position-adaptive scaling is performed only on the first unnormalized attention score; a Top-K sparsity constraint is applied to the scaled first unnormalized attention score, which is then concatenated with the second unnormalized attention score along the key dimension, and softmax joint normalization is performed in the same probability space to obtain the joint attention distribution and the preliminary aggregated output; local statistical anchors are generated from the joint attention distribution, affine offsets are constructed by combining them with the standardized fused query features, and gated residual offset injection is performed on the preliminary aggregated output to generate the attention aggregation features.

[0022] Furthermore, in S3, the joint normalization with the mask serving as a weighting condition for spatial and time-step mixing specifically includes the following steps: dilation and smoothing operations are performed on the mask based on a preset dilation radius and smoothing kernel size; based on the dilated and smoothed mask, masking is applied to the probability space of the joint normalization.
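The mask dilation and smoothing described above can be sketched as follows. The patent text does not specify the structuring element or the smoothing kernel, so this minimal illustration assumes a square dilation window and a box (mean) filter with edge truncation; `dilate` and `box_smooth` are illustrative names.

```python
def dilate(mask, radius):
    """Binary dilation with a square window of the given radius (assumed structuring element)."""
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                mask[j][i]
                for j in range(max(0, y - radius), min(h, y + radius + 1))
                for i in range(max(0, x - radius), min(w, x + radius + 1))
            )
    return out

def box_smooth(mask, kernel):
    """Mean filter with an odd kernel size; windows are truncated at the image border."""
    r = kernel // 2
    h, w = len(mask), len(mask[0])
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            window = [
                mask[j][i]
                for j in range(max(0, y - r), min(h, y + r + 1))
                for i in range(max(0, x - r), min(w, x + r + 1))
            ]
            out[y][x] = sum(window) / len(window)
    return out
```

Dilating first and smoothing second gives a soft mask whose values fall gradually from 1 inside the region to 0 outside it, which is one way to avoid a hard seam at the boundary of the stylized region.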

[0023] Specifically, the initialized fusion latent variables generated in the aforementioned steps are obtained and used as the initial input tensor for inverse denoising. Within the predefined set of self-attention layers at each denoising time step, query features and current key features are generated based on the current latent variable features. These are combined with the style key features retrieved from the style key feature set for the corresponding time step. Joint normalization calculation is performed in the shared space, using the mask as a weighting condition for spatial and temporal mixing. After query-by-query adaptive scaling and posterior statistical shaping, attention aggregation features are generated. The joint normalization calculation with the mask includes performing dilation and smoothing operations on the mask according to a preset dilation radius and smoothing kernel size, and performing mask occlusion processing on the probability space of the joint normalization calculation based on the dilated and smoothed mask. The current latent variable feature of the current denoising time step is set as... The linear projection weight matrices are respectively , , The current latent variable features are mapped using a linear projection weight matrix to generate query features. Current key features Features of the current value The content query features corresponding to the current time step are obtained based on the independently executed reverse denoising content branch. A preset fusion coefficient is set as follows. Content query characteristics are The content query features and query features are linearly fused according to a preset fusion coefficient, thus fusing the query features. The calculation formula is: ; Based on fusion query features and style key features obtained from retrieval 1. Integrate query features with current key features Calculate the corresponding first and second unnormalized attention scores respectively. 
The first unnormalized attention score is numerically scaled using a preset style temperature adjustment coefficient and the square root of the key feature dimension; the formula for the scaled first unnormalized attention score, which involves the style temperature adjustment coefficient, the feature dimension, and the matrix transpose operation, is: [formula missing]. A continuous gate is constructed from the statistical difference between the log-sum-exp values of the first and second unnormalized attention scores. The total number of attention heads is set to [value missing]. Using the current attention head index and the log-sum-exp function, the multi-head average statistical difference is calculated as: [formula missing]. Query-by-query adaptive scaling factors are generated by mapping the multi-head average statistical difference. With a preset maximum scaling factor limit and a nonlinear activation mapping function g, the query-by-query adaptive scaling factor is calculated as: [formula missing]. Using the query-by-query adaptive scaling factor, position-adaptive scaling is performed only on the first unnormalized attention score. A Top-K sparsity constraint that preserves the largest values is applied to the scaled output, which is then concatenated with the second unnormalized attention score along the key dimension. Joint normalization is then performed within the same probability space, strictly combined with the aforementioned mask occlusion processing, to obtain the joint attention distribution and the preliminary aggregated output. Local statistical anchors are generated from the joint attention distribution; their derivation uses an element-wise multiplication operation.
With a numerical stability constant, the local mean, local second moment, and local standard deviation are derived as: [formula missing]. The fused query features are standardized along the query dimension to obtain standardized query features. The local mean, the local standard deviation, and the standardized query features are combined to construct the affine offset, calculated as: [formula missing]. Gated residual offset injection is then applied to the preliminary aggregated output using the affine offset. With a preset learnable injection strength control parameter and the hyperbolic tangent activation function, the final attention aggregation feature is calculated as: [formula missing]. This step achieves a competitive fusion of style and content features within the target region.
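Because the formulas for this step are not preserved, the following NumPy sketch only illustrates the overall flow of the competitive attention: temperature-scaled style logits, a log-sum-exp gate averaged over heads, Top-K sparsification, and a single softmax over the concatenated style and content keys. Several details are assumptions rather than the patented formulas: the temperature is applied multiplicatively, the gate takes the content-minus-style log-sum-exp difference, and a sigmoid maps that difference into the range from 1 to the maximum scaling limit. Mask occlusion and posterior statistical shaping are omitted, and all names are hypothetical.

```python
import numpy as np

def logsumexp(x, axis=-1):
    """Numerically stable log-sum-exp along one axis."""
    m = np.max(x, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))).squeeze(axis)

def competitive_attention(q, k_cur, v_cur, k_sty, v_sty,
                          tau=1.1, a_max=2.5, top_k=2):
    """q, k_cur, v_cur: (H, N, d); k_sty, v_sty: (H, M, d)."""
    d = q.shape[-1]
    # First score (style branch), with an assumed multiplicative temperature.
    s_sty = tau * (q @ k_sty.transpose(0, 2, 1)) / np.sqrt(d)
    # Second score (current/content branch).
    s_cur = (q @ k_cur.transpose(0, 2, 1)) / np.sqrt(d)
    # Head-averaged log-sum-exp difference: one value per query position.
    delta = np.mean(logsumexp(s_cur) - logsumexp(s_sty), axis=0)
    # Assumed gate: sigmoid mapped into [1, a_max].
    alpha = 1.0 + (a_max - 1.0) / (1.0 + np.exp(-delta))
    # Position-adaptive scaling is applied to the style logits only.
    s_sty = s_sty * alpha[None, :, None]
    # Top-K: keep the k largest style logits per query, drop the rest.
    kth = np.sort(s_sty, axis=-1)[..., -top_k][..., None]
    s_sty = np.where(s_sty >= kth, s_sty, -np.inf)
    # One softmax over style and content keys: they compete for probability mass.
    joint = np.concatenate([s_sty, s_cur], axis=-1)
    joint = joint - joint.max(axis=-1, keepdims=True)
    p = np.exp(joint)
    p = p / p.sum(axis=-1, keepdims=True)
    out = p @ np.concatenate([v_sty, v_cur], axis=1)
    return p, out, alpha

# Toy shapes: 2 heads, 4 query positions, 6 cached style keys, dimension 8.
rng = np.random.default_rng(0)
H, N, M, d = 2, 4, 6, 8
q = rng.normal(size=(H, N, d))
k_cur, v_cur = rng.normal(size=(H, N, d)), rng.normal(size=(H, N, d))
k_sty, v_sty = rng.normal(size=(H, M, d)), rng.normal(size=(H, M, d))
p, out, alpha = competitive_attention(q, k_cur, v_cur, k_sty, v_sty)
```

Normalizing the two branches in one softmax, instead of running two separate attentions and blending their outputs, is what lets the per-query scaling factor shift probability mass between style and content at each position.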

[0024] S4. Based on the attention aggregation features, the current preliminary denoised latent variable is calculated. The image embedding of the current preliminary denoised latent variable is extracted through the visual language model, and the semantic guidance loss is calculated by combining the semantic anchors constructed from the text prompts. The preliminary denoised latent variable is iteratively updated along the loss gradient to obtain the semantic correction latent variable, which serves as the input of the next time step. Furthermore, in S4, obtaining the semantic correction latent variable as the input of the next time step specifically includes the following steps: based on the attention aggregation features, the conditional noise prediction and the unconditional noise prediction for the current time step are obtained; the noise estimate is synthesized through a classifier-free guidance mechanism to obtain the preliminary denoised latent variable; the preliminary denoised latent variable is decoded and input into the image encoder contained in the visual language model to obtain the corresponding normalized image embedding; using the text encoder contained in the visual language model, the preset content prompts and style prompts in the text prompts are encoded to form a set of positive semantic anchors, and the preset negative prompts in the text prompts are encoded to form a set of negative semantic anchors; the inner product similarity between the normalized image embedding and each semantic anchor in the positive and negative semantic anchor sets is calculated to construct the semantic guidance loss; the preliminary denoised latent variable is updated for a fixed number of steps along the negative gradient direction of the semantic guidance loss, and the semantic correction latent variable is output.

[0025] Specifically, the attention aggregation features generated in the preceding steps are obtained. Based on these attention aggregation features, the conditional noise prediction and the unconditional noise prediction for the current time step are obtained, and the noise estimate is synthesized using a classifier-free guidance mechanism. With a preset guidance scaling coefficient, the synthesized noise estimate is calculated as: [formula missing]. The synthesized noise estimate is used to remove the predicted noise component, and the preliminary denoised latent variable for the current time step is calculated. The preliminary denoised latent variable is decoded and input into the image encoder contained in the visual language model to obtain the corresponding normalized image embedding. Using the decoding operation Dec, the image encoder of the visual language model, and the L2 norm of a vector, the image embedding is extracted and norm-normalized as: [formula missing]. Using the text encoder contained in the visual language model, the preset content prompts and style prompts in the text prompts are encoded to form a set of positive semantic anchors, and the preset negative prompts in the text prompts are encoded to form a set of negative semantic anchors. The inner product similarity between the normalized image embedding and each semantic anchor in the positive and negative semantic anchor sets is calculated to construct the semantic guidance loss; the formula for the inner product similarity and the loss is: [formula missing]. The symbols in the loss formula denote, respectively, the vector dot product operation, a semantic anchor in the positive semantic anchor set, and a semantic anchor in the negative semantic anchor set.
The preliminary denoised latent variable is updated a fixed number of times along the negative gradient direction of the semantic guidance loss. With a preset fixed number of update steps and an update step size, and taking the preliminary denoised latent variable as the initial state, a single iteration along the negative gradient direction is expressed as: [formula missing]. In the update formula, the iteration index ranges from 1 to the preset fixed number of update steps, and the remaining symbols denote the latent variable state of the previous iteration step and the gradient of the semantic guidance loss with respect to that state. The update formula is executed iteratively until the fixed number of update steps is completed. The resulting latent variable serves as the semantic correction latent variable and is input to the next time step. This step corrects semantic deviations that arise during the generation process.
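The two computations in this step can be sketched under simplifying assumptions. The classifier-free guidance combination below is the standard form. The semantic loss is an assumed form (similarity to negative anchors minus similarity to positive anchors), since the original formula is not preserved. A random linear map `W` stands in for the decoder followed by the image encoder purely so that the gradient can be written in closed form; a real implementation would differentiate through the actual decoder and visual language model. The step count and step size are placeholder values.

```python
import numpy as np

def cfg_noise(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: extrapolate from the unconditional prediction."""
    return eps_uncond + scale * (eps_cond - eps_uncond)

def semantic_loss_and_grad(z, W, pos, neg):
    """Assumed loss: sum of <e, a_neg> minus sum of <e, a_pos>,
    where e = Wz / ||Wz|| is the normalized (toy) image embedding."""
    u = W @ z
    n = np.linalg.norm(u)
    e = u / n
    c = neg.sum(axis=0) - pos.sum(axis=0)
    loss = float(e @ c)
    # Gradient of <u/||u||, c> w.r.t. u is (c - <e, c> e) / ||u||; chain through W.
    grad = W.T @ ((c - (e @ c) * e) / n)
    return loss, grad

def semantic_refine(z0, W, pos, neg, steps=5, lr=0.01):
    """Fixed number of gradient-descent updates starting from z0."""
    z = z0.copy()
    for _ in range(steps):
        _, g = semantic_loss_and_grad(z, W, pos, neg)
        z = z - lr * g
    return z

def unit(a):
    """Normalize rows to unit L2 norm."""
    return a / np.linalg.norm(a, axis=-1, keepdims=True)

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))          # hypothetical stand-in for Dec + image encoder
z0 = rng.normal(size=16)              # preliminary denoised latent (flattened toy)
pos = unit(rng.normal(size=(2, 8)))   # e.g. content and style prompt anchors
neg = unit(rng.normal(size=(1, 8)))   # e.g. negative prompt anchor

loss_before, _ = semantic_loss_and_grad(z0, W, pos, neg)
z_corr = semantic_refine(z0, W, pos, neg)
loss_after, _ = semantic_loss_and_grad(z_corr, W, pos, neg)
```

Keeping the number of updates fixed and small bounds the extra cost per time step, which matches the description of the correction as a lightweight posterior refinement.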

[0026] S5. Repeat steps S3 and S4 until the preset denoising time step is completed. Input the latent variables after denoising into the decoder to reconstruct the image and obtain a stylized image.

[0027] Furthermore, in S5, repeating steps S3 and S4 until all preset denoising time steps are completed and inputting the denoised latent variable into the decoder for image reconstruction specifically includes the following steps: following the reverse denoising discrete time step sequence formed by the preset denoising time steps, the semantic correction latent variable output at the current time step is used as the input of the next time step, and the process is iterated until the time step reaches zero, at which point the denoised latent variable is output; the denoised latent variable is input into the decoder; the decoder maps the latent variable back to pixel space and outputs the stylized image.

[0028] Specifically, the semantic correction latent variable output by the preceding steps is obtained. Steps S3 and S4 are executed iteratively until all preset denoising time steps are completed, and the denoised latent variable is input into the decoder for image reconstruction to obtain the stylized image. Following the preset reverse denoising discrete time step sequence, the semantic correction latent variable output at the current time step is used as the input for the next time step. The sequence starts from a preset maximum time step and is traversed by a time step index. In each reverse denoising loop, the semantic correction latent variable output at the current time step is used as the input tensor of the following time step, and the aforementioned feature aggregation and correction calculations are performed iteratively until the time step index is decremented to zero, at which point the reverse denoising loop ends and the denoised latent variable is output. The denoised latent variable is input into the preset decoder, which maps it back to pixel space and outputs the stylized image; the formula for mapping back to pixel space is: [formula missing]. This step closes the reverse denoising loop through explicit time-index decrementing logic and outputs the stylized reconstructed image.
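The control flow of S5 can be summarized in a short Python sketch. The three callables are hypothetical placeholders for the S3 attention step with noise removal, the S4 semantic correction, and the decoder; the toy lambdas below exist only to make the loop executable.

```python
def reverse_denoise(z_init, timesteps, attention_step, semantic_correct, decode):
    """Run S3 then S4 at every time step, then decode once at the end."""
    z = z_init
    visited = []
    for t in timesteps:                  # e.g. T, T-1, ..., 1
        z_pre = attention_step(z, t)     # S3: preliminary denoised latent
        z = semantic_correct(z_pre, t)   # S4: semantic correction latent
        visited.append(t)
    return decode(z), visited

# Toy stand-ins so the control flow can be executed end to end.
image, visited = reverse_denoise(
    z_init=8.0,
    timesteps=[3, 2, 1],
    attention_step=lambda z, t: 0.5 * z,
    semantic_correct=lambda z, t: z + 1.0,
    decode=lambda z: 2.0 * z,
)
```

The point of the sketch is the data dependency: the corrected latent from one time step is the only state carried into the next, and decoding happens once after the last step.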

[0029] Experiments. Experimental details: the implementation is based on a pre-trained latent diffusion model, with SD-1.5 as the backbone network. All experiments were performed on a single NVIDIA RTX 5880 GPU. During inference, the DDIM inversion process maps the content and style images to latent variable trajectories up to time step T, with the number of inversion steps fixed at 50. The reverse denoising and inversion phases share the same sampler and step count configuration. A hierarchical distribution calibration mechanism is used, with its two weights set to 0.35 and 0.25. The hyperparameters of the competitive attention mechanism are set as follows: query fusion coefficient 0.8, style branch temperature 1.1, upper bound of the style logit 2.5, Top-k = 128, statistical modulation injection strength η = 0.2, and style buffer scaling 1.2. In local style transfer, the mask dilation radius is set to 1 and the smoothing kernel size is set to 3.

[0030] Dataset: The style data is jointly constructed from datasets from WikiArt and ArtBank, covering eight art movements: Classicism (34 images), Romanticism (167 images), Cubism (15 images), Renaissance (112 images), Realism (77 images), Symbolism (314 images), Impressionism (253 images), and traditional Chinese painting (127 images). The content image dataset is constructed from 200 samples selected from the COCO dataset, covering typical semantic scenes such as landscapes, robots, people, animals, cars, streets, and architecture. Data and evaluation protocols will be provided in an anonymous repository.

[0031] Please refer to Figure 8, which presents quantitative comparisons between different models. ↓ indicates that lower values are better, and ↑ indicates that higher values are better. The best result is highlighted in red, and the second-best result is underlined in red. The resolution label 256 denotes 256×256 pixels, and 768 denotes 768×768 pixels.

[0032] Quantitative comparison: Figure 8 indicates that our method exhibits more stable overall performance across metrics related to generation quality, content structure preservation, and style consistency. In terms of overall quality, the ArtFID and FID results show that the generated distribution is closer to the distribution of real artistic images while maintaining better global visual consistency; compared with some diffusion-based style injection methods, our method is less prone to tone drift or texture noise accumulation. Regarding structure preservation, the lower LPIPS indicates that the results are perceptually closer to the content input, suggesting reduced structural perturbation, whereas strategies with stronger style constraints or more aggressive injection tend to cause more obvious deformation in the main outline and key areas. In terms of style consistency, StyleLoss and CFSD are both best or in the top tier, indicating that the method not only aligns with global style statistics but also preserves fine-grained brushstrokes and local texture organization more stably, thus maintaining more consistent style expression during cross-category content transfer. Furthermore, the relative advantage in AestheticsScore indicates that the generated results also agree better with perceptual evaluation in overall visual appeal and aesthetic consistency. These advantages remain consistent across the resolution settings: even at high resolution, the method still generates clearer texture details and more coherent brushstroke structures, demonstrating stable style expression and structural fidelity across resolutions.

[0033] Please refer to Figure 3, which presents a qualitative comparison: under eight representative art styles, the results of the proposed method are compared with those of several mainstream style transfer methods.

[0034] Qualitative comparison: Figure 3 shows visual comparisons under eight typical art styles. Overall, the failure modes of the compared methods fall into three categories. The first category is insufficient style expression or deviation from the reference style. AesFA, ArtFlow, and CAPVSTNet tend to produce weakly stylized results across multiple styles. In Impressionism and Romanticism, brushstroke direction and color block rhythm are not clear enough. In Cubism, geometric cutting and block composition fail to form a consistent stylistic grammar. In Chinese painting, ink wash rendering, white space control, and layer progression are not stable enough. The second category is texture collapse and detail loss caused by excessive smoothing. CAST and StyleID are more prone to edge diffusion and texture smoothing in the line boundaries and light and shadow levels of Classicism and in the local material depiction of Renaissance, which simultaneously damages the main outline, local structure, and brushstroke hierarchy. The third category is structural disturbance and artifact accumulation caused by excessive style injection. DiffuseIT is more prone to high-saturation texture coverage and local abnormal enhancement in its Symbolism and Realism results, with more obvious distortion of details in facial or main subject areas and an increased risk of reduced semantic consistency. MambaST and S2WAT are more likely to exhibit localized texture instability, blocky noise, or inconsistent pattern stacking, with insufficient coupling between style textures and content structure. StyleShot and StyleTR 2 come closer to usable results in overall visual appeal, but their stylistic accuracy and local consistency still fluctuate; in styles such as Symbolism and Cubism, color system shifts or insufficient convergence of block organization can be observed. Compared with the methods above, the proposed method exhibits more consistent transfer behavior across different styles.
In Impressionism, brushstrokes and color block rhythms are more coherent; in Classicism and Renaissance, structural boundaries and detail levels are more clearly preserved; in Cubism, blocky and geometric compositions are easier to establish; and in Chinese Painting, the ink density, white space relationships, and perspective are more harmonious. Key outlines of different content categories remain more complete; the morphological stability of architectural skylines, facial features, bird feather edges, and mountain texture layers is higher; stylistic texture injection has less interference with the main semantics; and a more reasonable balance is achieved between stylistic expression and structural fidelity. This set of phenomena and methodological mechanisms form a closed loop. Distribution calibration in the initialization phase reduces statistical mismatch, competitive attention allocation and position-level gating in the denoising phase improve spatial consistency and enhance texture representation, semantic posterior regularization suppresses semantic drift in the iterative process, and subsequent ablation can verify the contribution of the above steps to visual differences.

[0035] Ablation experiments: in the ablation experiments, "w/o" indicates removal of the corresponding mechanism, and "-/-" indicates the base model setting without any additional mechanism.

[0036] Please refer to Figure 4 for a qualitative ablation comparison. The results of the complete model are compared with those obtained without the distribution calibration mechanism (HSDC), without the competitive attention mechanism (APEM), without the semantic correction mechanism (SERM), and with the base setting (-/-).

[0037] Figure 9 compares the quantitative ablation results: ArtFID, FID, LPIPS, and CFSD under different mechanism configurations; red values indicate the best result for each metric.

[0038] Analysis of the distribution calibration mechanism: Figure 9 and Figure 4 together verify the necessity of the distribution calibration mechanism in diffusion-based style transfer. After the distribution calibration mechanism is removed, most metrics degrade consistently, and the generated results are more likely to show global tone instability, insufficient texture density, and loose local contrast relationships. In the comparison in Figure 4, the background color gamut and light and dark tones converge with more difficulty, some areas appear grayish with thinned layers, the coverage of stylistic brushstrokes relies more on compensation by the subsequent mechanisms, and the coupling between structure and style tends to be fragile. When the sampling starting point lacks a stable statistical base, the differences between the marginal distributions of the content and style latent variables are carried directly into the reverse denoising iteration. Low-order offsets accumulate and are amplified over multiple updates, while high-order shape differences manifest as scale drift in the texture response and inconsistencies in local brushstroke organization. The distribution calibration mechanism performs hierarchical statistical alignment during the initialization phase: it first aligns the higher-order distribution shape while maintaining the hierarchical structure of the content, and then locks the mean and variance to reduce the room for drift in overall brightness and contrast, so that the subsequent attention injection competes under conditions closer to a single statistical domain.
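The hierarchical alignment can be illustrated with a minimal NumPy sketch that follows the three stages named in the method: rank-based quantile alignment with a first intensity coefficient, mean and standard deviation matching to the style statistics, and a second quantile approach with a second intensity coefficient. Per-channel processing, equal spatial sizes for the two latents, the mapping of 0.35 and 0.25 to the first and second coefficients, and all function names are assumptions made for illustration.

```python
import numpy as np

def quantile_align(x, ref, strength):
    """Interpolate each value of x toward the value of ref with the same rank."""
    c = x.shape[0]
    xf = x.reshape(c, -1)
    ref_sorted = np.sort(ref.reshape(c, -1), axis=1)
    # Rank of every element of x within its own channel.
    ranks = np.argsort(np.argsort(xf, axis=1), axis=1)
    matched = np.take_along_axis(ref_sorted, ranks, axis=1)
    return ((1.0 - strength) * xf + strength * matched).reshape(x.shape)

def moment_match(x, ref, eps=1e-6):
    """Scale and shift x so its per-channel mean and std follow ref."""
    c = x.shape[0]
    xf, rf = x.reshape(c, -1), ref.reshape(c, -1)
    mu_x, sd_x = xf.mean(axis=1, keepdims=True), xf.std(axis=1, keepdims=True)
    mu_r, sd_r = rf.mean(axis=1, keepdims=True), rf.std(axis=1, keepdims=True)
    return ((xf - mu_x) / (sd_x + eps) * sd_r + mu_r).reshape(x.shape)

def calibrate(z_content, z_style, w1=0.35, w2=0.25):
    """Quantile alignment, then moment matching, then a second quantile pass."""
    z = quantile_align(z_content, z_style, w1)
    z = moment_match(z, z_style)
    return quantile_align(z, z_style, w2)

rng = np.random.default_rng(0)
z_c = rng.normal(size=(4, 8, 8))                      # toy content latent (C, H, W)
z_s = rng.normal(loc=0.5, scale=2.0, size=(4, 8, 8))  # toy style latent
z_init = calibrate(z_c, z_s)
```

Because every stage is monotone within a channel, the spatial rank order of the content latent is preserved while its value distribution moves toward the style latent, which is one way to read "aligning the distribution shape while maintaining the hierarchical structure of the content".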

[0039] Figure 10 shows the parameter ablation results for the two distribution calibration weights. The first weight dominates the alignment intensity of the early global distribution, affecting the convergence speed of color tone, the contrast relationships, and cross-regional consistency. When it is too weak, the domain bias remains more pronounced, subsequent denoising has to inject style on an unstable base, and tone drift and texture sparseness are more likely to occur. When it is too strong, the distribution calibration approaches a forced transfer, and the natural local contrast on the content side may be compressed, reducing the dominance of structural details. The second weight is more sensitive to the tail of the distribution and to local statistical residuals, and provides fine-grained compensation. When it is too large, local statistical perturbations are more easily amplified, causing uneven local appearance and discontinuous texture response. A gentler value of one weight paired with a moderate value of the other [symbols missing] achieves a more reasonable balance between color stability, structural fidelity, and texture fullness.

[0040] Figure 10 is a parameter ablation comparison for the distribution calibration mechanism: the two distribution calibration weights are varied to compare overall performance under different intensity configurations.

[0041] Component analysis of the competitive attention mechanism: Figure 11 reports the component and parameter ablation of the competitive attention mechanism. "w/o a" removes position-by-position gating and "w/o b" removes statistical shaping; the other rows compare performance under different settings of four hyperparameters [symbols missing] against the default configuration. Our method uses the default configuration: style logit upper bound 2.5, a gating parameter [symbol missing] estimated by gating statistical analysis, Top-K = 128, and injection strength η = 0.2.

[0042] Please refer to Figure 5, which presents a component ablation of the competitive attention mechanism. The results of the complete competitive attention mechanism, w/o a, and w/o b show the contributions of position-wise gating and statistical shaping to structural stability and local texture consistency.

[0043] Please refer to Figure 6, which presents a parameter ablation of the competitive attention mechanism: qualitative results under different settings of three hyperparameters [symbols missing].

[0044] As shown in Figure 5, we remove the two core components of the competitive attention mechanism one at a time to separate their contributions to structural stability and texture expression. When the query-adaptive style logit amplification is removed (w/o a), style injection degenerates into a constant global-scale enhancement, the joint competition loses spatial selectivity, structurally sensitive areas are more easily disturbed by style energy, the consistency of the main outline and local geometry degrades significantly, and fine-grained brushstrokes struggle to land stably in the correct positions. When the output-side posterior statistical shaping is removed (w/o b), the overall structure can still be maintained, but consistency between local color placement and texture statistics is harder to achieve, brushstroke density and color block organization become spatially uneven, and style fusion behaves more like local attachment, lacking unified organization across regions. The quantitative results in Figure 11 are consistent with the visual differences in Figure 5, verifying the complementary relationship between the two components in structural protection and texture refinement.

[0045] Figure 6 and Figure 11 further delineate the controllable range of the competitive attention mechanism. The style logit upper bound specifies the maximum advantage of the style branch in the joint normalization competition: when it is too low, the style response is suppressed and high-frequency textures are difficult to establish; when it is too high, the style energy is over-concentrated, and the risk of local texture stacking and artifact accumulation increases. The gating parameter [symbol missing] determines the distribution pattern of position-by-position gating: an excessively large value amplifies local response fluctuations, while an excessively small value weakens the structural protection provided by positional selectivity. η controls the magnitude of statistical residual compensation: an excessively large value introduces additional perturbations, while an excessively small value makes it difficult to maintain consistent convergence of local color and texture. Top-K is a trade-off between attention focus and information coverage: an excessively small value results in insufficient style token coverage, while an excessively large value leads to a dispersed competitive distribution and a diluted style response. Each control term has a stable effect on style intensity and structural fidelity, providing a basis for interpretable adjustment between style expression and content stability.

[0046] Analysis of the semantic correction mechanism: the item-by-item removal results in Figure 9 show that the semantic correction mechanism plays a crucial role in sampling stability and significantly suppresses semantic drift. When the semantic correction mechanism is absent, most quality and structure-related metrics degrade consistently, and both the frequency and the magnitude of semantic shifts increase. The visual comparison in Figure 4 further confirms this trend. Without the semantic correction mechanism, the main outline and internal details are more prone to slight deformation, and the local texture response exhibits diffuse stacking that lacks semantic constraints. The coupling between content semantics and style texture tends to be unstable, and the risk of reduced detail recognizability increases. The semantic correction mechanism essentially provides a posterior semantic constraint: by progressively correcting the semantic consistency of the generated results, it suppresses semantic shifts during step-by-step denoising, thereby maintaining structural fidelity and object recognizability while enhancing style expression.

[0047] Please refer to Figure 7, which presents a parameter comparison for the semantic correction mechanism: qualitative results under different settings of its key hyperparameter [symbol missing].

[0048] The sensitivity analysis of the key hyperparameter of the semantic correction mechanism in Figure 7 reveals a clear trade-off between the strength of the semantic constraint and the strength of style injection. Weak semantic correction is insufficient to offset the cumulative bias caused by style injection, making local areas more prone to uneven texture aggregation and unstable details; excessive semantic correction increases reliance on the semantic anchors, causing local brushstrokes to become homogenized and even introducing unnecessary overcorrection artifacts. Under a moderate semantic correction setting, semantic consistency and style intensity are best balanced, resulting in more stable visual outcomes. Combining this with the initial calibration provided by the distribution calibration mechanism and the attention injection strategy of the competitive attention mechanism further enhances the stability of the generation process and the consistency of the results.

[0049] Conclusion: this paper proposes a training-free latent diffusion style transfer framework, STYLEVAULT, to simultaneously improve fine-grained style representation and structural fidelity in a reference-image-driven setting. The method reduces the statistical mismatch between content and style latent variables through hierarchical distribution calibration during the sampling initialization phase. In the denoising phase, it achieves position-adaptive style injection through competitive attention with a shared normalization space, and introduces CLIP-based posterior semantic regularization to constrain step-by-step semantic drift. Multi-style, multi-resolution, and local transfer experiments show that the framework achieves consistent improvements in style consistency, structural preservation, and overall quality metrics, and the user study results agree with the quantitative conclusions. Ablation experiments demonstrate that initial statistical domain alignment, position-adaptive injection, and semantic posterior correction each contribute to the stable gains and complement each other when combined. The current inference process includes inversion and lightweight posterior refinement, which incurs some additional computational overhead; future work will focus on further improving inference efficiency through more efficient inversion/sampling and adaptive scheduling.

[0050] Finally, it should be noted that the above description is only a preferred embodiment of the present invention and is not intended to limit the present invention. Although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions described in the foregoing embodiments or make equivalent substitutions for some of the technical features. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the protection scope of the present invention.

Claims

1. A competitive attention mechanism image style transfer method based on distribution calibration, characterized in that, Includes the following steps: S1. Obtain the content image, style image, and mask for a specified region. Extract the initial content latent variables and initial style latent variables of the content image and the style image at a preset time step through the inverse process of encoder and diffusion model, respectively. At the same time, extract and cache the attention key value features of the style image at each time step of the inverse process to construct a style key value feature set. S2. Perform high-order distribution alignment based on quantiles and low-order statistical matching based on mean and variance on the initial content latent variables and the initial style latent variables in sequence to generate initial fusion latent variables; S3. Using the initial fusion latent variable as the initial input for reverse denoising, in the predefined self-attention layer set of each denoising time step, query features and current key features are generated according to the current latent variable features. Combined with the style key value features of the corresponding time step retrieved from the style key value feature set, joint normalization calculation is performed in the shared space using the mask as a weight condition for the mixing of space and time step. After query-by-query adaptive scaling and posterior statistical shaping, attention aggregation features are generated. S4. Based on the attention aggregation features, calculate the current preliminary denoising latent variable, extract the image embedding of the current preliminary denoising latent variable through the visual language model, and calculate the semantic guidance loss by combining the semantic anchors constructed by the text prompts. 
Iteratively update the preliminary denoising latent variable along the loss gradient to obtain the semantic correction latent variable as the input of the next time step. S5. Repeat steps S3 and S4 until the preset denoising time step is completed. Input the latent variables after denoising into the decoder to reconstruct the image and obtain a stylized image.

2. The competitive attention mechanism image style transfer method based on distribution calibration according to claim 1, characterized in that, in S1, the construction of the style key-value feature set specifically includes the following steps: the content image and the style image are respectively input into the encoder and mapped to latent space features, wherein the encoder is a pre-trained variational autoencoder; the inverse process of the diffusion model is performed on the latent space features to obtain the initial content latent variables and the initial style latent variables at a preset time step, wherein the inverse process of the diffusion model is a one-way denoising diffusion implicit model (DDIM) inversion process; at each discrete time step of the inverse process, attention key-value features are extracted from the predefined set of self-attention layers of the diffusion model and cached to form a style key-value feature set indexed by time step.

3. The image style transfer method based on a distribution calibration competitive attention mechanism according to claim 1, characterized in that, in S2, the generation of the initial fusion latent variables specifically includes the following steps: the initial content latent variables and the initial style latent variables are flattened along the spatial dimension, and a quantile correspondence between them is established based on the rank structure of the flattened features; a linear interpolation update is performed between the marginal distributions of the initial content latent variables and the initial style latent variables using a first intensity coefficient; the mean and standard deviation of the linearly interpolated features are calculated along the spatial dimension, and the linearly interpolated features are scaled and translated based on the mean and standard deviation of the initial style latent variables; quantile approximation is performed again on the scaled and translated features using a second intensity coefficient, and the initial fusion latent variables are output.
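
As a hedged illustration of the three steps in claim 3, the sketch below operates on a single flattened channel represented as a list of floats. The function names and the default values of `alpha1` and `alpha2` (standing in for the first and second intensity coefficients) are assumptions made for this example; the claim does not fix them.

```python
import statistics

def quantile_align(content, style, strength):
    """Pull each content value toward the style value at the same rank."""
    n = len(content)
    order = sorted(range(n), key=lambda i: content[i])  # content indices by rank
    style_sorted = sorted(style)
    out = list(content)
    for rank, idx in enumerate(order):
        # Map the content rank onto the style quantile grid (lengths may differ).
        pos = rank * (len(style_sorted) - 1) / max(n - 1, 1)
        lo = int(pos)
        hi = min(lo + 1, len(style_sorted) - 1)
        target = style_sorted[lo] + (pos - lo) * (style_sorted[hi] - style_sorted[lo])
        # Linear interpolation between the content value and its style quantile.
        out[idx] = (1.0 - strength) * content[idx] + strength * target
    return out

def match_mean_std(x, style, eps=1e-8):
    """Scale and translate x so its mean and std follow those of style."""
    mu_x, sd_x = statistics.fmean(x), statistics.pstdev(x)
    mu_s, sd_s = statistics.fmean(style), statistics.pstdev(style)
    return [(v - mu_x) / (sd_x + eps) * sd_s + mu_s for v in x]

def calibrate(content, style, alpha1=0.8, alpha2=0.3):
    """Quantile alignment, then mean/std matching, then a second quantile pass."""
    z = quantile_align(content, style, alpha1)
    z = match_mean_std(z, style)
    return quantile_align(z, style, alpha2)
```

With `strength` set to 1 the output takes exactly the style quantiles in the content's rank order, and with 0 the content is returned unchanged, which is why the two coefficients act as intensity controls.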

4. The image style transfer method based on a distribution calibration competitive attention mechanism according to claim 1, characterized in that, in S3, the generation of the attention aggregation features specifically includes the following steps: content query features corresponding to the current time step are obtained from an independently executed reverse denoising content branch; the content query features and the query features are linearly fused according to a preset fusion coefficient to obtain fused query features; a first unnormalized attention score is calculated from the fused query features and the style key features, and a second unnormalized attention score is calculated from the fused query features and the current key features, wherein a preset style temperature adjustment coefficient is used to numerically scale the first unnormalized attention score; a continuous gate is constructed based on the statistical difference between the log-sum-exp values of the first unnormalized attention score and the second unnormalized attention score, and a query-by-query adaptive scaling coefficient is generated from the gate; using the query-by-query adaptive scaling coefficient, position-adaptive scaling is performed only on the first unnormalized attention score; a Top-K sparsity constraint is applied to the scaled first unnormalized attention score, which is then concatenated with the second unnormalized attention score along the key dimension; Softmax joint normalization is performed in the same probability space to obtain a joint attention distribution and a preliminary aggregated output; local statistical anchor points are generated based on the joint attention distribution, and affine offsets are constructed by combining them with the standardized fused query features; gated residual offset injection is then performed on the preliminary aggregated output to generate the attention aggregation features.
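
The scoring, gating, Top-K, and joint Softmax steps of claim 4 can be sketched for a single query as follows. This is a minimal sketch under stated assumptions: the claim does not give the exact gate formula, so the sigmoid of the log-sum-exp gap and the resulting coefficient range are illustrative choices, as are the parameter names `style_temp`, `top_k`, and `gate_slope`. The posterior statistical shaping (local statistical anchors, affine offsets, gated residual injection) is omitted.

```python
import math

def logsumexp(xs):
    m = max(xs)
    return m + math.log(sum(math.exp(x - m) for x in xs))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def competitive_attention(q, k_style, v_style, k_self, v_self,
                          style_temp=1.2, top_k=2, gate_slope=1.0):
    """Joint softmax over style keys and current keys for one fused query q."""
    scale = 1.0 / math.sqrt(len(q))
    # First score: query vs. style keys, with the style temperature applied.
    s_style = [style_temp * scale * dot(q, k) for k in k_style]
    # Second score: query vs. the current (content) keys.
    s_self = [scale * dot(q, k) for k in k_self]
    # Assumed gate: sigmoid of the log-sum-exp gap, mapped to (0.5, 1.5).
    gap = logsumexp(s_self) - logsumexp(s_style)
    gamma = 0.5 + 1.0 / (1.0 + math.exp(-gate_slope * gap))
    # Per-query scaling is applied to the style scores only.
    s_style = [gamma * s for s in s_style]
    # Top-K sparsity: keep the K largest style scores, mask out the rest.
    ranked = sorted(range(len(s_style)), key=lambda i: s_style[i], reverse=True)
    keep = set(ranked[:top_k])
    s_style = [s if i in keep else -math.inf for i, s in enumerate(s_style)]
    # Concatenate along the key dimension and normalize in one probability space.
    logits = s_style + s_self
    z = logsumexp(logits)
    probs = [math.exp(l - z) for l in logits]
    values = v_style + v_self
    out = [sum(p * v[j] for p, v in zip(probs, values))
           for j in range(len(values[0]))]
    return probs, out
```

Because both score groups share one Softmax, style keys and content keys compete for the same probability mass instead of being normalized separately.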

5. The image style transfer method based on a distribution calibration competitive attention mechanism according to claim 1, characterized in that, in S3, the joint normalization calculation that uses the mask as a weighting condition for spatial and time-step mixing specifically includes the following steps: dilation and smoothing operations are performed on the mask based on a preset dilation radius and a preset smoothing kernel size; based on the mask after the dilation and smoothing operations, a masking process is performed on the probability space obtained by the joint normalization calculation.
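
One possible reading of claim 5 is sketched below on a small 2D mask stored as nested lists. The square dilation window, the box smoothing filter, and the way the soft mask weight rescales the style part of a joint attention distribution (`mask_style_probs`) are all assumptions made for illustration; the claim only names the dilation radius and the smoothing kernel size.

```python
def dilate(mask, radius):
    """Grow the masked region with a square max filter of the given radius."""
    h, w = len(mask), len(mask[0])
    return [[max(mask[yy][xx]
                 for yy in range(max(0, y - radius), min(h, y + radius + 1))
                 for xx in range(max(0, x - radius), min(w, x + radius + 1)))
             for x in range(w)]
            for y in range(h)]

def box_smooth(mask, kernel):
    """Soften mask edges with a box filter (window clipped at the border)."""
    h, w = len(mask), len(mask[0])
    r = kernel // 2
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            window = [mask[yy][xx]
                      for yy in range(max(0, y - r), min(h, y + r + 1))
                      for xx in range(max(0, x - r), min(w, x + r + 1))]
            row.append(sum(window) / len(window))
        out.append(row)
    return out

def soft_mask(mask, radius=1, kernel=3):
    return box_smooth(dilate(mask, radius), kernel)

def mask_style_probs(probs, n_style, weight):
    """Scale the first n_style probabilities by the mask weight, then renormalize.

    Assumes at least one non-style key keeps nonzero probability.
    """
    masked = [p * weight for p in probs[:n_style]] + list(probs[n_style:])
    total = sum(masked)
    return [p / total for p in masked]
```

A weight of 1 leaves the joint distribution untouched, while a weight of 0 removes the style keys for that query position, so the soft mask gives a gradual transition at region boundaries.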

6. The image style transfer method based on a distribution calibration competitive attention mechanism according to claim 1, characterized in that, in S4, obtaining the semantic correction latent variable as the input of the next time step specifically includes the following steps: based on the attention aggregation features, a conditional noise prediction and an unconditional noise prediction for the current time step are obtained, and a noise estimate is synthesized through a classifier-free guidance mechanism to obtain the preliminary denoising latent variable; the preliminary denoising latent variable is decoded and input into the image encoder contained in the visual language model to obtain the corresponding normalized image embedding; using the text encoder contained in the visual language model, the preset content prompts and style prompts in the text prompts are encoded to form a set of positive semantic anchors, and the preset negative prompts in the text prompts are encoded to form a set of negative semantic anchors; the inner product similarity between the normalized image embedding and each semantic anchor in the set of positive semantic anchors and the set of negative semantic anchors is calculated to construct the semantic guidance loss; the preliminary denoising latent variable is updated for a fixed number of steps along the negative gradient direction of the semantic guidance loss, and the semantic correction latent variable is output.
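
The guidance and correction steps of claim 6 can be illustrated with a toy example in which vector normalization stands in for the decoder plus image encoder, and a finite-difference gradient stands in for automatic differentiation. The loss form (mean negative-anchor similarity minus mean positive-anchor similarity), the step count, and the learning rate are assumptions for this sketch; the claim states only that the loss is built from inner product similarities.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def cfg(eps_uncond, eps_cond, scale):
    """Classifier-free guidance: move from the unconditional toward the conditional."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]

def semantic_loss(z, pos, neg, embed=normalize):
    """Assumed loss: low when z is close to positive anchors and far from negative ones."""
    e = embed(z)

    def sim(anchor):
        return sum(x * y for x, y in zip(e, anchor))

    return (sum(sim(a) for a in neg) / len(neg)
            - sum(sim(a) for a in pos) / len(pos))

def numeric_grad(f, z, h=1e-5):
    """Central finite differences; a real system would use autograd instead."""
    grad = []
    for i in range(len(z)):
        zp = list(z)
        zm = list(z)
        zp[i] += h
        zm[i] -= h
        grad.append((f(zp) - f(zm)) / (2.0 * h))
    return grad

def semantic_correct(z, pos, neg, steps=3, lr=0.1):
    """Fixed number of gradient descent steps on the semantic guidance loss."""
    for _ in range(steps):
        g = numeric_grad(lambda v: semantic_loss(v, pos, neg), z)
        z = [zi - lr * gi for zi, gi in zip(z, g)]
    return z
```

Running a fixed, small number of steps keeps the correction cheap at every time step while still nudging the latent toward the content and style prompts.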

7. The image style transfer method based on a distribution calibration competitive attention mechanism according to claim 1, characterized in that, in S5, repeating steps S3 and S4 until the preset denoising time steps are completed and then inputting the latent variables after denoising into the decoder for image reconstruction specifically includes the following steps: following the discrete reverse denoising time step sequence formed by the preset denoising time steps, the semantic correction latent variable output at the current time step is used as the input of the next time step, and the process is iterated until the time step reaches zero, whereupon the latent variables after denoising are output; the latent variables after denoising are input into the decoder; the decoder maps the latent variables back to pixel space and outputs the stylized image.
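
The iteration in claim 7 amounts to a simple driver loop. The sketch below is illustrative only: `denoise_step`, `semantic_correct`, and `decode` are hypothetical callables standing in for steps S3 and S4 and for the decoder, and `timesteps` is an assumed descending sequence that ends at zero.

```python
def run_reverse_denoising(z_init, timesteps, denoise_step, semantic_correct, decode):
    """Iterate S3 and S4 over a descending time step sequence, then decode."""
    z = z_init
    for t in timesteps:
        # S3 and the first half of S4: attention aggregation and preliminary denoising.
        z_prelim = denoise_step(z, t)
        # Second half of S4: the corrected latent becomes the next step's input.
        z = semantic_correct(z_prelim, t)
    # S5: map the final latent back to pixel space.
    return decode(z)
```

For example, with `timesteps = [3, 2, 1, 0]` the loop runs four times and hands the final latent to `decode` once.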