Image style representation learning method and device based on text inversion and content semantic separation
By performing data augmentation and adapter network processing on the reference style image, the problem of content semantic leakage in style transfer is solved, the separation of content semantics in the stylized image is achieved, and the purity of the generated image is improved.
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Patents(China)
- Current Assignee / Owner
- ZHEJIANG GONGSHANG UNIVERSITY
- Filing Date
- 2024-12-19
- Publication Date
- 2026-06-19
AI Technical Summary
In existing style transfer methods based on text inversion, the style text representation often contains the content semantic information of the reference style image, which makes the generated stylized image prone to content semantic interference from the reference style image.
Style-invariant, content-destructive data augmentation and content-invariant, style-transformation data augmentation are applied to the reference style image, and style adapter and content adapter networks map the resulting style features and content features to the text space. The resulting text representations are then injected into a text-to-image diffusion model through a text encoder and cross-attention mechanism, and corresponding text representation regularization losses are constructed to suppress semantic interference.
It effectively suppresses the semantic content of reference style images mixed in stylized images, achieves separation of content semantics, and improves the purity of style transfer.
Figure CN119339090B_ABST
Abstract
Description
Technical Field
[0001] This invention relates to the field of image style transfer, and in particular to a method and apparatus for learning image style representations based on text inversion and content semantic separation.
Background Technology
[0002] In recent years, large-scale pre-trained text-to-image diffusion models have made progress in the field of text-to-image generation, and can generate realistic and diverse image content driven by text. Currently, a class of style transfer methods based on text-to-image diffusion models typically first learns a style text representation that characterizes the style of a reference style image using text inversion techniques. After fine-tuning, this style text representation is used to conditionally control the image generation process to achieve style transfer. However, the style text representations learned by existing methods often contain semantic information from the reference style image, making the generated stylized images susceptible to semantic interference from the reference style image. This problem is known as the content leakage problem in style representation learning. Therefore, the key to solving this problem is how to separate semantic information from the style text representation during style text representation learning based on text inversion.
Summary of the Invention
[0003] The purpose of this invention is to address the shortcomings of existing technologies by proposing a content semantic separation image style representation learning method and apparatus based on text inversion, so as to solve the problem of content semantic leakage during the image style representation learning process.
[0004] The objective of this invention is achieved through the following technical solution: Firstly, this invention provides a content semantic separation image style representation learning method based on text inversion, the method comprising the following steps:
[0005] (1) Obtain the original reference style image, and extract style features and content features based on the image style encoder network and the image content encoder network, respectively;
[0006] (2) Data augmentation that preserves style features but destroys content semantics is performed on the original reference style image, and style features are extracted based on the image style encoder network;
[0007] (3) Perform style transformation data augmentation on the original reference style image while preserving the semantic content, and extract content features based on the image content encoder network;
[0008] (4) Based on the style adapter network and content adapter network, the style features and content features obtained in steps (1) to (3) are mapped to the text space to obtain the corresponding style text representation and content semantic text representation; these representations are injected into the text-to-image diffusion model through the text encoder and cross-attention mechanism, and training is performed based on the text inversion method;
[0009] (5) Extract the style features of the reference style image based on the image style encoder network, obtain the final style text representation of the reference style image based on the trained style adapter network, and use this style text representation to perform image style transfer.
[0010] Furthermore, in step (2), data augmentation of the original reference style image includes the following steps:
[0011] (2-1) Set the size of the image block according to the pixel size of the original reference style image and the input requirements of the image style encoder used;
[0012] (2-2) Based on the set image block size, the original reference style image is segmented according to the raster scan order from the top left to the bottom right, and the image block order is marked;
[0013] (2-3) The image blocks are rearranged randomly, destroying the semantic information of the original reference style image while retaining the style information of the original reference style image, and then reassembled into a set of data-enhanced images.
[0014] Furthermore, in step (3), data augmentation is performed on the original reference style image to obtain a set of images that retain the content semantic information of the original reference style image while changing its style information. Specifically, this includes the following methods:
[0015] a. Convert the original reference style image to grayscale to obtain the corresponding grayscale image, and then apply Gaussian blur of the desired degree.
[0016] b. Perform random color transformation on the original reference style image, and then apply Gaussian blur of the desired degree.
[0017] Furthermore, style information includes global color distribution and local texture details.
[0018] Furthermore, the style features of the original reference style image are extracted based on the image style encoder network, and the style features are passed through a style adapter network to obtain the final style text representation, which excludes the semantic interference of the content in the original reference style image.
[0019] Furthermore, in step (4), the training process and loss function are designed as follows:
[0020] Using the style text representation and content semantic text representation obtained from the original reference style image as conditions, the original reference style image is trained using a text inversion method, with the loss function being the reference style image loss.
[0021] Using the style text representation obtained based on the style features in step (2) as a condition, the data augmented image in step (2) is trained using a text inversion method, with the loss function being the overall style loss;
[0022] Using the semantic text representation of the content obtained based on the content features in step (3) as a condition, the data-enhanced image in step (3) is trained using a text inversion method, with the loss function being the overall content loss;
[0023] A style consistency loss constrains the style text representation obtained from the original reference style image to be similar to the style text representation of the images obtained by data augmentation in step (2), and far from the style text representation of the images obtained by data augmentation in step (3); a content consistency loss constrains the content semantic text representation obtained from the original reference style image to be similar to the content semantic text representation of the images obtained by data augmentation in step (3).
[0024] Furthermore, in step (5), the style text representation obtained from the reference style image is used as a condition for text-to-image style transfer. Specifically, a text prompt describing the content to be generated is constructed, and the style text representation is appended to it, so that the final prompt takes the form "[content prompt], [style text representation]" or "[content prompt] in the style of [style text representation]"; this prompt is fed into a pre-trained text-to-image diffusion model for text-to-image generation. Alternatively, a reference content image is passed through DDIM inversion to obtain the corresponding inversion noise, this inversion noise is used as the initial noise, and image style transfer is performed with the style text representation as the condition.
[0025] Secondly, the present invention also provides a content semantic separation image style representation learning device based on text inversion, including a memory and one or more processors. The memory stores executable code, and when the processor executes the executable code, it implements the content semantic separation image style representation learning method based on text inversion.
[0026] Thirdly, the present invention also provides a computer-readable storage medium having a program stored thereon, which, when executed by a processor, implements the aforementioned content semantic separation image style representation learning method based on text inversion.
[0027] Fourthly, the present invention also provides a computer program product, including a computer program, which, when executed by a processor, implements the aforementioned content semantic separation image style representation learning method based on text inversion.
[0028] The beneficial effects of the present invention are as follows: the present invention can alleviate the semantic interference of the reference style image in style representation learning based on text inversion, thereby suppressing the content leakage problem in which stylized images generated by current methods are often mixed with the semantic content of the reference style image.
Attached Figure Description
[0029] To more clearly illustrate the technical solutions in the embodiments of the present invention or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.
[0030] Figure 1 is a schematic diagram of the content semantic separation image style representation learning method based on text inversion provided by the present invention.
[0031] Figure 2 is a schematic diagram of the style representation learning and training method based on text inversion provided by the present invention.
[0032] Figure 3 is a schematic diagram of the style-invariant, content-destructive data augmentation and the content-invariant, style-transformation data augmentation methods provided by the present invention.
[0033] Figure 4 is a schematic diagram of the text-to-image stylized generation results provided by the present invention.
[0034] Figure 5 is a structural diagram of the content semantic separation image style representation learning device based on text inversion provided by the present invention.
Detailed Implementation
[0035] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described below with reference to the accompanying drawings and examples. It should be understood that the specific examples described herein are merely illustrative and not intended to limit the invention.
[0036] To mitigate the content leakage problem in stylized representation learning based on text inversion, this invention proposes a content-semantic separation image stylized representation learning method based on text inversion. This method aims to separate the content-semantic information of a reference style image from the learned style text representation. The proposed method performs style-invariant content-destructive data augmentation and content-invariant style transformation data augmentation on the reference style image, respectively. By learning text representations based on text inversion from these two augmented images and the reference style image, and constructing corresponding text representation regularization losses, a content-semantic-independent style text representation of the reference style image is finally obtained.
[0037] As shown in Figure 1, the present invention proposes a content semantic separation image style representation learning method based on text inversion, the specific steps of which are as follows:
[0038] (1) Apply to the reference style image the style-invariant, content-destructive data augmentation shown in Figure 3, hereinafter referred to as Data Augmentation A. This produces a set of images whose style representation is the same as that of the original reference style image but whose content semantics are destroyed. Data Augmentation A is implemented as follows: the original reference style image is cut into patches, and the cut patches are then reassembled.
[0039] (1-1) In general, the patch size should match the level of detail of the image. In this task, the purpose of cutting the image into patches is to destroy its content structure while preserving and capturing its local style features. The specific choice depends on the size of the reference image and the input requirements of the image style encoder; sizes such as 16×16, 32×32, or 64×64 pixels can be considered.
[0040] (1-2) Based on the pre-defined image block size, the original reference style image is segmented according to the raster scan order from the top left to the bottom right. Specifically, this process involves dividing the entire image into several fixed-size image blocks. During segmentation, each image block is labeled to maintain the order information of the image blocks. These labels are usually numbered based on the position of the image block in the entire image; for example, starting from the first block in the top left corner, the sequence number of each block is labeled row by row.
[0041] (1-3) Assume the original image is divided into N patches, indexed by i = 1, 2, …, N. A random number generation algorithm is used to randomly rearrange the patch sequence, and the patches are then reassembled into an image according to the shuffled sequence. This disrupts the overall structure of the image but preserves the style information of the original image, such as global color distribution and local texture details.
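Steps (1-1) to (1-3) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the image is a plain list of pixel rows whose sides are exact multiples of the patch size, and the names `shuffle_patches` and `augment_a` are hypothetical.

```python
import random


def shuffle_patches(image, patch, seed=None):
    """Cut a 2-D image (list of rows) into patch x patch blocks in raster order,
    permute the blocks at random, and reassemble them on the same grid."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be multiples of the patch size")
    rows, cols = height // patch, width // patch

    # Raster scan: block k sits at grid cell (k // cols, k % cols).
    blocks = []
    for k in range(rows * cols):
        top, left = (k // cols) * patch, (k % cols) * patch
        blocks.append([row[left:left + patch] for row in image[top:top + patch]])

    # Random permutation of the block labels.
    order = list(range(rows * cols))
    random.Random(seed).shuffle(order)

    # Reassemble: grid cell k receives the block originally labelled order[k].
    out = [[None] * width for _ in range(height)]
    for k, src in enumerate(order):
        top, left = (k // cols) * patch, (k % cols) * patch
        for dy, block_row in enumerate(blocks[src]):
            out[top + dy][left:left + patch] = block_row
    return out, order


def augment_a(image, patch, m, seed=0):
    """Produce m independently shuffled copies of the reference image."""
    return [shuffle_patches(image, patch, seed + i)[0] for i in range(m)]
```

Because the shuffle only moves whole blocks, every pixel value survives, so the global color distribution is unchanged and texture inside each block is intact, while the spatial layout that carries content semantics is broken.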
[0042] (2) Construct an image style encoder network for style feature extraction. A pre-trained CLIP-ViT image encoder or a DINO-ViT model can be used as the image style encoder network. Specifically, the [cls-token] feature vector output by the last layer after the image is fed into the network can be taken as the style feature.
[0043] The inputs are the m images from step (1), whose style representation is unchanged but whose content semantics are destroyed. The process can be described as follows:
[0044]
[0045] Since the output is a set of image feature vectors, the style features of the individual images must be integrated into a unified style feature for the set in order to preserve the overall style of the image set. Specifically, the style features of the individual images are summed and averaged, as follows:
[0046]
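The two operations described above, per-image feature extraction followed by averaging, can be written compactly as follows. All symbols are assumptions introduced here for illustration and are not the patent's own notation: $E_s$ is the image style encoder, $x^{A}_{i}$ is the $i$-th image produced by Data Augmentation A, $f^{A}_{i}$ is its [cls-token] style feature, and $\bar{f}^{A}$ is the unified style feature of the set.

```latex
f^{A}_{i} = E_{s}\left(x^{A}_{i}\right), \quad i = 1, \dots, m,
\qquad
\bar{f}^{A} = \frac{1}{m} \sum_{i=1}^{m} f^{A}_{i}
```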
[0047] (3) Construct a learnable style adapter network that maps the style features to the text space, yielding a style text representation.
[0048] (3-1) The style adapter network uses multiple learnable layers, typically including fully connected layers, normalization layers, or modules based on self-attention mechanisms, to convert the style features into a unified style text representation. This can be represented as:
[0049]
[0050] (3-2) Through the text encoder of a multimodal model (such as CLIP) and a cross-attention mechanism (query vector Q, key vector K, value vector V), the style text representation is injected as a condition into a pre-trained text-to-image diffusion model (such as Stable Diffusion).
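The injection path in step (3-2) can be illustrated with a toy sketch: the learned style text representation takes the place of a placeholder token's embedding, and image-side queries attend over the text-side keys and values. The sketch below is an assumption-laden simplification. It skips the text encoder and the learned Q/K/V projection matrices that a real model such as Stable Diffusion applies, and the function names are hypothetical.

```python
import math


def inject_style_token(token_embeddings, position, style_embedding):
    """Return a copy of the prompt embeddings with one slot replaced by the
    learned style text representation (the original list is left untouched)."""
    updated = [list(vec) for vec in token_embeddings]
    updated[position] = list(style_embedding)
    return updated


def softmax(scores):
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each image-side query attends over the
    text-side keys and returns a weighted mix of the text-side values."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) * scale for k in keys]
        weights = softmax(scores)
        outputs.append([sum(w * v[j] for w, v in zip(weights, values))
                        for j in range(len(values[0]))])
    return outputs
```

In this simplified view, the style representation influences generation only through the keys and values, which is why the adapter output must live in the same space as ordinary text token embeddings.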
[0051] (4) Apply to the original reference style image the content-invariant, style-transformation data augmentation shown in Figure 3, hereinafter referred to as Data Augmentation B. This produces a set of images whose style representation differs from that of the original reference style image but whose content semantics remain consistent. Several methods are described below, which can be adopted flexibly as the situation requires:
[0052] (4.1) Convert the reference style image to grayscale to obtain a grayscale image, and then apply a moderate Gaussian blur. This blurs the texture details in the image while preserving its overall structure.
[0053] (4.2) Perform a random color transformation on the reference style image, and then apply a moderate Gaussian blur. The color transformation can be achieved using ColorJitter or color histogram matching.
[0054] The set of n images obtained by the above data augmentation retains the semantic information of the original reference style image but changes its style information, such as global color distribution and local texture details.
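Methods (4.1) and (4.2) can be sketched as follows. This is a minimal, dependency-free illustration under several assumptions: the image is a list of rows of RGB tuples with values in 0 to 255, the grayscale conversion uses the common BT.601 luma weights, the random color transformation is a simple per-channel gain standing in for ColorJitter, and all function names and default parameters are hypothetical.

```python
import math
import random


def to_gray(image):
    """RGB to grayscale using BT.601 luma weights, kept as three equal channels."""
    gray = []
    for row in image:
        new_row = []
        for r, g, b in row:
            y = 0.299 * r + 0.587 * g + 0.114 * b
            new_row.append((y, y, y))
        gray.append(new_row)
    return gray


def random_color_transform(image, rng, strength=0.4):
    """Scale each channel by a random gain (a crude stand-in for ColorJitter)."""
    gains = [1.0 + rng.uniform(-strength, strength) for _ in range(3)]
    return [[tuple(min(255.0, max(0.0, c * g)) for c, g in zip(px, gains))
             for px in row] for row in image]


def gaussian_kernel(sigma):
    radius = max(1, int(3 * sigma))
    weights = [math.exp(-(i * i) / (2 * sigma * sigma))
               for i in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def blur_1d(line, kernel):
    """Convolve one line of pixels with the kernel, clamping at the borders."""
    radius, last = len(kernel) // 2, len(line) - 1
    out = []
    for i in range(len(line)):
        acc = [0.0, 0.0, 0.0]
        for k, w in enumerate(kernel):
            px = line[min(last, max(0, i + k - radius))]
            for c in range(3):
                acc[c] += w * px[c]
        out.append(tuple(acc))
    return out


def gaussian_blur(image, sigma=1.0):
    """Separable Gaussian blur: rows first, then columns."""
    kernel = gaussian_kernel(sigma)
    rows = [blur_1d(row, kernel) for row in image]
    cols = [blur_1d(list(col), kernel) for col in zip(*rows)]
    return [list(row) for row in zip(*cols)]


def augment_b(image, n, sigma=1.0, seed=0):
    """Alternate between method (4.1) and method (4.2) to build n images."""
    rng = random.Random(seed)
    out = []
    for i in range(n):
        base = to_gray(image) if i % 2 == 0 else random_color_transform(image, rng)
        out.append(gaussian_blur(base, sigma))
    return out
```

Both branches leave the spatial layout of the image in place, so content semantics survive, while color statistics and fine texture, the two style cues named in the text, are altered.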
[0055] (5) Construct an image content encoder network for content feature extraction. A pre-trained CLIP-ViT image encoder or a DINO-ViT model can be used as the image content encoder network. Specifically, the key-value feature vector extracted from the last self-attention layer after the image is fed into the network is used as the content feature, which can be represented as:
[0056]
[0057] Since the output is a set of image feature vectors, the content features of the individual images must be integrated into a unified content feature for the set in order to preserve the overall image content. Specifically, the content features of the individual images are summed and averaged, as follows:
[0058]
[0059] (6) Construct a learnable content adapter network that maps the content features to the text space, yielding the content text representation. Through a text encoder and a cross-attention mechanism (query vector Q, key vector K, value vector V), the content text representation is injected as a condition into the pre-trained text-to-image diffusion model. The specific process is the same as in step (3).
[0060] (7) For the reference style image, the image style encoder network and image content encoder network constructed in steps (2) and (5) are used to extract the corresponding style features and content features, respectively, as follows:
[0061]
[0062] (8) Use the style adapter network and content adapter network from steps (3) and (6) to map the style and content features to the text space, obtaining the style text representation and content semantic text representation of the reference style image, respectively. Then, through a text encoder and a cross-attention mechanism (query vector Q, key vector K, value vector V), these two representations are injected as conditions into the pre-trained text-to-image diffusion model. For the specific process, refer to steps (3) and (6).
[0063] (9) With the above text representations as conditions, loss functions are constructed and the style adapter network and content adapter network are trained using a text inversion method, as shown in Figure 2:
[0064] Reference style image loss: with the style text representation and content semantic text representation of the reference style image as conditions, training is performed on the reference style image using a text inversion method, and the loss is constructed using the diffusion model, as follows:
[0065]
[0066] where the diffusion model is a pre-trained text-to-image diffusion model, such as Stable Diffusion; the text encoder is a pre-trained text encoder, such as CLIP's text encoder; and the noisy image is the result of adding t steps of noise to the reference style image using the diffusion model.
[0067] Overall style loss: with the style text representation obtained from the Data Augmentation A images as the condition, training is performed on this set of images using a text inversion method, and the loss is constructed using the diffusion model, as follows:
[0068]
[0069] where the noisy image is the result of adding t steps of noise to an image in the Data Augmentation A set using the diffusion model.
[0070] Overall content loss: with the content semantic text representation obtained from the Data Augmentation B images as the condition, training is performed on this set of images using a text inversion method, and the loss is constructed using the diffusion model, as follows:
[0071]
[0072] where the noisy image is the result of adding t steps of noise to an image in the Data Augmentation B set using the diffusion model.
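The three reconstruction losses described above share one structure: the diffusion model predicts the noise added to an image, conditioned on a text representation. A hedged sketch using the standard noise-prediction objective of text inversion is given below. The notation is assumed here and is not the patent's own: $\epsilon_\theta$ is the pre-trained diffusion model's noise predictor, $\tau$ is the pre-trained text encoder, $x_t$ is the reference style image after $t$ noising steps, $x^{A}_{i,t}$ and $x^{B}_{j,t}$ are the noised Data Augmentation A and B images, $s$ and $c$ are the style and content text representations of the reference image, and $s^{A}$ and $c^{B}$ are those of the augmented sets. Averaging over each set is also an assumption.

```latex
\begin{aligned}
\mathcal{L}_{\mathrm{ref}} &= \mathbb{E}_{\epsilon \sim \mathcal{N}(0, I),\, t}
  \left\| \epsilon - \epsilon_{\theta}\!\left(x_{t},\, t,\, \tau(s, c)\right) \right\|_2^2, \\
\mathcal{L}_{\mathrm{style}} &= \frac{1}{m} \sum_{i=1}^{m} \mathbb{E}_{\epsilon,\, t}
  \left\| \epsilon - \epsilon_{\theta}\!\left(x^{A}_{i,t},\, t,\, \tau(s^{A})\right) \right\|_2^2, \\
\mathcal{L}_{\mathrm{content}} &= \frac{1}{n} \sum_{j=1}^{n} \mathbb{E}_{\epsilon,\, t}
  \left\| \epsilon - \epsilon_{\theta}\!\left(x^{B}_{j,t},\, t,\, \tau(c^{B})\right) \right\|_2^2
\end{aligned}
```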
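[0073] Style consistency loss: this loss constrains the style text representation of the reference style image to be similar to the style text representation of the Data Augmentation A images and far from the style text representation of the Data Augmentation B images: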
[0074]
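[0075] where the style text representation of the Data Augmentation B images is obtained by feeding the images produced by Data Augmentation B in step (4) into the image style encoder and style adapter network.
[0076] Content consistency loss: this loss constrains the content semantic text representation of the reference style image to be similar to the content semantic text representation of the Data Augmentation B images: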
[0077]
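The text specifies only the direction of the two consistency constraints (pull together, push apart), not their functional form. One plausible instantiation, offered purely as an assumption, uses cosine similarity with a hinge on the push-apart term. The function names and the `margin` parameter are hypothetical.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def style_consistency_loss(s_ref, s_aug_a, s_aug_b, margin=0.0):
    """Pull the reference style representation toward the Augmentation A one
    and push it away from the Augmentation B one."""
    pull = 1.0 - cosine(s_ref, s_aug_a)                 # 0 when perfectly aligned
    push = max(0.0, cosine(s_ref, s_aug_b) - margin)    # 0 once similarity <= margin
    return pull + push


def content_consistency_loss(c_ref, c_aug_b):
    """Pull the reference content representation toward the Augmentation B one."""
    return 1.0 - cosine(c_ref, c_aug_b)
```

The intuition is that Augmentation A images share only style with the reference, and Augmentation B images share only content, so agreement with A and disagreement with B leaves the style representation with as little content information as possible.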
[0078] (10) After training is complete, the user can use the learned style text representation as a condition for text-to-image style transfer. Specifically, the user constructs a text prompt describing the content to be generated and appends the style text representation to it, so that the final prompt takes the form "[content prompt], [style text representation]" or "[content prompt] in the style of [style text representation]"; this prompt is fed into a pre-trained text-to-image diffusion model for style transfer. Alternatively, a reference content image is passed through DDIM inversion to obtain the corresponding inversion noise, this inversion noise is used as the initial noise, and image style transfer is performed with the style text representation as the condition. For example, as shown in Figure 4, when a user inputs the text "A house covered with ice and snow," the generated image combines the style of the reference style image with the content described by the user's text. Compared with traditional methods, the text-guided generation results obtained by this invention effectively suppress the semantic content of the reference style image that would otherwise be mixed into the stylized image.
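The two prompt forms described in step (10) can be illustrated with a small helper. This is a sketch under assumptions: `build_prompt` is a hypothetical name, and the placeholder string `<style>` stands in for the learned style text representation, which in a real pipeline is an embedding bound to a placeholder token rather than literal text.

```python
def build_prompt(content_prompt, style_token="<style>", template="comma"):
    """Attach the style placeholder to a content prompt in one of the two
    forms described in the text."""
    if template == "comma":
        return f"{content_prompt}, {style_token}"
    if template == "in_the_style_of":
        return f"{content_prompt} in the style of {style_token}"
    raise ValueError(f"unknown template: {template}")


# Example using the prompt from Figure 4.
prompt = build_prompt("A house covered with ice and snow", template="in_the_style_of")
```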
[0079] Corresponding to the aforementioned embodiment of a content semantic separation image style representation learning method based on text inversion, the present invention also provides an embodiment of a content semantic separation image style representation learning device based on text inversion.
[0080] Referring to Figure 5, the present invention provides a content semantic separation image style representation learning device based on text inversion, comprising a memory and one or more processors. The memory stores executable code, and when the processor executes the executable code, it implements the content semantic separation image style representation learning method based on text inversion described in the above embodiments.
[0081] The embodiment of the content semantic separation image style representation learning device based on text inversion provided by this invention can be applied to any device with data processing capabilities, such as a computer. The device embodiment can be implemented in software, hardware, or a combination of both. Taking software implementation as an example, as a logical device, it is formed by the processor of the data processing device on which it resides loading the corresponding computer program instructions from non-volatile memory into memory for execution. From a hardware perspective, Figure 5 shows a hardware structure diagram of a device with data processing capabilities on which the content semantic separation image style representation learning device based on text inversion provided by the present invention resides. In addition to the processor, memory, network interface, and non-volatile memory shown in Figure 5, the data processing device in the embodiment may also include other hardware depending on its actual function, which will not be described in detail here.
[0082] The specific implementation process of the functions and roles of each unit in the above device can be found in the implementation process of the corresponding steps in the above method, and will not be repeated here.
[0083] For the device embodiments, since they basically correspond to the method embodiments, the relevant parts can be referred to in the description of the method embodiments. The device embodiments described above are merely illustrative. The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units, that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of the present invention according to actual needs. Those skilled in the art can understand and implement this without creative effort.
[0084] This invention also provides a computer-readable storage medium storing a program thereon, which, when executed by a processor, implements a content semantic separation image style representation learning method based on text inversion as described in the above embodiments.
[0085] The computer-readable storage medium can be an internal storage unit of any data processing device described in any of the foregoing embodiments, such as a hard disk or memory. The computer-readable storage medium can also be an external storage device of any data processing device, such as a plug-in hard disk, smart media card (SMC), SD card, flash card, etc., equipped on the device. Furthermore, the computer-readable storage medium can include both internal storage units and external storage devices of any data processing device. The computer-readable storage medium is used to store the computer program and other programs and data required by the data processing device, and can also be used to temporarily store data that has been output or will be output.
[0086] The present invention also provides a computer program product, including a computer program that, when executed by a processor, implements the aforementioned content semantic separation image style representation learning method based on text inversion.
[0087] The above embodiments are used to explain and illustrate the present invention, but not to limit the present invention. Any modifications and changes made to the present invention within the spirit and scope of the claims shall fall within the protection scope of the present invention.
Claims
1. A content semantic separation image style representation learning method based on text inversion, characterized in that the method includes the following steps: (1) Obtain the original reference style image, and extract style features and content features based on the image style encoder network and the image content encoder network, respectively; (2) Data augmentation that preserves style features but destroys content semantics is performed on the original reference style image. The style features of the original reference style image are extracted based on the image style encoder network, and the style features are passed through a style adapter network to obtain the final style text representation. This style text representation eliminates the content semantic interference in the original reference style image. (3) Perform style transformation data augmentation on the original reference style image while preserving the semantic content, and extract content features based on the image content encoder network; (4) Based on the style adapter network and content adapter network, the style features and content features obtained in steps (1) to (3) are mapped to the text space to obtain the corresponding style text representation and content semantic text representation; these representations are injected into the text-to-image diffusion model through the text encoder and cross-attention mechanism, and the model is trained based on the text inversion method; the training process and loss function design are as follows: Using the style text representation and content semantic text representation obtained from the original reference style image as conditions, the original reference style image is trained using a text inversion method, with the loss function being the reference style image loss.
Using the style text representation obtained based on the style features in step (2) as a condition, the data-augmented image in step (2) is trained using a text inversion method, with the loss function being the overall style loss; Using the content semantic text representation obtained based on the content features in step (3) as a condition, the data-augmented image in step (3) is trained using a text inversion method, with the loss function being the overall content loss; A style consistency loss constrains the style text representation obtained from the original reference style image to be similar to the style text representation of the image obtained by data augmentation in step (2), and far from the style text representation of the image obtained by data augmentation in step (3); a content consistency loss constrains the content semantic text representation obtained from the original reference style image to be similar to the content semantic text representation of the image obtained by data augmentation in step (3); (5) Extract style features from the reference style image based on the image style encoder network, obtain the final style text representation of the reference style image based on the trained style adapter network, and use the style text representation obtained from the reference style image as a condition for text-to-image style transfer; specifically, a text prompt describing the content to be generated is constructed, and the style text representation is appended to it, so that the final prompt takes the form "[content prompt], [style text representation]" or "[content prompt] in the style of [style text representation]", and this prompt is fed into a pre-trained text-to-image diffusion model for text-to-image generation; or a reference content image is passed through DDIM inversion to obtain the corresponding inversion noise, this inversion noise is used as the initial noise, and image style transfer is performed with the style text representation as the condition.
2. The content semantic separation image style representation learning method based on text inversion according to claim 1, characterized in that, In step (2), data augmentation of the original reference style image includes the following steps: (2-1) Set the size of the image block according to the pixel size of the original reference style image and the input requirements of the image style encoder used; (2-2) Based on the set image block size, the original reference style image is segmented according to the raster scan order from the top left to the bottom right, and the image block order is marked; (2-3) The image blocks are rearranged randomly, destroying the semantic information of the original reference style image while retaining the style information of the original reference style image, and then reassembled into a set of data-enhanced images.
3. The method of claim 1, wherein, in step (3), data augmentation is performed on the original reference style image to obtain a set of images that retain the content semantic information of the original reference style image while changing its style information. Specifically, the following methods are included: a. Convert the original reference style image to grayscale to obtain the corresponding grayscale image, and then apply Gaussian blur of the desired degree; b. Perform random color transformation on the original reference style image, and then apply Gaussian blur of the desired degree.
4. The content semantic separation image style representation learning method based on text inversion according to claim 2 or 3, characterized in that, Style information includes global color distribution and local texture details.
5. A content semantic separation image style representation learning device based on text inversion, comprising a memory and one or more processors, wherein the memory stores executable code, and when the one or more processors execute the executable code, they implement the content semantic separation image style representation learning method based on text inversion as described in any one of claims 1-4.
6. A computer-readable storage medium having stored thereon a program, characterized in that, When the program is executed by the processor, it implements a content semantic separation image style representation learning method based on text inversion as described in any one of claims 1-4.
7. A computer program product comprising a computer program, characterized in that, When the computer program is executed by the processor, it implements the content semantic separation image style representation learning method based on text inversion as described in any one of claims 1-4.