A scribble face image editing method based on a multi-modal interaction module

By using a multimodal interaction module based on a transformer network, combined with deep interaction between graffiti and facial images, the problem of realism in facial image editing and preservation of identity features in existing technologies is solved, achieving efficient and realistic editing results.

CN117745860B · Active · Publication Date: 2026-06-23 · HUNAN UNIV

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: HUNAN UNIV
Filing Date: 2023-12-25
Publication Date: 2026-06-23

AI Technical Summary

Technical Problem

Existing face image editing algorithms have shortcomings in terms of the realism of the editing effect and the preservation of the original face identity features. In particular, supervised methods have limited editing capabilities, unsupervised methods are uncontrollable, sketch-line-based methods have poor results, and text-driven methods are difficult to intuitively express user intentions.

Method used

It employs a dual-path multimodal interaction module and a single-path multimodal interaction module, both based on a transformer network structure. Deep interaction between user doodles and facial images semantically aligns the doodle with the image, and a texture supplementation module preserves the original facial identity features.

Benefits of technology

It enables the generation of realistic editing effects under large-area graffiti input, while maintaining the integrity of the original facial identity features, thus enhancing the realism and effectiveness of the editing.

✦ Generated by Eureka AI based on patent content.


Abstract

The present application relates to the technical field of face image editing, and in particular to a scribble-based face image editing method built on multimodal interaction modules. A dual-path multimodal interaction module computes cross-attention maps along the position dimension and the channel dimension. The resulting cross-attention maps and the scribble vector are used to iteratively modify the latent vector of the face image (latent space mapping), which aligns the semantic content of the scribble with the target editing position in the image and embeds the scribble into the corresponding latent space. A single-path multimodal interaction module preserves the original face identity features after editing: the texture supplement vector it produces iteratively modifies the latent vector of the face image through latent space mapping, and the face image editing result is finally generated. The present application expresses the user's editing intention more intuitively and fully, achieves realistic editing results that meet the user's expectations, and shows advantages in editing effect, editing realism, and preservation of face identity information.

Description

Technical Field

[0001] This invention relates to the technical field of face image editing, and in particular to a doodle face image editing method based on a multimodal interaction module.

Background Technology

[0002] As facial images encompass crucial human information, they occupy an extremely important position in social interactions. Thanks to the rapid development of internet technology and the widespread popularity of social media and short video platforms, editing and sharing facial images has become an essential part of most people's social interactions. In the field of deep learning, generative adversarial networks (GANs) have achieved remarkable success in image generation. In particular, in recent years, scholars have proposed a novel GAN architecture, StyleGAN, renowned for its powerful image generation capabilities and its semantically rich, disentangled latent space. As a result, a growing number of studies use pre-trained StyleGAN models for image editing.

[0003] Facial image editing refers to the interpretable editing of facial images according to target attributes. Attributes refer to the shape and style of various parts of the face in the image. Interpretable editing means modifying the attributes of the facial image so that the resulting changes can be explained by humans, such as changing hair color or eye shape. The goal of facial image editing is to meet the user's editing intentions and achieve the expected editing effect for a given facial image. The key challenge lies in achieving the expected editing effect while ensuring the realism of the edited result and preserving the original facial identity features.

[0004] Currently, mainstream image editing algorithms fall into several categories: supervised editing methods, unsupervised semantic mining methods, sketch-line-based editing methods, and text-driven editing methods. Supervised editing methods often rely on pre-trained attribute classifiers and can only perform limited editing on a subset of attributes. Unsupervised methods are uncontrollable, so users cannot reliably achieve the desired editing results. Sketch-line-based editing methods allow free editing through user interaction but often produce poor stylistic effects. Text-driven editing methods edit images from user-input descriptive text; they perform well in terms of realism but are less effective at preserving the original facial features. Furthermore, descriptive text is not intuitive, which makes it difficult for users to convey their true editing intent.

Summary of the Invention

[0005] To address the aforementioned issues, this invention provides a doodle-based face image editing method based on a multimodal interaction module. The method aims to edit face images through interactive user doodles, as only flexible and intuitive hand-drawn doodles can fully express the user's editing intentions for face images.

[0006] The principle of this invention lies in addressing the issue of realistic editing results by employing a dual-path multimodal interaction module based on a transformer network structure, named the Scribble interactive editing module. First, the user-input doodle mask is converted into a query vector through an embedding layer network. Then, a pre-trained inversion encoder converts the face image to be edited into a latent vector, serving as the key and value for the multimodal interaction module. The query, key, and value of the dual-path multimodal interaction module calculate cross-attention maps from both the positional and channel dimensions. The resulting cross-attention maps and doodle vectors are then used to iteratively modify the latent vectors of the face image to perform latent space mapping, aligning the semantic content of the doodle with the target editing location in the image, and embedding the doodle into the corresponding latent space. In this way, through deep interaction between the doodle and the corresponding image, the generated editing result becomes more realistic.

[0007] To address the problem of preserving original facial identity features after editing, this invention employs a single-path multimodal interaction module, named the Texture Supplement Module. First, a texture mask for the editing region is obtained from the facial image based on the doodle mask. Then, this mask is converted into a query vector through an embedding layer network. The edit latent vector obtained from the Scribble interactive editing module serves as the key and value, and both are input into the single-path multimodal interaction module to calculate a cross-attention map. The resulting texture supplement vector iteratively modifies the latent vector of the facial image, performing latent space mapping to enhance the texture detail and the deep alignment of the edited image. Finally, the obtained edit offset is combined with the image latent vector, and a pre-trained generative network generates the final edited facial image. In this way, even with large-area doodle input, this invention can effectively combine doodles and image semantics, achieving realistic editing effects while supplementing detailed information, resulting in better preservation of original facial identity features.

[0008] This invention is a graffiti face image editing method based on a multimodal interaction module, including a graffiti texture information processing part, a graffiti editing part, a texture supplementation part, and an editing generation part, specifically including the following steps:

[0009] Step 1: Graffiti texture information processing. For the input graffiti, a mask image relative to the face image is calculated from the graffiti. Then, a texture mask containing the pixels of the edited region is obtained from the face image, based on the part of the face image that the graffiti covers. The graffiti mask and the texture mask each pass through an embedding layer network module that extracts their features and computes their vectors in the latent space (the graffiti mask embedding and the texture mask embedding). The input face image to be edited must be mapped into an intermediate latent space that is semantically rich and well disentangled. The face image is therefore fed into a pre-trained e4e inversion encoder, which computes its latent vectors in that space, where n, the number of layers in the image generator network, is 18 in this invention;

[0010] Step 2: Doodle editing. The goal is deep interaction between the face image and the user's doodle, producing an offset vector that carries the preliminary editing effect. A dual-path multimodal interaction module, the Scribble interactive editing module, is therefore designed to process the face image and the user doodle and obtain the preliminary editing result. The Scribble interactive editing module takes as input the latent vectors obtained by inverting the face image and the latent-space vector obtained from the graffiti mask. It relates the two inputs along both the channel dimension and the position dimension; deep interaction across these two dimensions integrates the graffiti and the image more closely. After this deep interaction, the module outputs an offset latent vector. In total, k Scribble interactive editing modules are used to modify the latent vectors iteratively;

[0011] Step 3: For the position dimension, the doodle mask embedding is set as the query, and the latent vector input to the i-th module is set as the key and the value. The interaction in the i-th block can be written as:

[0012]-[0014] (Formulas not reproduced in this text: the query, key, and value definitions, followed by the cross-attention and feed-forward computations.)

[0015] The symbols in these formulas denote the number of attention heads and the size of the hidden space. The query, key, and value projections are learnable projection heads in the cross-attention; they do not change the feature dimensions. A further learnable parameter adjusts the weights of the contributions from the different heads. Softmax is the normalized exponential function; the feed-forward network consists of fully connected layers and ReLU (rectified linear unit) activation functions, the ramp function commonly used as an activation in artificial neural networks. Interaction in the position dimension lets the doodle and the image build a semantically aligned latent space mapping. Furthermore, research indicates that a single semantic attribute is typically expressed jointly by multiple elements of the latent space. It is therefore also necessary to establish interaction in the channel dimension to strengthen the consistency of semantic information;
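
The position-dimension interaction described above has the general shape of multi-head cross-attention. The following is a minimal NumPy sketch under stated assumptions, not the patented implementation: the names `doodle_emb`, `image_latent`, `w_q`, `w_k`, `w_v`, and `w_o` are hypothetical, both inputs are assumed to have shape (18, 512), the weights are random rather than trained, and scaling by the square root of the head size is the conventional choice rather than something this text states.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_cross_attention(query_src, kv_src, w_q, w_k, w_v, w_o, num_heads):
    """query_src: (n_q, d); kv_src: (n_kv, d); all weights: (d, d)."""
    n_q, d = query_src.shape
    n_kv = kv_src.shape[0]
    head_dim = d // num_heads
    # Project, then split the feature axis into heads: (heads, tokens, head_dim).
    q = (query_src @ w_q).reshape(n_q, num_heads, head_dim).transpose(1, 0, 2)
    k = (kv_src @ w_k).reshape(n_kv, num_heads, head_dim).transpose(1, 0, 2)
    v = (kv_src @ w_v).reshape(n_kv, num_heads, head_dim).transpose(1, 0, 2)
    # One attention map per head over positions: (heads, n_q, n_kv).
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(head_dim), axis=-1)
    # Concatenate the heads and mix them with the output projection.
    heads = (attn @ v).transpose(1, 0, 2).reshape(n_q, d)
    return heads @ w_o, attn

rng = np.random.default_rng(0)
n, d, h = 18, 512, 8  # n = 18 layers as in the text; d and h are assumed values
doodle_emb = rng.standard_normal((n, d))    # assumed shape of the doodle embedding
image_latent = rng.standard_normal((n, d))  # assumed shape of the inverted latent
w_q, w_k, w_v, w_o = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(4))

out, attn = multi_head_cross_attention(
    doodle_emb, image_latent, w_q, w_k, w_v, w_o, num_heads=h
)
```

Because the projections are square, the output keeps the shape of the query, which is what allows it to be applied to the image latent vector in later steps.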

[0016] Step 4: For the channel dimension, the same key and value as in the position dimension are used, together with a new query defined in the channel dimension according to the following equation. The interaction is computed in the same way as in the formulas above;

[0017] (Formulas not reproduced in this text.)

[0018] Here, the query, key, and value projections share similar settings with those of the position dimension;
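
One common way to realize attention over channels is to build the attention map between feature channels instead of between positions. The sketch below assumes that reading; the text here does not give the exact formula, and the function name, shapes, and scaling factor are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def channel_cross_attention(q, k, v):
    """q, k, v: arrays of shape (n, d). Returns (output (n, d), attention map (d, d))."""
    n = q.shape[0]
    # Entry [c, c2] scores query channel c against key channel c2, summed over positions.
    attn = softmax(q.T @ k / np.sqrt(n), axis=-1)
    # Each output channel is a weighted mix of the value channels.
    return v @ attn.T, attn

rng = np.random.default_rng(0)
n, d = 18, 64  # d is reduced from a typical 512 to keep the example small
q = rng.standard_normal((n, d))
k = rng.standard_normal((n, d))
v = rng.standard_normal((n, d))
out, attn = channel_cross_attention(q, k, v)
```

The only structural difference from position attention is the transpose: the softmax normalizes over channels, so the map is d by d rather than n by n.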

[0019] Step 5: After the doodle and the image have been semantically aligned, the alignment results are fed into a fusion module for further refinement, which further improves editing capability. The fusion module is computed as shown in the following formulas, whose inputs are the alignment results of the multimodal interaction module in the position dimension and in the channel dimension, respectively. The formulas use an adaptive average pooling operation and a normalization operation, and one further term denotes the embedding after a linear layer;

[0020]-[0021] (Formulas not reproduced in this text.)

[0022] The output of the k-th Scribble interactive editing module is a latent vector offset containing the editing information;

[0023] Step 6: In the texture supplementation part, a single-path multimodal interaction module, the Texture Supplement Module, imposes identity constraints in order to preserve facial identity information and to reduce the subtle identity changes that latent space mapping causes during editing. The Texture Supplement Module takes as input the output of the Scribble interactive editing module and the texture mask embedding. As before, the texture mask embedding is set as the query, and the latent code offset is set as the key and the value. The query, key, and value projections share similar settings with those of the Scribble interactive editing module, and the multi-head cross-attention is computed as described previously.

[0024] (Query, key, and value definitions; formulas not reproduced in this text.)

[0025] Step 7: In the editing generation part, the edit vectors produced by the Scribble interactive editing module and the Texture Supplement Module are input into the pre-trained face generation network to obtain the final editing result. First, the latent vector offset carrying supplementary texture detail, obtained from the Texture Supplement Module, is used to optimize the final latent vector. The edited image is then generated. The generation process can be formalized as follows:

[0026]-[0028] (Formulas not reproduced in this text.)

[0029] Here, the generator is a pre-trained face image generation network, and a parameter controls the intensity of editing.
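
The role of the intensity parameter can be illustrated with a short sketch. It assumes the common additive form, in which the offset is scaled by the intensity and added to the inverted latent vector; this text does not confirm that exact form. `toy_generator` is a hypothetical stand-in for the pre-trained generation network, and all names and shapes are illustrative.

```python
import numpy as np

def apply_edit(latent, offset, alpha):
    """Shift the inverted latent by the edit offset, scaled by the intensity alpha."""
    return latent + alpha * offset

def toy_generator(latent, proj):
    """Hypothetical stand-in for a pre-trained generator: a fixed linear map plus tanh."""
    flat = np.tanh(latent.reshape(-1) @ proj)
    return flat.reshape(3, 8, 8)  # a tiny 3-channel "image"

rng = np.random.default_rng(0)
n, d = 18, 512
latent = rng.standard_normal((n, d))        # inverted face latent (assumed shape)
offset = 0.1 * rng.standard_normal((n, d))  # edit offset from the two modules
proj = rng.standard_normal((n * d, 3 * 8 * 8)) / np.sqrt(n * d)

unedited = toy_generator(apply_edit(latent, offset, 0.0), proj)
edited = toy_generator(apply_edit(latent, offset, 1.0), proj)
```

With an intensity of zero the latent vector is unchanged, so the generator reproduces the inverted input; larger values move the result further along the edit direction.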

[0030] The beneficial effects of this invention are:

[0031] (1) This invention emphasizes the use of interactive graffiti for face image editing. Compared with most other face image editing algorithms, this invention can more intuitively and fully express the user's editing intention through the flexible form of graffiti, and more efficiently achieve the user's expected editing effect to complete specific face image editing tasks. (2) This invention designs a dual-path multimodal interaction module to handle the deep interaction between user graffiti and image semantics, and a single-path multimodal interaction module to supplement the texture information of the editing result, so as to ensure the realism of the editing effect and the preservation of the original face feature information when processing graffiti editing. Thanks to the multimodal interaction module, even when processing large-area input graffiti, or graffiti that occludes many key parts of the face, this invention can handle the relationship between input graffiti and face pixels well, achieve a realistic editing effect that meets the user's expectations, and will not cause differences in face identity information before and after editing.

[0032] (3) Extensive qualitative and quantitative experiments demonstrate the superiority of the present invention in editing effect, editing realism, and preservation of facial identity information.

Attached Figure Description

[0033] Figure 1 is a flowchart of the proposed face image editing method.

[0034] Figure 2 shows the structure of the proposed dual-path multimodal interaction module.

[0035] Figure 3 shows the structure of the proposed single-path multimodal interaction module.

Detailed Implementation

[0036] To make the objectives, technical solutions, and advantages of this invention clearer, the invention will be further described in detail below with reference to specific embodiments and the accompanying drawings. It should be understood that these descriptions are merely exemplary and not intended to limit the scope of the invention. Furthermore, descriptions of well-known structures and technologies are omitted in the following description to avoid unnecessarily obscuring the concept of the invention; moreover, the technical features involved in the different embodiments of the invention described below can be combined with each other as long as they do not conflict with each other. The invention is described in more detail below with reference to Figures 1 to 3.

[0037] To address the issue of realistic editing results, a dual-path multimodal interaction module based on a transformer network structure (a network structure composed of a series of encoders and decoders), named the Scribble interactive editing module, is employed. First, the user-input doodle mask is converted into a query vector through an embedding layer network. Then, a pre-trained inversion encoder is used to convert the face image to be edited into a latent vector, serving as the key and value for the multimodal interaction module. This inversion encoder was proposed in "Designing an encoder for stylegan image manipulation" published in *ACM TRANSACTIONS ON GRAPHICS*. The query, key, and value in the dual-path multimodal interaction module calculate cross-attention maps from both the positional and channel dimensions. The resulting cross-attention maps and doodle vectors are then used to iteratively modify the latent vectors of the face image to perform latent space mapping, aligning the semantic content of the doodle with the target editing location in the image, and embedding the doodle into the corresponding latent space. This deep interaction between the doodle and the corresponding image results in a more realistic editing outcome.

[0038] To address the problem of preserving original facial identity features after editing, this invention employs a single-path multimodal interaction module, named the Texture Supplement Module. First, a texture mask for the editing region is obtained from the facial image based on the doodle mask. Then, this mask is converted into a query vector through an embedding layer network. The edit latent vector obtained from the Scribble interactive editing module serves as the key and value, and both are input into the single-path multimodal interaction module to calculate a cross-attention map. The resulting texture supplement vector iteratively modifies the latent vector of the facial image, performing latent space mapping to enhance the texture detail and the deep alignment of the edited image. Finally, the obtained edit offset is combined with the image latent vector, and a pre-trained generative network generates the final edited facial image. This generative network adopts the StyleGAN network structure proposed in "A style-based generator architecture for generative adversarial networks", published at the 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). In this way, even with large-area doodle input, this invention can effectively combine doodles and image semantics, achieving realistic editing effects while supplementing detailed information, resulting in good preservation of original facial identity features.

[0039] This invention is a graffiti face image editing method based on a multimodal interaction module, including a graffiti texture information processing part, a graffiti editing part, a texture supplementation part, and an editing generation part, specifically including the following steps:

[0040] Step 1: Graffiti texture information processing. For the input graffiti, calculate a mask image relative to the face image based on the graffiti. Then, based on the part of the face image area corresponding to the graffiti, obtain a texture mask containing the pixels of the edited area image from the face image.
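
The two masks in Step 1 can be sketched as follows, assuming (hypothetically) that the doodle arrives as an RGBA overlay of the same size as the face image, with non-zero alpha wherever the user drew. The function name and input format are illustrative and are not specified by this text.

```python
import numpy as np

def doodle_and_texture_masks(face_rgb, doodle_rgba):
    """face_rgb: (H, W, 3) uint8 image; doodle_rgba: (H, W, 4) uint8 overlay."""
    drawn = doodle_rgba[..., 3] > 0  # True where the user drew
    # Doodle mask: the doodle colors inside the drawn region, zero elsewhere.
    doodle_mask = np.where(drawn[..., None], doodle_rgba[..., :3], 0)
    # Texture mask: the original face pixels inside the drawn region, zero elsewhere.
    texture_mask = np.where(drawn[..., None], face_rgb, 0)
    return drawn, doodle_mask, texture_mask

face = np.full((4, 4, 3), 200, dtype=np.uint8)  # a flat gray "face"
doodle = np.zeros((4, 4, 4), dtype=np.uint8)
doodle[1:3, 1:3] = (255, 0, 0, 255)             # an opaque red 2x2 stroke
drawn, doodle_mask, texture_mask = doodle_and_texture_masks(face, doodle)
```

The texture mask keeps exactly the face pixels that the doodle covers, which is the information the texture supplementation step later draws on.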

[0041] The graffiti mask and the texture mask each pass through an embedding layer network module that extracts their features and computes their vectors in the latent space (the graffiti mask embedding and the texture mask embedding);

[0042] The input face image to be edited must be mapped into an intermediate latent space that is semantically rich and well disentangled. The face image is therefore fed into a pre-trained inversion encoder, which computes its latent vectors in that space, where n, the number of layers in the image generator network, is 18;

[0043] Step 2: Doodle editing. The face image and the user doodle undergo deep interaction to generate an offset vector that carries the preliminary editing effect; a dual-path multimodal interaction module processes them to obtain the preliminary editing result. The dual-path multimodal interaction module takes as input the latent vectors obtained by inverting the face image and the latent-space vector obtained from the graffiti mask. It relates the two inputs along both the channel dimension and the position dimension; deep interaction across these two dimensions integrates the graffiti and the image more closely. After this deep interaction, the module outputs an offset latent vector. In total, k dual-path multimodal interaction modules are used to modify the latent vectors iteratively;

[0044] Step 3: For the position dimension, the doodle mask embedding is set as the query, and the latent vector input to the i-th module is set as the key and the value. The interaction in the i-th block is written as:

[0045]-[0047] (Formulas not reproduced in this text: the query, key, and value definitions, followed by the cross-attention and feed-forward computations.)

[0048] The symbols in these formulas denote the number of attention heads and the size of the hidden space. The query, key, and value projections are learnable projection heads in the cross-attention, which do not change the feature dimensions; a further learnable parameter adjusts the weights of the contributions from the different heads; softmax is the normalized exponential function; the feed-forward network consists of fully connected layers and ReLU (rectified linear unit) activation functions, the ramp function commonly used as an activation in artificial neural networks; interaction in the position dimension lets the doodle and the image build a semantically aligned latent space mapping;
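
For orientation, the textbook form of multi-head cross-attention that matches this description is given below. The symbols are assumptions introduced here, not the patent's own notation: $e_s$ is the doodle mask embedding, $w_i$ the latent vector input to the i-th module, $h$ the number of heads, $d$ the hidden size, $W_Q$, $W_K$, $W_V$ the learnable projections, and $W_O$ the learnable weighting of the heads. $Q_j$, $K_j$, $V_j$ denote the j-th slice of each projection along the feature axis. The feed-forward update is not reconstructed, because the text does not say how it is composed.

```latex
\begin{align}
Q &= e_s W_Q, \qquad K = w_i W_K, \qquad V = w_i W_V \\
\mathrm{head}_j &= \mathrm{softmax}\!\left(\frac{Q_j K_j^{\top}}{\sqrt{d/h}}\right) V_j,
  \qquad j = 1, \dots, h \\
\mathrm{MultiHead}(Q, K, V) &= \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_h)\, W_O
\end{align}
```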

[0049] Step 4: For the channel dimension, the same key and value as in the position dimension are used, together with a new query defined in the channel dimension according to the following equation. The interaction is computed in the same way as in the formulas above.

[0050] (Formulas not reproduced in this text.)

[0051] Here, the query, key, and value projections share similar settings with those of the position dimension;

[0052] Step 5: After the graffiti and the image have been semantically aligned, the alignment results are fed into a fusion module for further refinement, which improves editing capability. The fusion module is computed as shown in the following formulas.

[0053]-[0054] (Formulas not reproduced in this text.)

[0055] Here, the two inputs are the alignment results of the multimodal interaction module in the position dimension and in the channel dimension, respectively. The formulas use an adaptive average pooling operation and a normalization operation, and one further term denotes the embedding after a linear layer. The output of the k-th dual-path multimodal interaction module is a latent vector offset containing the editing information.

[0056] Figure 1 shows the structure of the Scribble interactive editing module. Establishing a semantic mapping between the differently distributed doodle embedding and image latent vector is the key to realistic editing. This semantic alignment is achieved through a linear transformation whose parameters are learned from the cross-attention maps.
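
The fusion step names two standard operations, adaptive average pooling and normalization. The sketch below illustrates each in isolation and does not attempt to reproduce the fusion formula itself. The bin boundaries follow the usual adaptive-pooling convention, and layer normalization is an assumed choice, since the text does not say which normalization is used.

```python
import math
import numpy as np

def adaptive_avg_pool_1d(x, out_size):
    """Average x along its last axis into out_size bins that cover the whole axis."""
    length = x.shape[-1]
    bins = []
    for i in range(out_size):
        start = (i * length) // out_size
        end = math.ceil((i + 1) * length / out_size)
        bins.append(x[..., start:end].mean(axis=-1))
    return np.stack(bins, axis=-1)

def layer_norm(x, eps=1e-5):
    """Normalize each row to zero mean and (approximately) unit variance."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.arange(12, dtype=float).reshape(2, 6)
pooled = adaptive_avg_pool_1d(x, 3)  # each row of 6 values becomes 3 bin averages
normed = layer_norm(x)
```

Adaptive pooling fixes the output length regardless of the input length, which makes it a natural way to summarize one branch before combining it with the other.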

[0057] Step 6: In the texture supplementation part, a single-path multimodal interaction module imposes identity constraints in order to preserve facial identity information and to reduce the subtle identity changes that latent space mapping causes during editing. The specific design of the module is shown in Figure 2;

[0058] The single-path multimodal interaction module takes as input the output of the dual-path multimodal interaction module and the texture mask embedding. The texture mask embedding is set as the query, and the latent code offset is set as the key and the value. The query, key, and value projections share similar settings with those of the dual-path multimodal interaction module, and the multi-head cross-attention is computed as described in Step 3;

[0059] (Query, key, and value definitions; formulas not reproduced in this text.)
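
Read literally, the assignment described for the single-path module would take the following form. The symbols are assumptions introduced here, not the patent's notation: $e_t$ is the texture mask embedding, $\Delta w$ the latent offset output by the dual-path module, and $W_Q$, $W_K$, $W_V$ the learnable projections of this module.

```latex
\begin{equation}
Q = e_t W_Q, \qquad K = \Delta w\, W_K, \qquad V = \Delta w\, W_V
\end{equation}
```

The contrast with Step 3 is the source of the key and value: there they come from the image latent vector, whereas here they come from the edit offset, so the texture query attends over the edit itself.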

[0060] Step 7: In the editing and generation part, the edit vectors produced by the dual-path multimodal interaction module and the single-path multimodal interaction module are input into the pre-trained generation network to obtain the final editing result. The latent vector offset carrying supplementary texture detail, obtained from the single-path multimodal interaction module, is used to optimize the final latent vector, from which the edited image is generated. The formulas for the generation process are:

[0061]-[0063] (Formulas not reproduced in this text.)

[0064] Here, the generator is a pre-trained face image generation network, and a parameter controls the intensity of editing.

[0065] Compared to most other face image editing algorithms, this invention, through the flexible form of doodles, can more intuitively and fully express the user's editing intentions. Even when processing large areas of input doodles, or doodles that obscure many key facial features, it can effectively handle the relationship between the input doodles and facial pixels, achieving a realistic editing effect that meets the user's expectations, without causing any discrepancies in facial identity information before and after editing. Extensive qualitative and quantitative experimental analysis, along with various quantitative indicators and qualitative scoring experiments, have demonstrated the superiority of this invention in terms of editing effect, realism, and preservation of facial identity information in face image editing.

[0066] Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

Claims

1. A doodle face image editing method based on a multimodal interaction module, characterized in that it comprises graffiti texture information processing, graffiti editing, texture supplementation, and edit generation; a dual-path multimodal interaction module based on a transformer network structure is used: first, the user-input doodle mask is converted into a query vector through an embedding layer network; then, a pre-trained inversion encoder converts the face image to be edited into a latent vector, which serves as the key and value of the multimodal interaction module; from the query, key, and value, the dual-path multimodal interaction module calculates cross-attention maps in both the position dimension and the channel dimension; the obtained cross-attention maps and doodle vectors are then used to iteratively modify the latent vector of the face image to perform latent space mapping, thereby aligning the semantic content of the doodle with the target editing position in the image and embedding the doodle into the corresponding latent space; the original facial identity features are preserved after editing through a single-path multimodal interaction module: first, the texture mask of the editing area is obtained from the face image based on the graffiti mask and converted into a query vector through an embedding layer network; the edit latent vector obtained from the dual-path multimodal interaction module is used as the key and value, and both are input into the single-path multimodal interaction module to calculate a cross-attention map; the resulting texture supplement vector iteratively modifies the latent vector of the face image to perform latent space mapping, so as to enhance the texture detail and the deep alignment of the edited image; finally, the obtained edit offset is combined with the image latent vector, and the final face image editing result is generated through a pre-trained generative network.

2. The doodle face image editing method based on a multimodal interaction module according to claim 1, characterized in that it comprises the following steps:
Step 1: Graffiti texture information processing. For the input graffiti, a mask image relative to the face image is calculated from the graffiti; then, based on the part of the face image area corresponding to the graffiti, a texture mask containing the pixels of the edited area is obtained from the face image; the graffiti mask and the texture mask each pass through an embedding layer network module that extracts their features and computes their vectors in the latent space (the graffiti mask embedding and the texture mask embedding); the input face image to be edited must be mapped into an intermediate latent space that is semantically rich and well disentangled, so the face image is fed into a pre-trained inversion encoder, which computes its latent vectors in that space, where n, the number of layers in the image generator network, is 18;
Step 2: Doodle editing. The face image and the user doodle undergo deep interaction to generate an offset vector that carries the preliminary editing effect, and a dual-path multimodal interaction module processes them to obtain the preliminary editing result; the dual-path multimodal interaction module takes as input the latent vectors obtained by inverting the face image and the latent-space vector obtained from the graffiti mask; it relates the two inputs along both the channel dimension and the position dimension, and deep interaction across these two dimensions integrates the graffiti and the image more closely; after this deep interaction, the module outputs an offset latent vector; k dual-path multimodal interaction modules are used to modify the latent vectors iteratively;
Step 3: For the position dimension, the doodle mask embedding is set as the query, and the latent vector input to the i-th module is set as the key and the value; the interaction in the i-th block is written as follows (formulas not reproduced in this text: the query, key, and value definitions, followed by the cross-attention and feed-forward computations); the symbols in these formulas denote the number of attention heads and the size of the hidden space; the query, key, and value projections are learnable projection heads in the cross-attention, which do not change the feature dimensions; a further learnable parameter adjusts the weights of the contributions from the different heads; softmax is the normalized exponential function; the feed-forward network consists of fully connected layers and ReLU activation functions; interaction in the position dimension lets the doodle and the image build a semantically aligned latent space mapping;
Step 4: For the channel dimension, the same key and value as in the position dimension are used, together with a new query defined in the channel dimension according to the following equation (formulas not reproduced in this text); the interaction is computed in the same way as in the formulas above, and the query, key, and value projections share similar settings with those of the position dimension;
Step 5: After the graffiti and the image have been semantically aligned, the alignment results are fed into a fusion module for further refinement, which improves editing capability; the fusion module is computed as shown in the following formulas (formulas not reproduced in this text), whose two inputs are the alignment results of the multimodal interaction module in the position dimension and in the channel dimension, respectively; the formulas use an adaptive average pooling operation and a normalization operation, and one further term denotes the embedding after a linear layer; the output of the k-th dual-path multimodal interaction module is a latent vector offset containing the editing information; in the dual-path multimodal interaction module, establishing a semantic mapping between the differently distributed doodle embedding and image latent vector is the key to realistic editing; this semantic alignment is achieved through a linear transformation whose parameters are learned from the cross-attention maps;
Step 6: In the texture supplementation part, a single-path multimodal interaction module imposes identity constraints in order to preserve facial identity information and to reduce the subtle identity changes that latent space mapping causes during editing; the single-path multimodal interaction module takes as input the output of the dual-path multimodal interaction module and the texture mask embedding; the texture mask embedding is set as the query, and the latent code offset is set as the key and the value (formulas not reproduced in this text); the query, key, and value projections share similar settings with those of the dual-path multimodal interaction module, and the multi-head cross-attention is computed as described in Step 3;
Step 7: In the editing and generation part, the edit vectors produced by the dual-path multimodal interaction module and the single-path multimodal interaction module are input into the pre-trained generation network to obtain the final editing result; the latent vector offset carrying supplementary texture detail, obtained from the single-path multimodal interaction module, is used to optimize the final latent vector, from which the edited image is generated; the formulas for the generation process (not reproduced in this text) involve a pre-trained face image generation network and a parameter that controls the intensity of editing.