A text-to-image generation method and system with concept removal based on fine-tuning the decoder of a latent space variational autoencoder

By implementing concept removal in the VAE decoding path of the text-to-image generation model and only fine-tuning the decoder, the problems of concept representation entanglement, architecture dependency, and adversarial attacks in the prior art are solved, achieving efficient and cross-model compatible concept removal results.

CN122223162A · Pending · Publication Date: 2026-06-16 · PEKING UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
PEKING UNIV
Filing Date
2026-02-12
Publication Date
2026-06-16

AI Technical Summary

Technical Problem

Existing text-to-image generation models pose risks of generating deepfake content, privacy breaches, and copyright infringement. Furthermore, existing machine unlearning methods suffer from entangled concept representations, strong architectural dependence, poor transferability, vulnerability to adversarial attacks, and high cost.

Method used

The method fine-tunes the decoder of a latent space variational autoencoder (VAE). By constructing a concept-driven reconstruction dataset, freezing the encoder parameters, and fine-tuning only the decoder, it achieves irreversible removal of the target concept. Because concept removal is implemented in the VAE decoding path, the resulting decoder supports plug-and-play reuse across models.

Benefits of technology

It achieves adversarial robustness independent of prompt words, cross-model transferability, and lightweight concept removal that does not rely on the original training data. It reduces the success rate of adversarial attacks, improves the accuracy of concept removal and generation quality, and supports the simultaneous removal of multiple concepts.



Abstract

The application discloses a text-to-image generation method and system based on concept removal by fine-tuning the decoder of a latent space variational autoencoder (VAE). It is the first to propose implementing concept removal in the VAE decoding path, realizing robust forgetting that is independent of the prompt. The method comprises the following steps: determining a target concept to be removed; constructing a concept-driven reconstruction dataset comprising a source image containing the target concept, a transformed image, and a regularized image; freezing the parameters of the text encoder, the denoising network, and the encoder in the variational autoencoder; fine-tuning the decoder in the variational autoencoder with the concept-driven reconstruction dataset to obtain a concept-removed decoder; deploying the concept-removed decoder in a text-to-image diffusion model; and generating an image with the text-to-image diffusion model based on an input natural language prompt. The application offers prompt-independent adversarial robustness, cross-model plug-and-play transferability, lightweight training and deployment, and accurate selective forgetting.

Description

Technical Field

[0001] This invention relates to the fields of generative artificial intelligence and data security technology, specifically to a text-to-image generation method and system based on concept removal by fine-tuning the decoder of a latent space variational autoencoder, falling within the technical scope of deep learning model security governance and copyright / privacy compliance. This invention is the first to propose a technical framework for implementing concept removal in the decoding path of a variational autoencoder (VAE), achieving robust forgetting independent of prompts and plug-and-play transfer across models.

Background Technology

[0002] Text-to-image diffusion models can generate realistic images based on natural language prompts and are widely used in content creation, advertising design, and digital media production. However, as the model's capabilities and usability improve, it may also be used to generate deepfake content, impersonate specific individuals, generate copyrighted stylized works, or other inappropriate content, leading to problems such as privacy breaches, defamation, copyright infringement, and a decline in the credibility of online content.

[0003] To mitigate the aforementioned risks, existing technologies typically employ machine unlearning methods to "remove concepts" from diffusion models. These methods primarily intervene in text-conditional guidance paths, including fine-tuning the cross-attention layer or text encoder to suppress target concepts, or introducing guidance / suppression strategies during the inference phase to temporarily control the generated results. However, these approaches suffer from at least the following shortcomings: (1) Highly entangled conceptual representations: In the latent space of the diffusion model, the semantic representations of different concepts are highly intertwined. Targeted suppression of the target concept can easily lead to a significant decrease in the image quality and diversity of non-target content, resulting in the phenomenon of "catastrophic forgetting".

[0004] (2) Strong architecture dependency and poor portability: Existing methods depend heavily on the structure and parameter locations of specific denoising networks (such as UNet). When users fine-tune open-source diffusion models (such as Stable Diffusion) for personalized or domain-specific needs, existing methods must be reapplied to each modified model instance or incur additional computational overhead, making them difficult to migrate and reuse across different downstream models or fine-tuned models of the same series.

[0005] (3) Vulnerable to reactivation by adversarial attacks: Existing methods mainly suppress "textual conditional activation" but do not achieve irreversible semantic removal at the underlying level of the generation chain. Studies have shown that adversarial prompts, synonym rewriting, token replacement, or CLIP-based concept inversion attacks can reliably recover erased concepts, including complex concepts such as celebrity identities, indicating that existing forgetting methods are only partial and reversible.

[0006] (4) High cost: Some methods require access to the original training data or large-scale retraining, which is not conducive to rapid compliance processing.

[0007] In typical latent diffusion models, the variational autoencoder (VAE) is responsible for the transformation between the pixel space and the latent space. The decoder, located at the end of the generation chain, maps the denoised latent variables back to the pixel image. Since a VAE is typically shared by multiple downstream denoising networks, and the decoding process is inherently independent of text conditions, implementing concept removal in the decoding path could achieve both prompt-independent adversarial robustness and cross-model transferability. However, research and engineering solutions in this area remain lacking, and the VAE decoder, a "blind spot" in forgetting research, has not been fully explored.

Summary of the Invention

[0008] To address the aforementioned problems, the present invention aims to provide a text-to-image generation method and system based on latent space variational autoencoder decoder fine-tuning concept removal.

[0009] The technical solution adopted in this invention is as follows: A text-to-image generation method based on latent space variational autoencoder decoder fine-tuning for concept removal includes the following steps: determining the target concept to be removed; constructing a concept-driven reconstruction dataset containing source images that include the target concept, transformed images, and regularized images; freezing the parameters of the text encoder, the denoising network, and the encoder in the variational autoencoder; fine-tuning the decoder in the variational autoencoder using the concept-driven reconstruction dataset to obtain the concept-removed decoder; deploying the concept-removed decoder in the text-to-image diffusion model; and generating an image with the text-to-image diffusion model based on the input natural language prompt.

[0010] Furthermore, the target concept is at least one of the following: identity concept, object concept, and style concept.

[0011] Furthermore, the transformed image is generated using a structure-preserving concept editing method, wherein the source image and the transformed image are strictly aligned in spatial structure and differ only in the presence of the target concept; the regularized image is an image that is semantically independent of the target concept.

[0012] Furthermore, when the target concept is an identity concept, the transformed image is an image in which the target person's face in the source image is replaced with an ordinary face using the Prompt-to-Prompt attention editing method, while keeping the background, pose, and spatial layout unchanged.

[0013] Furthermore, when the target concept is a style-related concept, the transformed image is an image that uses the MAST manifold alignment style transfer method to convert the target artistic style of the source image into a neutral realistic style while keeping the image content and geometric structure unchanged.

[0014] Furthermore, the sharpness-aware minimization optimization strategy is used to fine-tune the decoder. The sharpness-aware minimization optimization strategy constrains parameter updates by seeking a flat minimum region on the loss surface. The fine-tuning includes latent encoding of the source image and the regularized image, decoding of the latent variables, calculating a joint loss function containing a concept removal loss and a fidelity-preserving regularization loss, and updating the decoder parameters until the iteration limit or the convergence condition is met, thus obtaining the concept-removed decoder.

[0015] Furthermore, the step of deploying the concept-removed decoder in the text-to-image diffusion model includes: deploying the concept-removed decoder as a plug-and-play component to multiple downstream diffusion models based on the same variational autoencoder architecture, thereby achieving multiple model reuse after one training.

[0016] A text-to-image generation system based on latent space variational autoencoder decoder fine-tuning for concept removal, comprising: a concept determination module, used to determine the target concept to be removed; a data construction module, used to build a concept-driven reconstruction dataset containing source images that include the target concept, transformed images, and regularized images; a parameter freezing module, used to freeze the parameters of the text encoder, the denoising network, and the encoder in the variational autoencoder; a decoder fine-tuning module, used to fine-tune the decoder in the variational autoencoder using the concept-driven reconstruction dataset to obtain the concept-removed decoder; a deployment module, used to deploy the concept-removed decoder in the text-to-image diffusion model; and an image generation module, used to generate images with the text-to-image diffusion model based on input natural language prompts.

[0017] The present invention also provides a computer device including a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program including instructions for performing the methods described above.

[0018] The present invention also provides a computer-readable storage medium storing a computer program that, when executed by a computer, implements the above-described method.

[0019] This invention proposes for the first time a technical framework that implements concept removal in the VAE decoding path (EraseVAE). By lightly fine-tuning only the decoder, it achieves irreversible removal of target concepts without accessing the original training data of the diffusion model or retraining the denoising network and text encoder, while preserving the reconstruction fidelity of non-target content and the generation quality as far as possible. This invention is the first to move concept forgetting from traditional text-conditional paths (such as UNet cross-attention layers and text encoders) to the VAE decoder. Since the VAE decoding process is completely independent of text embeddings, this invention achieves prompt-independent concept removal at a fundamental level. This orthogonal design renders adversarial prompt attacks, synonym rewriting, token replacement, and other attacks targeting text-conditional paths ineffective, eliminating the main attack surface of existing methods. Experiments show that this invention reduces the success rate of adversarial concept inversion attacks by more than 60%.

[0020] The specific beneficial effects of this invention are as follows:
(1) Adversarial robustness independent of prompts: This invention performs concept removal in the VAE decoding path, and the decoding process is completely independent of text conditions. Therefore, adversarial prompts, synonym rewriting, token replacement, or CLIP-guided concept inversion attacks can hardly reactivate the removed concepts. Experiments show that this invention reduces the success rate of adversarial attacks by more than 60%.
(2) Cross-model plug-and-play transferability: VAEs are usually shared between diffusion models of the same series and their downstream fine-tuned models. This invention fine-tunes only the decoder, which achieves "remove once, reuse across multiple models", avoiding repeated training for each downstream model and fundamentally solving the strong architecture dependency of existing methods.
(3) No dependence on original training data: Controlled supervised learning in the latent space is achieved by constructing a concept-driven reconstruction dataset, and compliance processing can be completed without accessing the original training data.
(4) Selective forgetting and non-target fidelity: Through triplet data construction and regularization loss design, the target concept is removed accurately and locally, while the generation quality and diversity of non-target concepts are basically unaffected. The accuracy of concept removal is improved by about 35%, and the FID is improved by about 30%.
(5) Lightweight engineering cost: Only the decoder parameters are updated, and both training and deployment are lightweight. The inference speed is basically unaffected and is comparable to the original inference time of Stable Diffusion.
(6) Simultaneous removal of multiple concepts: This invention supports the simultaneous removal of multiple types of target concepts in a single fine-tuning process, and the removal effect of each target concept is comparable to that of individual removal.

Attached Figure Description

[0021] Figure 1 is a flowchart of the text-to-image generation method based on latent space variational autoencoder decoder fine-tuning for concept removal of the present invention. Figure 2 is a block diagram of the corresponding text-to-image generation system of the present invention.

Detailed Implementation

[0022] The present invention will now be described in further detail with reference to the accompanying drawings. The examples given are only for explaining the present invention and are not intended to limit the scope of the present invention.

[0023] This invention discloses a text-to-image generation method based on latent space variational autoencoder decoder fine-tuning for concept removal. The method implements concept removal in the VAE decoding path to achieve prompt-independent robust forgetting. The diffusion model includes at least: a text encoder, a denoising network, and a variational autoencoder (VAE); the VAE includes an encoder $E$ and a decoder $D$. As shown in Figure 1, the method includes the following steps: Step S11: Determine the target concept to be removed; Step S12: Construct a concept-driven reconstruction dataset containing source images that include the target concept, transformed images, and regularized images; Step S13: Freeze the parameters of the text encoder, the denoising network, and the encoder in the variational autoencoder, and fine-tune the decoder in the variational autoencoder using the concept-driven reconstruction dataset to obtain the concept-removed decoder; Step S14: Deploy the concept-removed decoder in the text-to-image diffusion model; Step S15: Generate an image with the text-to-image diffusion model based on the input natural language prompt.

[0024] In one embodiment, the target concept in step S11 can be an identity concept, an object concept, a style concept, or any combination thereof. This invention supports the simultaneous removal of multiple concepts.

[0025] In one embodiment, step S12 constructs a concept-driven reconstruction dataset for concept removal training. Each training sample is a triple $(x_s, x_t, x_r)$. The construction process of the concept-driven reconstruction dataset includes the following steps: S121: Obtain source images $x_s$ containing the target concept $c$; S122: Based on $x_s$, generate transformed images $x_t$ using a structure-preserving concept editing method, where $x_s$ and $x_t$ are strictly aligned in spatial structure and differ only in the presence of the target concept (everything outside the target concept stays consistent). For identity / object concepts, the Prompt-to-Prompt attention editing method is used to achieve local concept replacement; for style concepts, the MAST (Manifold Alignment for Semantically Aligned Style Transfer) manifold alignment style transfer method is used to remove style attributes while maintaining geometric and semantic integrity. Specifically, when the target concept $c$ is an identity concept, the transformed image $x_t$ is an image in which the target person's face in the source image is replaced with an ordinary face using the Prompt-to-Prompt attention editing method, while the background, pose, and spatial layout are kept unchanged; when the target concept $c$ is a style concept, the transformed image $x_t$ is an image in which the target artistic style of the source image is converted into a neutral realistic style using the MAST manifold alignment style transfer method, while the image content and geometric structure are preserved.

[0026] S123: Obtain regularized images $x_r$ that are semantically independent of the target concept. They are used to constrain the reconstruction fidelity of non-target content and prevent catastrophic forgetting.
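The triplet structure built in steps S121 to S123 can be sketched as a small data container. This is a minimal illustration, assuming toy images stored as nested lists; the names `Triplet`, `shape`, and `build_dataset` are hypothetical, and the shape check is only a coarse proxy for the strict spatial alignment that the editing method itself has to guarantee.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Image = List[List[float]]  # toy grayscale image: rows of pixel values


@dataclass(frozen=True)
class Triplet:
    source: Image          # contains the target concept
    transformed: Image     # same layout, target concept edited out
    regularized: Image     # semantically unrelated to the target concept


def shape(img: Image) -> Tuple[int, int]:
    """Return (rows, columns) of a toy image."""
    return (len(img), len(img[0]) if img else 0)


def build_dataset(sources: Sequence[Image],
                  transformed: Sequence[Image],
                  regularized: Sequence[Image]) -> List[Triplet]:
    """Pair images into triplets, rejecting misaligned source/transformed pairs."""
    if not (len(sources) == len(transformed) == len(regularized)):
        raise ValueError("the three image lists must have equal length")
    dataset = []
    for s, t, r in zip(sources, transformed, regularized):
        # Only the shape is checked here; pixel-level structural alignment
        # is the responsibility of the concept editing method.
        if shape(s) != shape(t):
            raise ValueError("source and transformed images must share a shape")
        dataset.append(Triplet(s, t, r))
    return dataset
```

In a real pipeline the three lists would hold image tensors or file paths rather than nested lists; the pairing logic stays the same.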

[0027] In one embodiment, step S13 freezes all parameters except those of the decoder, i.e., the parameters of the encoder $E$, the text encoder, and the denoising network remain unchanged; only the decoder $D$ is set to trainable, giving the parameters to be optimized $\theta$, which are initialized to the pretrained decoder parameters $\theta_0$. This design achieves complete decoupling from text-conditional paths, forming the architectural foundation for the invention's prompt-agnostic robustness.
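The freezing rule of step S13 can be illustrated with a framework-free sketch. The `Component` class and the component keys below are hypothetical stand-ins for real network modules; the only point shown is that exactly one component, the VAE decoder, remains trainable.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Component:
    name: str
    params: List[float] = field(default_factory=list)
    trainable: bool = True


def freeze_all_but_decoder(model: Dict[str, Component]) -> List[str]:
    """Mark only the VAE decoder as trainable and return the trainable keys."""
    for key, comp in model.items():
        comp.trainable = (key == "vae_decoder")
    return [key for key, comp in model.items() if comp.trainable]


# Hypothetical component layout of a latent diffusion model.
model = {
    "text_encoder": Component("text_encoder", [0.1, 0.2]),
    "denoiser": Component("denoiser", [0.3]),
    "vae_encoder": Component("vae_encoder", [0.4]),
    "vae_decoder": Component("vae_decoder", [0.5, 0.6]),
}
trainable = freeze_all_but_decoder(model)
print(trainable)  # ['vae_decoder']
```

In PyTorch the same effect is usually obtained by calling `requires_grad_(False)` on the frozen modules and passing only the decoder's parameters to the optimizer.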

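[0028] In one embodiment, step S13 employs a sharpness-aware minimization (SAM) optimization strategy to fine-tune the decoder to achieve concept removal. The training process includes the following steps:
S131: Perform latent encoding on the source image $x_s$ and the regularized image $x_r$ with the frozen encoder $E$ to obtain latent variables: Formula (1): $z_s = E(x_s)$, $z_r = E(x_r)$;
S132: Decode the latent variables using the current decoder $D_\theta$ to obtain the reconstructed images: Formula (2): $\hat{x}_s = D_\theta(z_s)$, $\hat{x}_r = D_\theta(z_r)$;
S133: Calculate the concept removal loss $\mathcal{L}_{\text{erase}}$ and the fidelity-preserving loss $\mathcal{L}_{\text{reg}}$, and construct the total loss $\mathcal{L}_{\text{total}}$: Formula (3): $\mathcal{L}_{\text{erase}} = \|\hat{x}_s - x_t\|_1 + \mathrm{LPIPS}(\hat{x}_s, x_t)$, $\mathcal{L}_{\text{reg}} = \|\hat{x}_r - x_r\|_1 + \mathrm{LPIPS}(\hat{x}_r, x_r)$, $\mathcal{L}_{\text{total}} = \lambda_1 \mathcal{L}_{\text{erase}} + \lambda_2 \mathcal{L}_{\text{reg}}$, where $x_t$ is the transformed image, $\|\cdot\|_1$ represents the pixel-level L1 distance, LPIPS (Learned Perceptual Image Patch Similarity) is a deep-feature-based perceptual similarity metric that captures high-level semantic differences at the level of human visual perception, and $\lambda_1$, $\lambda_2$ are the weighting coefficients, each set to 0.5 by default.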

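[0029] S134: Based on the total loss $\mathcal{L}_{\text{total}}$, parameter updates are performed using the Sharpness-Aware Minimization (SAM) strategy. SAM seeks a flat minimum region by solving the following minimax optimization problem: Formula (4): $\min_{\theta} \max_{\|\epsilon\|_2 \le \rho} \mathcal{L}_{\text{total}}(\theta + \epsilon)$, where $\rho$ is the radius of the adversarial perturbation neighborhood (referred to as the perturbation radius). The specific update steps include: (a) Calculate the gradient: $g = \nabla_\theta \mathcal{L}_{\text{total}}(\theta)$; (b) Calculate the adversarial perturbation: $\hat{\epsilon} = \rho \, g / \|g\|_2$; (c) Construct the perturbed parameters: $\theta' = \theta + \hat{\epsilon}$; (d) Calculate the gradient at the perturbed parameters and update the original parameters: $\theta \leftarrow \theta - \eta \, \nabla_\theta \mathcal{L}_{\text{total}}(\theta)\big|_{\theta = \theta'}$, where $\eta$ is the learning rate.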

[0030] The SAM strategy, by penalizing sharp minima, enables the optimizer to find solutions that remain stable to small parameter perturbations, thus improving the generalization ability and robustness of concept removal. Research shows that a flat loss surface is associated with better generalization performance. This selective regularization is particularly beneficial in settings where only the decoder is trained, effectively preventing overfitting that could lead to a decrease in visual fidelity for unseen images.

[0031] S135: Repeat steps S131 to S134 until the iteration limit or the convergence condition is met, obtaining the concept-removed decoder $D_{\theta^*}$.
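Steps S131 to S135 can be walked through on a toy problem. The sketch below rests on several stated assumptions and is not the patent's implementation: images are four-pixel lists, the frozen encoder is the identity, the decoder is one trainable gain per pixel, mean squared error stands in for LPIPS (which needs a pretrained network), gradients are taken by finite differences, and the joint loss uses an assumed weighted sum of a removal term and a fidelity term with both weights at 0.5. All function names are hypothetical.

```python
import math
from typing import Callable, List

Vec = List[float]


def encode(x: Vec) -> Vec:
    """Frozen toy encoder: identity map (stand-in for the VAE encoder)."""
    return list(x)


def decode(theta: Vec, z: Vec) -> Vec:
    """Toy decoder: one trainable gain per pixel (stand-in for the VAE decoder)."""
    return [t * v for t, v in zip(theta, z)]


def l1(a: Vec, b: Vec) -> float:
    return sum(abs(p - q) for p, q in zip(a, b)) / len(a)


def perceptual(a: Vec, b: Vec) -> float:
    """Stand-in for LPIPS: mean squared error, NOT the real metric."""
    return sum((p - q) ** 2 for p, q in zip(a, b)) / len(a)


def total_loss(theta: Vec, x_src: Vec, x_tgt: Vec, x_reg: Vec,
               w_erase: float = 0.5, w_keep: float = 0.5) -> float:
    rec_src = decode(theta, encode(x_src))   # S131 + S132, source image
    rec_reg = decode(theta, encode(x_reg))   # S131 + S132, regularized image
    erase = l1(rec_src, x_tgt) + perceptual(rec_src, x_tgt)
    keep = l1(rec_reg, x_reg) + perceptual(rec_reg, x_reg)
    return w_erase * erase + w_keep * keep   # S133 (assumed form)


def numeric_grad(f: Callable[[Vec], float], theta: Vec, h: float = 1e-5) -> Vec:
    """Central finite-difference gradient, adequate for this toy problem."""
    grad = []
    for i in range(len(theta)):
        up = theta[:i] + [theta[i] + h] + theta[i + 1:]
        down = theta[:i] + [theta[i] - h] + theta[i + 1:]
        grad.append((f(up) - f(down)) / (2 * h))
    return grad


def sam_step(f: Callable[[Vec], float], theta: Vec,
             lr: float = 0.1, rho: float = 0.05) -> Vec:
    g = numeric_grad(f, theta)                                         # (a)
    norm = math.sqrt(sum(v * v for v in g))
    eps = [rho * v / norm for v in g] if norm > 0 else [0.0] * len(g)  # (b)
    perturbed = [t + e for t, e in zip(theta, eps)]                    # (c)
    g_sam = numeric_grad(f, perturbed)                                 # (d)
    return [t - lr * v for t, v in zip(theta, g_sam)]


x_src = [1.0, 1.0, 0.0, 0.0]   # second pixel plays the role of the target concept
x_tgt = [1.0, 0.0, 0.0, 0.0]   # same layout with the concept removed
x_reg = [0.0, 0.0, 1.0, 1.0]   # unrelated content that must be preserved


def loss_fn(th: Vec) -> float:
    return total_loss(th, x_src, x_tgt, x_reg)


theta = [1.0, 1.0, 1.0, 1.0]   # start from the "pretrained" decoder
initial = loss_fn(theta)
for _ in range(200):           # S135: fixed iteration budget
    theta = sam_step(loss_fn, theta)
final = loss_fn(theta)
```

After training, the gain on the "concept" pixel is driven towards zero while the gains used by the regularized image stay close to one, which mirrors the intended split between removal and fidelity. In practice the gradients would come from automatic differentiation and the perceptual term from a real LPIPS model.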

[0032] In one embodiment, the method for deploying the concept-removed decoder in step S14 is as follows: during the inference phase, the original diffusion model's text encoder and denoising network remain unchanged, and only the original decoder $D$ is replaced with the concept-removed decoder $D_{\theta^*}$.

[0033] In one embodiment, step S15 inputs the natural language prompt into the text encoder for text encoding, and then inputs the encoded result into the denoising network for denoising. When the denoising network outputs the latent variable $z_0$, the concept-removed decoder $D_{\theta^*}$ decodes it into the final image, thereby suppressing the reconstruction and manifestation of the target concept at the final pixel-generation stage.
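The deployment and generation flow of steps S14 and S15 can be sketched with plain callables. Everything here is a toy assumption: the `Pipeline` class, the lambda stand-ins for the text encoder and denoising network, and the two decoders are hypothetical. The sketch shows only the structural point that the decoder receives the latent and never the prompt, and that swapping it leaves the other components untouched.

```python
from dataclasses import dataclass, replace
from typing import Callable, List

Latent = List[float]
Image = List[float]


@dataclass(frozen=True)
class Pipeline:
    encode_text: Callable[[str], List[float]]
    denoise: Callable[[List[float]], Latent]
    decode: Callable[[Latent], Image]

    def generate(self, prompt: str) -> Image:
        cond = self.encode_text(prompt)   # text conditioning
        z0 = self.denoise(cond)           # denoised latent
        return self.decode(z0)            # the decoder never sees the prompt


def original_decode(z: Latent) -> Image:
    return list(z)


def erased_decode(z: Latent) -> Image:
    # Toy "concept removal": the second channel is always suppressed.
    return [z[0], 0.0]


base = Pipeline(
    encode_text=lambda p: [float(len(p))],
    denoise=lambda c: [c[0], 1.0],
    decode=original_decode,
)
# S14: replace only the decoder; text encoder and denoiser are reused as-is.
safe = replace(base, decode=erased_decode)

print(base.generate("a cat"))  # [5.0, 1.0]
print(safe.generate("a cat"))  # [5.0, 0.0]
```

Because `erased_decode` takes only the latent, no change to the prompt can alter what it suppresses; an attack would have to change the latent itself.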

[0034] The method of the present invention has adversarial robustness independent of prompt words. When an adversary attempts to reactivate a removed concept through adversarial prompt words, synonym rewriting, token replacement, or CLIP-guided optimization, the removed concept cannot be reactivated because the concept removal occurs in a VAE decoding path independent of text conditions.

[0035] The method of this invention supports the simultaneous removal of multiple concepts. By including training samples with multiple different target concepts in the same triplet dataset, multiple types of target concepts such as identity, object and / or style can be removed simultaneously in a single fine-tuning process, and the removal effect of each target concept is comparable to that of removing them individually.
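Combining several target concepts into one training set, as described above, amounts to merging the per-concept triplets before fine-tuning. The following sketch assumes triplets are referenced by string identifiers; `merge_concepts` and all identifiers are hypothetical.

```python
import random
from collections import Counter
from typing import Dict, List, Tuple

Triplet = Tuple[str, str, str]        # (source, transformed, regularized) image ids
Labeled = Tuple[str, str, str, str]   # concept name followed by the triplet


def merge_concepts(per_concept: Dict[str, List[Triplet]],
                   seed: int = 0) -> List[Labeled]:
    """Flatten per-concept triplets into one shuffled training list."""
    merged = [(concept, *triplet)
              for concept, triplets in per_concept.items()
              for triplet in triplets]
    random.Random(seed).shuffle(merged)   # mix concepts within each epoch
    return merged


per_concept = {
    "identity": [("id_src_0", "id_edit_0", "reg_0"),
                 ("id_src_1", "id_edit_1", "reg_1")],
    "object": [("obj_src_0", "obj_edit_0", "reg_2")],
    "style": [("sty_src_0", "sty_edit_0", "reg_3")],
}
dataset = merge_concepts(per_concept)
counts = Counter(sample[0] for sample in dataset)
```

The concept label is kept only for bookkeeping, for example to balance sampling; the fine-tuning loss itself needs just the three images.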

[0036] This invention enables plug-and-play deployment across models. Because the VAE is shared within the Stable Diffusion model family, the concept-removed decoder $D_{\theta^*}$ can be used as a plug-and-play module to directly replace the original decoder in any diffusion model based on the same VAE architecture, including but not limited to Stable Diffusion v1.4, v2.0, v2.1 and various UNet fine-tuned variants, achieving consistent concept removal across models without training or adapting each downstream model separately.
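A deployment script would typically guard the swap with a compatibility check, since the decoder is only reusable where the VAE architecture matches. The sketch below is an assumption about how such a guard might look: `VaeSpec`, its two fields, the model names, and the numeric values are placeholders, not properties of any specific released checkpoint.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class VaeSpec:
    latent_channels: int
    scaling_factor: float


@dataclass(frozen=True)
class Model:
    name: str
    vae: VaeSpec
    decoder_id: str   # identifies which decoder weights are loaded


def swap_decoder(models: Sequence[Model], erased_spec: VaeSpec,
                 erased_id: str) -> Tuple[List[Model], List[str]]:
    """Swap in the concept-removed decoder wherever the VAE spec matches."""
    swapped, skipped = [], []
    for m in models:
        if m.vae == erased_spec:
            swapped.append(replace(m, decoder_id=erased_id))
        else:
            skipped.append(m.name)   # incompatible VAE: leave untouched
    return swapped, skipped


shared = VaeSpec(latent_channels=4, scaling_factor=0.2)   # placeholder values
models = [
    Model("model_a", shared, "original"),
    Model("model_b_finetuned_unet", shared, "original"),
    Model("model_c_other_vae", VaeSpec(16, 1.0), "original"),
]
swapped, skipped = swap_decoder(models, shared, "concept_removed")
```

One fine-tuned decoder is thus attached to every compatible model in a single pass, and models with a different VAE are reported rather than silently modified.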

[0037] This invention implements a concept removal method (EraseVAE) for text-to-image diffusion models based on latent space variational autoencoder decoder fine-tuning. The experiments used Python as the programming language and Stable Diffusion v1.4 as the base model. All experiments were conducted on a single NVIDIA RTX 3090 GPU to ensure fair baseline comparisons and realistic training / inference cost estimates on widely available hardware.

[0038] Application Scenario 1: Celebrity Identity Removal. An AI image generation platform needs to prevent users from generating fake images of specific celebrities to avoid portrait rights infringement. After determining the target identity concept to be removed, the platform operator uses the Prompt-to-Prompt method to construct triplet training data for that identity. The source image contains the facial features of the target person, and the transformed image replaces the face with an ordinary face while keeping the background, pose, and spatial layout unchanged. Fine-tuning the VAE decoder with the SAM optimization strategy yields the concept-removed decoder. After deployment, when a user inputs a prompt containing the celebrity's name, the generated image does not present the person's real facial features, thus achieving identity privacy protection. Even if an adversary attempts a CLIP-guided concept inversion attack, the success rate is reduced by more than 60% because concept removal occurs in the VAE decoding path, which is independent of text conditions. This decoder can be directly reused as a plug-and-play module for multiple downstream models on the platform that are based on the same VAE (such as different fine-tuned versions), without training each model separately.

[0039] Application Scenario 2: Removal of Copyrighted Artistic Styles. A commercial image generation service needs to avoid generating works in a specific artist's style to reduce copyright risk. The operator constructs triplet data for this artistic style using the MAST manifold alignment method. The source image exhibits the target artistic style, while the transformed image converts the style to a neutral realistic style while preserving the image content and geometric structure. After decoder fine-tuning, when a user requests an image in that artist's style, the output no longer displays the visual characteristics of that specific style, while the generation capability for other artistic styles remains unaffected. Experiments show that this invention reduces the Top-10 classification accuracy from 79.5% to 39.5% in the Van Gogh style removal task, while maintaining or even improving the generation accuracy for other artist styles such as Monet and Picasso compared to the original model.
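For reference, the reported change in Top-10 classification accuracy can be expressed both as an absolute and as a relative reduction. The short calculation below uses only the two accuracies quoted in this scenario:

```latex
\begin{aligned}
\text{absolute reduction} &= 79.5\% - 39.5\% = 40.0 \text{ percentage points},\\
\text{relative reduction} &= \frac{79.5 - 39.5}{79.5} \approx 50.3\%.
\end{aligned}
```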

[0040] Application Scenario 3: Simultaneous Removal of Multiple Concepts. In real-world scenarios, compliance requirements often necessitate the simultaneous removal of multiple target concepts. This invention supports the simultaneous removal of multiple concepts, including identity-related (e.g., specific celebrities), object-related (e.g., specific animals), and style-related (e.g., specific artist styles), during a single fine-tuning process. Experiments show that when simultaneously removing the concepts "Elon Musk" (identity), "Cat" (object), and "Claude Monet" (style), the removal effect of each target concept is comparable to that of removing them individually, and the detection metrics for non-target concepts are very close to the results of single-concept forgetting experiments, demonstrating the scalability of this invention in multi-concept scenarios.

[0041] Application Scenario 4: Cross-model compatibility deployment. The key advantage of this invention lies in its modular architecture. Since the VAE is shared among diffusion models, operations at the VAE level ensure compatibility with existing UNet forgetting methods and achieve seamless transfer across different UNets. In the experiment, the Monet style was first removed from the base Stable Diffusion VAE, and then this "Monet-unlearned" VAE was directly integrated into three customized UNets (ESD, FMN, SLD), which had forgotten Elon Musk's face through their respective UNet intervention methods. All UNet components remained fixed without any additional training. The results showed that after replacing with the Monet-unlearned VAE, the suppression effect of the Musk concept remained unchanged, while effectively inducing the forgetting of the Monet style. This demonstrates two key features of this invention: (1) compatibility - VAE forgetting and UNet forgetting can work seamlessly together; (2) transferability - a single modified VAE can propagate its forgetting effect to different UNet checkpoints.

[0042] Another embodiment of the present invention provides a text-to-image generation system based on latent space variational autoencoder decoder fine-tuning for concept removal. As shown in Figure 2, the system 20 includes: a concept determination module 21, used to determine the target concept to be removed; a data construction module 22, used to construct a concept-driven reconstruction dataset containing source images that include the target concept, transformed images, and regularized images; a parameter freezing module 23, used to freeze the parameters of the text encoder, the denoising network, and the encoder in the variational autoencoder; a decoder fine-tuning module 24, used to fine-tune the decoder in the variational autoencoder using the concept-driven reconstruction dataset to obtain the concept-removed decoder; a deployment module 25, used to deploy the concept-removed decoder in the text-to-image diffusion model; and an image generation module 26, used to generate images with the text-to-image diffusion model based on input natural language prompts.

[0043] The above division of modules is merely illustrative. In practical applications, the functions described above can be assigned to different functional modules as needed to complete all or part of the functions described in the aforementioned method. The specific working process of each module can be found in the corresponding processes in the aforementioned method embodiments.

[0044] It should be understood that the methods and systems disclosed in the above embodiments of the present invention can be implemented in other ways. The various steps and modules in the present invention can be implemented as software functional units and stored in a computer-readable storage medium, including several instructions to cause a computer device to execute some or all of the steps of the method described in the present invention. For example, one embodiment of the present invention provides a computer device (computer, server, etc.) including a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program including instructions for performing the steps of the method of the present invention. For example, another embodiment of the present invention provides a computer-readable storage medium (such as ROM / RAM, disk, optical disk, etc.) storing a computer program, which, when executed by a computer, implements the steps of the method of the present invention. For example, another embodiment of the present invention provides a computer program product including a computer program, which, when executed by a computer, implements the steps of the method of the present invention.

[0045] Although specific embodiments of the invention have been disclosed for illustrative purposes to aid in understanding and implementing the invention, those skilled in the art will understand that various substitutions, variations, and modifications are possible without departing from the spirit and scope of the invention and the appended claims. Therefore, the invention should not be limited to the content disclosed in the preferred embodiments, and the scope of protection claimed by the invention is defined by the claims.

Claims

1. A text-to-image generation method based on latent space variational autoencoder decoder fine-tuning for concept removal, characterized in that, Includes the following steps: Identify the target concepts to be removed; Construct a concept-driven reconstruction dataset containing source images, transformed images, and regularized images that include the target concept; The parameters of the encoders in the text encoder, denoising network, and variational autoencoder are frozen. The decoder in the variational autoencoder is fine-tuned using the concept-driven reconstruction dataset to obtain the decoder after concept removal. The concept-removed decoder is deployed in the text-to-image diffusion model; Based on the input natural language prompts, an image is generated using a text-to-image diffusion model.

2. The method according to claim 1, characterized in that the target concept is at least one of the following: an identity concept, an object concept, and a style concept.

3. The method according to claim 1, characterized in that the transformed image is generated using a structure-preserving concept editing method; the source image and the transformed image are strictly aligned in spatial structure and differ only in the presence of the target concept; and the regularized image is an image that is semantically independent of the target concept.

4. The method according to claim 3, characterized in that, when the target concept is an identity concept, the transformed image is an image in which the face of the target person in the source image is replaced with an ordinary face using the Prompt-to-Prompt attention editing method, while the background, pose, and spatial layout remain unchanged.

5. The method according to claim 3, characterized in that, when the target concept is a style concept, the transformed image is obtained by using the MAST manifold alignment style transfer method to convert the target artistic style of the source image into a neutral realistic style, while the content and geometric structure of the image remain unchanged.

6. The method according to claim 1, characterized in that the decoder is fine-tuned using a sharpness-aware minimization optimization strategy, which constrains parameter updates by seeking a flat minimum region of the loss surface; the fine-tuning comprises: encoding the source images and the regularized images into latent variables; decoding the latent variables; computing a joint loss function comprising a concept removal loss and a truth-preserving regularization loss; and updating the decoder parameters until a preset number of iterations or a convergence condition is reached, thereby obtaining the concept-removed decoder.
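
As an illustration of the update described in claim 6, the following is a minimal sketch in plain Python, not the patented implementation. It assumes a toy two-parameter linear decoder (`decode`), hypothetical latent/target pairs, a squared-error form for both loss terms, and illustrative values for `lr`, `rho`, and `lam`; none of these appear in the claims. Each step perturbs the parameters along the normalized gradient and then updates the original parameters with the gradient taken at the perturbed point, which is the usual two-pass form of sharpness-aware minimization.

```python
import math

# Toy stand-in for the VAE decoder (assumption): theta = [w, b], output = w * z + b.
def decode(theta, z):
    w, b = theta
    return w * z + b

def joint_loss_and_grad(theta, removal_pairs, preserve_pairs, lam=1.0):
    """Joint loss = concept removal term + lam * preservation term (both MSE, assumed)."""
    loss = 0.0
    grad = [0.0, 0.0]
    for pairs, weight in ((removal_pairs, 1.0), (preserve_pairs, lam)):
        scale = weight / len(pairs)
        for z, target in pairs:
            r = decode(theta, z) - target      # residual against the desired output
            loss += scale * r * r
            grad[0] += scale * 2.0 * r * z     # d(loss)/dw
            grad[1] += scale * 2.0 * r         # d(loss)/db
    return loss, grad

def sam_step(theta, removal_pairs, preserve_pairs, lr=0.05, rho=0.05, lam=1.0):
    """One sharpness-aware minimization step on the decoder parameters only."""
    _, g = joint_loss_and_grad(theta, removal_pairs, preserve_pairs, lam)
    norm = math.sqrt(sum(gi * gi for gi in g)) + 1e-12
    # First pass: move to the approximate worst-case point within radius rho.
    perturbed = [t + rho * gi / norm for t, gi in zip(theta, g)]
    # Second pass: gradient at the perturbed point, applied to the original parameters.
    _, g_sam = joint_loss_and_grad(perturbed, removal_pairs, preserve_pairs, lam)
    return [t - lr * gi for t, gi in zip(theta, g_sam)]

# Hypothetical data: (latent, target) pairs. Latents are treated as fixed inputs,
# mirroring the frozen encoder.
removal_pairs = [(1.0, 0.5), (2.0, 1.0)]       # source latents -> transformed-image targets
preserve_pairs = [(-1.0, -1.0), (-2.0, -2.0)]  # regularized latents -> their own images

theta = [1.0, 0.0]                             # "original" decoder parameters
initial_loss, _ = joint_loss_and_grad(theta, removal_pairs, preserve_pairs)
for _ in range(200):                           # stop after a fixed number of iterations
    theta = sam_step(theta, removal_pairs, preserve_pairs)
final_loss, _ = joint_loss_and_grad(theta, removal_pairs, preserve_pairs)
print(initial_loss, final_loss)
```

In this toy setting the two loss terms cannot both reach zero, so the loop settles on a compromise between removing the concept and preserving unrelated content; `lam` controls that balance.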

7. The method according to claim 1, characterized in that the step of deploying the concept-removed decoder in the text-to-image diffusion model comprises: deploying the concept-removed decoder as a plug-and-play component in multiple downstream diffusion models based on the same variational autoencoder architecture, so that a decoder trained once can be reused across multiple models.
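
The plug-and-play deployment in claim 7 can be pictured with the following minimal sketch in plain Python. All names here (`VAEArch`, `Decoder`, `Pipeline`, `deploy_decoder`, and the two architecture fields) are hypothetical and chosen only for illustration; the claims do not specify how architecture compatibility is checked.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class VAEArch:
    """Hypothetical descriptor of a VAE architecture (fields are assumptions)."""
    latent_channels: int
    downsample_factor: int

@dataclass
class Decoder:
    arch: VAEArch
    tag: str

@dataclass
class Pipeline:
    """Hypothetical text-to-image pipeline; only the decoder slot is modeled."""
    name: str
    decoder: Decoder

def deploy_decoder(pipeline, new_decoder):
    """Swap in the concept-removed decoder if the VAE architectures match."""
    if pipeline.decoder.arch != new_decoder.arch:
        raise ValueError(
            f"{pipeline.name}: decoder architecture mismatch, cannot deploy"
        )
    pipeline.decoder = new_decoder
    return pipeline

# One fine-tuned decoder reused by several pipelines that share the same VAE.
shared_arch = VAEArch(latent_channels=4, downsample_factor=8)
erased = Decoder(shared_arch, "concept-removed")
pipelines = [
    Pipeline("model_a", Decoder(shared_arch, "original")),
    Pipeline("model_b", Decoder(shared_arch, "original")),
]
for p in pipelines:
    deploy_decoder(p, erased)
```

The text encoder and denoising network of each pipeline are untouched in this sketch; only the decoder slot changes, which is why the same fine-tuned component can serve every model built on the same variational autoencoder.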

8. A text-to-image generation system based on latent space variational autoencoder decoder fine-tuning for concept removal, characterized in that the system comprises: a concept determination module configured to determine a target concept to be removed; a data construction module configured to construct a concept-driven reconstruction dataset comprising source images that contain the target concept, transformed images, and regularized images; a parameter freezing module configured to freeze the parameters of the text encoder, the denoising network, and the encoder of the variational autoencoder; a decoder fine-tuning module configured to fine-tune the decoder of the variational autoencoder using the concept-driven reconstruction dataset to obtain a concept-removed decoder; a deployment module configured to deploy the concept-removed decoder in a text-to-image diffusion model; and an image generation module configured to generate an image from an input natural language prompt using the text-to-image diffusion model.

9. A computer device, characterized in that it comprises a memory and a processor, the memory storing a computer program configured to be executed by the processor, the computer program comprising instructions for performing the method of any one of claims 1 to 7.

10. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a computer, implements the method according to any one of claims 1 to 7.