A unified image generation and forgery detection method based on a generative-understanding framework

By using a unified multimodal Transformer architecture and a symbiotic multimodal self-attention mechanism, feature-level knowledge sharing and deep interaction between generation and detection tasks are achieved, solving the problem of separation between generation and detection, improving the accuracy and robustness of forged image discrimination, simplifying the model structure, and realizing collaborative optimization of generation and detection.

CN122265801A · Pending · Publication Date: 2026-06-23 · TSINGHUA UNIVERSITY

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TSINGHUA UNIVERSITY
Filing Date
2026-04-02
Publication Date
2026-06-23

AI Technical Summary

Technical Problem

In existing technologies, generation and detection models are separate and lack unified modeling. As a result, generation models are prone to producing forgery traces and detection models lack generalization ability, making it difficult to meet the security governance needs of AI-generated content.

Method used

A unified multimodal Transformer architecture is constructed, integrating a shared text encoder, detection branch, and generation branch. Cross-task knowledge transfer is achieved through co-occurring multimodal self-attention, and a multi-task joint loss function is used for overall training. Combined with a frozen detector and feature alignment mechanism, collaborative optimization of generation and detection is achieved.

Benefits of technology

It improves the accuracy and robustness of forged image detection, enhances the interpretability of detection decisions, simplifies the model structure, improves training and inference efficiency, and achieves simultaneous improvement in generation and detection.

✦ Generated by Eureka AI based on patent content.

Figure CN122265801A_ABST

Abstract

This invention discloses a unified image generation and forgery detection method based on a generative-understanding framework, relating to the fields of computer vision and artificial intelligence. The method includes: employing a unified multimodal Transformer as the backbone, integrating a shared encoder, detection branch, and generation branch, and achieving feature interaction through a shared layer; using the original image as input, introducing co-existing multimodal self-attention in the detection branch, fusing multi-source features and achieving cross-task knowledge transfer, and jointly training using multi-task loss; after training, freezing the detection branch as a supervision signal to align the generator's features, constraining its distribution to closely approximate real images; through two-stage optimization, the model can support detection interpretation and perceptual generation, outputting image forgery detection results and generated images, achieving collaborative optimization of generation and detection. This invention, through a unified architecture and bidirectional collaborative mechanism, achieves deep integration of generation and detection, solving the problems of their fragmentation, poor robustness, and difficulty in co-evolution.

Description

Technical Field

[0001] This invention relates to the fields of computer vision and artificial intelligence, and in particular to a unified image generation and forgery detection method and apparatus based on a generative-understanding framework.

Background Technology

[0002] With the rapid development of generative artificial intelligence technology, large-scale image generation models, represented by diffusion models and autoregressive models, have been continuously iterated and upgraded. Synthetic images have seen significant improvements in detail reproduction, texture realism, and semantic consistency. Some high-quality synthetic images have reached or even surpassed the intuitive recognition threshold of human vision, and are widely used in content creation, media dissemination, and digital entertainment. Meanwhile, technologies for authenticating and tracing the origins of AI-generated images have also evolved. Detection methods have gradually developed from early traditional classification methods based on manually designed features to dedicated detection models relying on deep neural networks. In recent years, integrated technical solutions based on multimodal large models, balancing detection and judgment with result interpretation, have emerged, providing technical support for the security supervision and trusted authentication of AI-generated content.

[0003] Although image generation and image detection have developed in a mutually restrictive and mutually reinforcing manner, most existing technologies design, train, and deploy generation and detection tasks as independent systems, resulting in significant disconnects in model architecture, feature representation, training mechanisms, and optimization objectives. On one hand, generation and detection models employ different network structures and learning paradigms, making it difficult to achieve low-level feature sharing and cross-task knowledge transfer. On the other hand, the two types of models are often trained separately on independent datasets, lacking bidirectional feedback and dynamic adaptation mechanisms. Detection models struggle to keep up with the distribution changes of new generation models, exhibiting insufficient generalization and lagging defense capabilities. Meanwhile, generation models lack detection constraints during training, easily generating synthetic images with obvious forgery traces that are easily identifiable. Furthermore, existing fusion methods are mostly unidirectional and localized posterior optimizations, failing to form a unified, end-to-end, and co-evolving technical system, thus hindering the simultaneous improvement of generation and detection capabilities.

[0004] The current disconnect between generation and detection technologies not only limits further performance breakthroughs for both types of models but also fails to meet increasingly stringent requirements for AI-generated content security governance, hindering the standardized, healthy, and sustainable development of generative artificial intelligence technology. Therefore, there is an urgent need for a unified framework that enables unified modeling, bidirectional feedback, and collaborative optimization of generation and understanding. This framework would address the task fragmentation problem at the architectural level, promoting the symbiotic evolution and synergistic improvement of image generation and detection technologies.

Summary of the Invention

[0005] The main objective of this invention is to provide a unified image generation and forgery detection method based on a generative-understanding framework.

[0006] Another objective of this invention is to propose a unified image generation and forgery detection device based on a generative-understanding framework.

[0007] The third objective of this invention is to provide an electronic device.

[0008] A fourth objective of this invention is to provide a non-transitory computer-readable storage medium.

[0009] To achieve the above objectives, a first aspect of the present invention proposes a unified image generation and forgery detection method based on a generative-understanding framework, comprising:

[0010] A unified multimodal Transformer architecture is constructed as the backbone network. The Transformer architecture integrates a shared text encoder, a detection branch, and a generation branch. The detection branch and the generation branch achieve unified representation and interaction between the generation task and the detection task by sharing an intermediate Transformer layer;
Using the original image as input, a co-existing multimodal self-attention model is embedded in the detection branch, which integrates detection features, text features, and generated latent representations. Cross-task knowledge transfer is achieved through cross-modal multi-head attention, and a multi-task joint loss function is used to train the unified multimodal Transformer architecture as a whole;
After the training of the unified multimodal Transformer architecture converges, the parameters of the detection branch are frozen to obtain a frozen detector, which is used as a supervision signal. The generator is trained with feature alignment to constrain the generator feature distribution to be consistent with that of real images, and forgery-sensitive features are weakened through a feature similarity loss;
After unified training and feature alignment optimization, the unified multimodal Transformer framework supports two working modes: detection interpretation and perceptual generation. It outputs the corresponding image forgery detection results and generated images, realizing collaborative optimization between image generation and forgery detection.

[0011] Optionally, embedding a co-existing multimodal self-attention model in the detection branch, fusing detection features, text features, and generated latent representations, and achieving cross-task knowledge transfer through cross-modal multi-head attention includes: introducing the co-existing multimodal self-attention model and embedding it layer by layer in the multi-layer Transformer structure of the detection branch; performing multimodal feature fusion on the detection features extracted by the detection branch, the text features output by the shared text encoder, and the generative latent representation output by the generation branch; and performing cross-modal multi-head attention computation with the detection features as attention query vectors and the fused multimodal joint features as key and value vectors, so that the detection features absorb, layer by layer, generative prior information and textual semantic information, thereby achieving feature-level knowledge transfer between the generation and detection tasks.

[0012] Optionally, training the unified multimodal Transformer architecture as a whole with a multi-task joint loss function includes: constructing the multi-task joint loss function as a weighted combination of a detection classification loss, an explanatory text generation loss, and a generative flow matching loss, where the detection classification loss is used to optimize the accuracy of real/fake classification, the explanatory text generation loss is used to optimize the soundness of the explanatory text, and the flow matching loss is used to optimize image generation quality; and performing end-to-end joint training on the shared structure, the detection branch, and the generation branch with the multi-task joint loss function, to achieve simultaneous optimization of detection and discrimination, interpretable analysis, and image generation capabilities.

[0013] Optionally, freezing the parameters of the detection branch after the training of the unified multimodal Transformer architecture converges, to obtain a frozen detector that is used as a supervision signal, includes: after the unified multimodal Transformer architecture completes the first stage of overall training and converges, fixing and freezing all model parameters of the detection branch, so that the frozen detection branch forms a frozen detector with stable discrimination capability; and using the frozen detector to perform forward inference on real images and extracting patch-level high-level abstract features from its deep Transformer structure, which form a standard representation of realism in the detector feature space, with the real image features used as prior supervision signals.

[0014] Optionally, training the generator with feature alignment to constrain the generator's feature distribution to be consistent with that of real images includes: extracting intermediate features at a specified intermediate layer and a specified time step during the generator's denoising diffusion sampling process; mapping the generator features to a feature space that matches the frozen detector using a lightweight trainable projection network; and, using a feature cosine similarity loss as the optimization objective, constraining the feature distribution of the generator to be consistent with the feature distribution of real images in the detector.

[0015] Optionally, weakening forgery-sensitive features through the feature similarity loss includes: combining the generator's own flow matching generation loss and the detector-guided feature alignment loss by weighting to form the overall optimization objective of the second stage; while maintaining the generator's text-semantic consistency and image generation quality, guiding the generator to learn to avoid the forgery feature subspace to which the detector is sensitive; and suppressing artifacts, texture anomalies, and statistical biases in generated images so that they approximate real images in both visual appearance and underlying distribution.

[0016] Optionally, the unified multimodal Transformer framework supports two working modes: detection interpretation and perceptual generation. It outputs corresponding image forgery detection results and generated images, achieving collaborative optimization and bidirectional iterative improvement, including: In the detection interpretation mode, it receives the image to be judged and the detection instruction input, outputs the image authenticity classification result, and simultaneously generates interpretable text to support the judgment result; In the perceptual generation mode, it receives text description input and outputs a high-fidelity synthesized image that is highly consistent with the semantics of the text, visually realistic, and resistant to detection. By forming a closed-loop evolutionary system through two-stage collaborative optimization, image generation and forgery detection are mutually promoted and iterated, thereby achieving a synergistic improvement in image generation and forgery detection capabilities.

[0017] To achieve the above objectives, a second aspect of the present invention provides a unified image generation and forgery detection apparatus based on a generative-understanding framework, comprising:
An architecture building module, used to build a unified multimodal Transformer architecture as the backbone network. The Transformer architecture integrates a shared text encoder, a detection branch, and a generation branch. The detection branch and the generation branch achieve unified representation and interaction between generation and detection tasks by sharing an intermediate Transformer layer;
A knowledge transfer module, used to embed a co-existing multimodal self-attention model in the detection branch, which integrates detection features, text features, and generated latent representations. It achieves cross-task knowledge transfer through cross-modal multi-head attention and uses a multi-task joint loss function to train the unified multimodal Transformer architecture as a whole;
A feature alignment module, used to freeze the parameters of the detection branch after the training of the unified multimodal Transformer architecture has converged, obtain the frozen detector and use it as a supervision signal to train the generator with feature alignment, constrain the generator feature distribution to be consistent with that of real images, and weaken forgery-sensitive features through a feature similarity loss;
A collaborative iteration module, used to achieve collaborative optimization and bidirectional iterative improvement between image generation and forgery detection through the unified multimodal Transformer framework, which, after unified training and feature alignment optimization, supports two working modes: detection interpretation and perceptual generation.

[0018] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0019] To achieve the above objectives, a third aspect of this application provides an electronic device, including a processor and a memory; wherein the processor runs a program corresponding to the executable program code stored in the memory, in order to implement a unified image generation and forgery detection method based on a generative-understanding framework as described in the first aspect embodiment.

[0020] To achieve the above objectives, a fourth aspect of this application proposes a non-transitory computer-readable storage medium storing a computer program that, when executed by a processor, implements a unified image generation and forgery detection method based on a generative-understanding framework as described in the first aspect embodiment.

[0021] The embodiments of the present invention have the following beneficial effects: 1. By adopting a unified multimodal Transformer architecture and a co-existing multimodal self-attention mechanism, feature-level knowledge sharing and deep interaction between generation and detection tasks are achieved, which significantly improves the detection model's accuracy, generalization ability and robustness in distinguishing forged images, while enhancing the interpretability of detection decisions.

[0022] 2. Through two-stage unified fine-tuning and multi-task joint optimization, image generation, authenticity discrimination and interpretable text output are completed simultaneously in a single model, simplifying the model structure, improving training and inference efficiency, and achieving multi-objective synergistic improvement.

[0023] 3. By relying on the detector-guided generation alignment mechanism, the authenticity prior is injected into the generation process, which constrains the generated image to approximate the real image in terms of feature distribution, effectively weakening forgery artifacts and statistical anomalies, and greatly improving the realism and anti-detection ability of the generated image.

[0024] 4. A closed-loop collaborative system for bidirectional iteration of generation and detection is constructed, breaking the traditional one-way adversarial relationship and achieving synchronous enhancement and collaborative evolution of the two, providing end-to-end and scalable technical support for addressing the security risks of AI-generated content.

Attached Figure Description

[0025] The above and/or additional aspects and advantages of the present invention will become apparent and readily understood from the following description of the embodiments taken in conjunction with the accompanying drawings, wherein:
Figure 1 is a flowchart of a unified image generation and forgery detection method based on a generative-understanding framework, provided in an embodiment of the present invention;
Figure 2 is a diagram of the co-occurrence multimodal detection generation architecture provided in an embodiment of the present invention;
Figure 3 is a flowchart of the detection generation layer feature matching process provided in an embodiment of the present invention;
Figure 4 is a structural diagram of a unified image generation and forgery detection device based on a generative-understanding framework, provided in an embodiment of the present invention.

Detailed Implementation

[0026] It should be noted that, unless otherwise specified, the embodiments and features described in the present invention can be combined with each other. The present invention will now be described in detail with reference to the accompanying drawings and embodiments.

[0027] To enable those skilled in the art to better understand the present invention, the technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings of the embodiments of the present invention. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. Based on the embodiments of the present invention, all other embodiments obtained by those skilled in the art without creative effort should fall within the scope of protection of the present invention.

[0028] The following description, with reference to the accompanying drawings, describes a unified image generation and forgery detection method and apparatus based on a generative-understanding framework according to embodiments of the present invention.

[0029] Example 1
This invention provides a unified image generation and forgery detection method based on a generative-understanding framework. Figure 1 is a flowchart of the method provided in an embodiment of the present invention. As shown in Figure 1, the method includes the following steps:
Step S1: Construct a unified multimodal Transformer architecture as the backbone network. The Transformer architecture integrates a shared text encoder, a detection branch, and a generation branch. The detection branch and the generation branch achieve unified representation and interaction between the generation task and the detection task by sharing an intermediate Transformer layer.

[0030] In this embodiment, the architecture is based on a unified multimodal Transformer that supports multi-task learning, integrating three core modules: a shared text encoder, a detection branch (also known as an understanding branch), and a generation branch. The detection branch is equipped with a dedicated image encoder (e.g., SigLIP) to extract highly discriminative visual features from the input image, providing the underlying visual basis for subsequent authenticity detection and interpretation analysis. The generation branch is equipped with an image encoder-decoder pair (e.g., FLUX VAE) to synthesize visual images from text semantics, ensuring high-quality image output. The detection and generation branches are not independent but interact deeply through a shared intermediate Transformer layer, enabling information transfer and knowledge fusion between the generation and detection tasks within a unified representation space. This provides underlying architectural support for cross-task collaborative optimization, as shown in Figure 2(a) and Figure 2(b).

[0031] Here, Figure 2(a) illustrates the generation-assisted forgery detection and interpretation process. In the detection scenario, the input image to be examined is fed in parallel into the understanding encoder and the generative encoder to obtain image tokens corresponding to the detection requirements and generative image tokens corresponding to generative modeling, respectively. Simultaneously, the user's text command (e.g., "Is this image real or fake?") is encoded into text tokens by the shared text encoder. These three types of tokens are then input into a co-occurring multimodal self-attention layer. After cross-modal feature fusion, they are further refined and transformed by a feedforward network. Finally, the detection branch, through the text head and classifier module, outputs a binary real/fake classification result and generates a natural language explanation closely tied to the judgment criteria, achieving interpretable output of the detection results.

[0032] Figure 2(b) illustrates the conditional image generation process. In the generation scenario, the user's text description (e.g., "winter desktop wallpaper") is encoded into text tokens by the shared text encoder, while the real reference image is encoded into noisy image tokens by the generative encoder. Both the text tokens and the noisy image tokens are input into a co-occurring multimodal self-attention layer and a feedforward network. After fusing text semantics and visual priors, the generation branch outputs the predicted target latent space velocity. This prediction is then aligned with the target latent space velocity corresponding to the real image using a mean squared error loss function, improving both the semantic consistency and the visual realism of the generated image. The two tasks share the same text encoder and core Transformer structure, reducing model parameter redundancy and laying the foundation for subsequent cross-task knowledge transfer and co-evolution.
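The velocity regression described for Figure 2(b) can be illustrated with a minimal sketch. It assumes a straight-line path between a noise sample and the clean latent, which the text does not specify; the function names `interpolate`, `target_velocity`, and `mse`, the toy latent values, and the time convention (t = 0 is noise, t = 1 is clean data) are all illustrative assumptions rather than details taken from the embodiment.

```python
import random


def interpolate(noise, clean, t):
    """Point on the assumed straight path from noise (t=0) to the clean latent (t=1)."""
    return [(1.0 - t) * n + t * c for n, c in zip(noise, clean)]


def target_velocity(noise, clean):
    """Velocity of the straight path; it points from the noise toward the clean latent."""
    return [c - n for n, c in zip(noise, clean)]


def mse(pred, target):
    """Mean squared error between predicted and target velocities."""
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(pred)


rng = random.Random(0)
clean = [0.5, -1.0, 2.0, 0.0]                 # toy latent of a real image (hypothetical values)
noise = [rng.gauss(0.0, 1.0) for _ in clean]  # Gaussian noise sample
t = 0.3
z_t = interpolate(noise, clean, t)            # noisy latent fed to the generation branch
v_target = target_velocity(noise, clean)      # regression target for the velocity head

perfect_loss = mse(v_target, v_target)        # a perfect predictor incurs zero loss
zero_loss = mse([0.0] * len(clean), v_target) # an uninformed predictor is penalized
```

Under this assumed path, following the target velocity from `z_t` for the remaining time `1 - t` lands exactly on the clean latent, which is why regressing the velocity is enough to drive sampling.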

[0033] Step S2: Using the original image as input, a co-occurring multimodal self-attention model is embedded in the detection branch, which integrates detection features, text features, and generated latent representations. Cross-task knowledge transfer is achieved through cross-modal multi-head attention, and the unified multimodal Transformer architecture is trained as a whole using a multi-task joint loss function.

[0034] In this embodiment, the co-existing multimodal self-attention model is the core component for realizing feature-level knowledge transfer between generation and detection tasks. It is integrated into each Transformer block of the detection branch. Its core purpose is to break down the feature barriers between generation and detection tasks within a unified framework, achieving efficient transfer and deep fusion of cross-modal knowledge. Specifically, the model first integrates three types of key features: the detection features of the current layer, the text features, and the latent representation from the corresponding layer of the generative encoder. These features are concatenated along the feature dimension to obtain the concatenated multimodal features. This concatenation operation enables the initial fusion of generation distribution information, text context information, and detection features, laying the foundation for subsequent cross-modal attention interaction.

[0035] In this embodiment, after feature concatenation, the model performs further feature interaction and information refinement through a cross-modal multi-head attention mechanism. Specifically, the detection features are used to form the query vector Q, and the concatenated multimodal features are used to form both the key vector K and the value vector V. Attention is calculated using the multi-head cross-attention formula, which is as follows:

$$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d_k}}\right) V$$

[0036] In this context, the query vector Q, key vector K, and value vector V are all obtained through their respective learnable projection matrices, ensuring matching of feature dimensions and effective extraction of feature information. $\sqrt{d_k}$, where $d_k$ is the dimension of the key vectors, is a normalization factor used to avoid problems such as excessively large attention values and vanishing gradients caused by excessively high feature dimensions. Through this cross-modal attention calculation, the image distribution features modeled by the generative model and the semantic information of the text command are integrated into the detection features. As shown in Figure 2(a), the co-occurring multimodal self-attention layer of the detection branch can perceive, layer by layer, the image distribution features modeled by the generative model, so that the detector no longer relies solely on the apparent visual features of the image to distinguish real from fake, but can also draw on the underlying distribution of generated images, thereby improving detection accuracy, generalization ability, and robustness to new generation methods. At the same time, it provides an explanatory basis for detection decisions that can be traced back to the generation logic, enhancing the interpretability of the model.
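The attention step above can be sketched in a few lines of standard-library Python. This is a single-head illustration, not the embodiment's implementation: it assumes the three feature streams have already been brought to the same number of tokens so that they can be concatenated along the feature dimension, and the sizes, the random weights, and the helper names `matmul`, `softmax`, `cross_modal_attention`, and `random_matrix` are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Multiply an (n x k) matrix by a (k x m) matrix, both given as nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax(row):
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_modal_attention(det, txt, gen, w_q, w_k, w_v):
    """Single-head attention: queries from detection features, keys and values from joint features."""
    # Concatenate the three streams along the feature dimension, token by token.
    joint = [d + x + g for d, x, g in zip(det, txt, gen)]
    q = matmul(det, w_q)                      # (tokens, d_k)
    k = matmul(joint, w_k)                    # (tokens, d_k)
    v = matmul(joint, w_v)                    # (tokens, d_k)
    d_k = len(q[0])
    k_t = [list(col) for col in zip(*k)]      # transpose of K
    scores = matmul(q, k_t)                   # (tokens, tokens)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


def random_matrix(rng, rows, cols):
    """Random matrix standing in for learned features or projection weights."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
tokens, dim, d_k = 3, 4, 2                    # hypothetical sizes
det = random_matrix(rng, tokens, dim)         # detection features
txt = random_matrix(rng, tokens, dim)         # text features
gen = random_matrix(rng, tokens, dim)         # generative latent representation
w_q = random_matrix(rng, dim, d_k)
w_k = random_matrix(rng, 3 * dim, d_k)
w_v = random_matrix(rng, 3 * dim, d_k)
fused, weights = cross_modal_attention(det, txt, gen, w_q, w_k, w_v)
```

Each row of `weights` is a probability distribution over tokens, and `fused` has one `d_k`-dimensional vector per detection token. A multi-head version would repeat this with separate projections per head and combine the results.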

[0037] Meanwhile, this embodiment employs a multi-task joint loss function to train the unified multimodal Transformer architecture as a whole, breaking the limitation of training generation and detection tasks independently, and achieving collaborative optimization and synchronous improvement of the three tasks of generation, detection, and interpretation. This multi-task joint loss function is a weighted sum of the detection loss, the interpretation loss, and the flow matching generation loss. The total loss is expressed as

$$\mathcal{L}_{\text{total}} = \lambda_{1}\,\mathcal{L}_{\text{det}} + \lambda_{2}\,\mathcal{L}_{\text{exp}} + \lambda_{3}\,\mathcal{L}_{\text{FM}}$$

where the hyperparameters $\lambda_{1}$, $\lambda_{2}$, and $\lambda_{3}$ are the training weights used to balance the three tasks. They are usually initialized to 1 in experiments so that each training objective is treated equally and the three tasks advance and improve together.

[0038] Among them, the detection loss $\mathcal{L}_{\text{det}}$ is a binary cross-entropy loss, mainly used to optimize the accuracy of real/fake classification. It corresponds to the optimization of the real/fake binary labels output by the classifier in Figure 2(a). Under the constraint of this loss, the model learns the feature differences between real and fake images, reducing the misclassification rate and improving the reliability of the classification. The explanation loss $\mathcal{L}_{\text{exp}}$ is an autoregressive language modeling loss (such as a negative log-likelihood loss), used to train the model to generate natural language explanations that are closely tied to the judgment criteria, logically clear, and accurately expressed. It corresponds to the optimization of explanatory text generation in Figure 2(a), ensuring that the explanatory text states the visual basis for the model's real/fake judgment (such as generation artifacts or inconsistent lighting) and enhancing the interpretability and credibility of the detection results. The flow matching loss $\mathcal{L}_{\text{FM}}$ is used to train the generation branch. At time step t of the denoising process, the model must predict, from the noisy latent variable, the flow field pointing toward the clean data. The core objective of this loss is to minimize the mean squared error between the predicted flow field and the actual flow field. It corresponds to the optimization in Figure 2(b) of the mean squared error between the target latent space velocity and the predicted latent space velocity. This loss constraint ensures the generation quality of the generation branch, so that the generated image maintains a high degree of visual consistency with the real image.
Through the above multi-task joint optimization strategy, the embodiments of this application achieve the synchronous advancement and synergistic improvement of the generation, detection, and interpretation tasks, laying a foundation for the second stage of the two-stage training.
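As a rough illustration of how the three terms combine, the following sketch computes each loss on toy values and sums them with unit weights, as the text says is usual at initialization. The function names, the toy probabilities, and the two-element velocity vectors are assumptions made for the example; a real system would compute these quantities from model outputs over a batch.

```python
import math


def binary_cross_entropy(p_fake, label):
    """Detection loss for one image; label is 1 for fake and 0 for real."""
    eps = 1e-12
    p = min(max(p_fake, eps), 1.0 - eps)      # clamp to keep the logarithm finite
    return -(label * math.log(p) + (1 - label) * math.log(1.0 - p))


def explanation_nll(token_probs):
    """Mean negative log-likelihood of the reference explanation tokens."""
    return -sum(math.log(p) for p in token_probs) / len(token_probs)


def flow_matching_mse(v_pred, v_target):
    """Mean squared error between predicted and target latent velocities."""
    return sum((p - q) ** 2 for p, q in zip(v_pred, v_target)) / len(v_pred)


def joint_loss(l_det, l_exp, l_fm, w_det=1.0, w_exp=1.0, w_fm=1.0):
    """Weighted sum of the three task losses; all weights default to 1."""
    return w_det * l_det + w_exp * l_exp + w_fm * l_fm


l_det = binary_cross_entropy(p_fake=0.9, label=1)   # classifier is fairly sure the image is fake
l_exp = explanation_nll([0.5, 0.25])                # probabilities assigned to two reference tokens
l_fm = flow_matching_mse([1.0, 0.0], [0.0, 0.0])    # toy velocity prediction and target
total = joint_loss(l_det, l_exp, l_fm)
```

Changing `w_det`, `w_exp`, or `w_fm` shifts the balance between the tasks; equal weights treat the three objectives alike.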

[0039] Step S3: After the unified multimodal Transformer architecture training converges, the parameters of the detection branch are frozen to obtain the frozen detector and used as a supervision signal. The generator is trained with feature alignment to constrain the generator feature distribution to be consistent with the real image. The forgery sensitive features are weakened by feature similarity loss.

[0040] After the unified multimodal Transformer architecture completes the first stage of joint fine-tuning and training convergence, this embodiment of the application moves to a targeted generator optimization stage. Specifically, the parameters of the trained and stable detection branch are first frozen, ensuring their stability during subsequent training. This results in a frozen detector with fixed performance and high discriminative power. This frozen detector no longer participates in parameter updates but is instead assigned the role of a "realistic teacher," serving as a strong supervisory signal and core guidance for refined feature alignment training of the generation branch, i.e., the generator. The aim is to fundamentally improve the anti-detection capability and realism of the generated images.
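The effect of freezing can be sketched with a toy update rule. The example below is only an illustration under simplifying assumptions: parameters are single scalars, the update is plain gradient descent, and the parameter names, gradient values, and the `sgd_step` helper are hypothetical.

```python
def sgd_step(params, grads, frozen, lr=0.1):
    """Return updated parameters; names listed in `frozen` are left untouched."""
    return {
        name: value if name in frozen else value - lr * grads[name]
        for name, value in params.items()
    }


# Toy scalar parameters for the three components involved in the second stage.
params = {"detector.w": 0.8, "generator.w": -0.3, "projector.w": 0.1}
grads = {"detector.w": 0.5, "generator.w": 0.2, "projector.w": -0.4}
frozen = {"detector.w"}                      # the detection branch no longer receives updates

updated = sgd_step(params, grads, frozen)
```

Only the generator and the projection network change after the step, while the detector keeps the weights it converged to in the first stage and can therefore act as a fixed reference.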

[0041] In the specific implementation of feature alignment training, this scheme constructs a dual-path feature acquisition and comparison mechanism. On the one hand, a batch of real image samples is input to the aforementioned frozen detector, and through forward propagation, patch-level deep features are extracted from its final-layer Transformer block. These patch-level features embody the detector's deep abstraction of the inherent distribution characteristics of real images. They are high-dimensional feature vectors that characterize the attributes of real images and serve as the reference features for the real-image distribution in the embodiments of this application.

[0042] On the other hand, during the generator's denoising sampling and iterative generation process, intermediate features are taken from a chosen intermediate layer at a specific time step t and used as the target features for alignment. Because the generator and the detector differ inherently in network structure, feature dimensions, and semantic representation space, direct feature matching would suffer from dimensionality mismatch and semantic misalignment. This application therefore introduces a lightweight, trainable projection network that serves as a bridge for feature mapping. It receives the generator's intermediate features as input and, through its learnable mapping parameters, projects them into the target space of the detector features, where the spatial dimensions and semantics match, yielding the projected generator features.

[0043] Building on the above, this application defines a detector-guided generator alignment loss (LDIGA) that forces the generator features to align with real-image features in terms of statistical distribution. Specifically, this alignment loss uses cosine similarity as the measure of feature similarity (the formula is not reproduced here). The training objective is to minimize this loss, that is, to maximize the cosine similarity between the projected generator features and the real-image features extracted by the frozen detector. This design forces the generator to learn and approximate the distribution of real images in the detector's feature space during generation, thereby weakening its modeling of the forgery-sensitive features that the detector identifies easily.
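The source text does not reproduce the formula for this loss, so the sketch below shows only one plausible form: one minus the mean patch-wise cosine similarity between projected generator features and frozen-detector features. The `1 - cos` form, the single linear layer used as the projection network, and the one-to-one pairing of patches are all assumptions made for illustration.

```python
import math

def project(h, weight, bias):
    """Linear projection of one generator patch feature into the detector space.
    weight has shape (d_det, d_gen); a single linear layer is an assumption."""
    return [sum(w * x for w, x in zip(row, h)) + b for row, b in zip(weight, bias)]

def cosine_similarity(u, v, eps=1e-8):
    """Cosine similarity of two vectors; eps guards against zero norms."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / max(norm_u * norm_v, eps)

def alignment_loss(gen_patches, det_patches, weight, bias):
    """Assumed form: 1 - mean cosine similarity over paired patches."""
    sims = [cosine_similarity(project(h, weight, bias), f)
            for h, f in zip(gen_patches, det_patches)]
    return 1.0 - sum(sims) / len(sims)

# Toy example: 3-D generator features projected to a 2-D detector space.
weight = [[1.0, 0.0, 0.0],
          [0.0, 1.0, 0.0]]
bias = [0.0, 0.0]
gen_patches = [[1.0, 0.0, 5.0], [0.0, 2.0, 7.0]]
aligned = alignment_loss(gen_patches, [[2.0, 0.0], [0.0, 1.0]], weight, bias)
misaligned = alignment_loss(gen_patches, [[0.0, 3.0], [0.0, 1.0]], weight, bias)
print(aligned, misaligned)
```

Because cosine similarity ignores vector magnitude, a loss of this kind constrains the direction of the features in the detector space rather than their scale.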

[0044] As shown in Figure 3, the feature connection and loss calculation process of this embodiment illustrates the entire optimization chain. The input image is fed into both the understanding encoder and the generation encoder. The feature map output by the understanding encoder passes through multiple detection layers for feature extraction, while the feature map output by the generation encoder is transformed and iterated through multiple generation layers. Multi-node feature connection channels are established between the detection layers and the generation layers to enable real-time interaction and monitoring of features at different levels. During training, the system calculates and backpropagates two core losses based on these feature connections: the alignment loss, which constrains the consistency of feature distribution between the detection branch and the generation branch, and the flow matching loss, which maintains the basic generation capability and data distribution prior of the generation branch. Through this collaborative optimization of the two losses, the real-image discrimination knowledge contained in the detector is transferred to the generator, guiding the generator to avoid the feature subspace in its latent space that the detector is sensitive to and that marks "forgery". Ultimately, the generator can produce high-fidelity synthetic images that are very close to real images in visual quality and statistical distribution and are difficult for traditional detection methods to detect.
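The second-stage objective described here can be written compactly as follows. This is a sketch only: the weighting coefficient λ, the parameter symbols θ_G (generator), φ (projection network), and θ_D (frozen detector), and the additive form are notational assumptions, since the source states only that the two losses are optimized together.

```latex
\min_{\theta_G,\,\phi}\;
  \mathcal{L}_{\text{FM}}(\theta_G)
  \;+\; \lambda\,\mathcal{L}_{\text{DIGA}}\bigl(\theta_G, \phi \,;\, \theta_D^{\text{frozen}}\bigr)
```

Only the generator and the projection network appear under the minimization, which reflects the statement that the frozen detector no longer participates in parameter updates.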

[0045] Step S4: After unified training and feature alignment optimization, the unified multimodal Transformer framework supports two working modes: detection interpretation and perceptual generation. It outputs the corresponding image forgery detection results and generated images, realizing the collaborative optimization between image generation and forgery detection.

[0046] After the first stage of unified fine-tuning of generation and detection and the second stage of feature alignment guided by detector, the unified multimodal Transformer framework described in this application has completed comprehensive training and optimization. It has stable and efficient task processing capabilities and can simultaneously support the two core working modes of detection interpretation and perceptual generation. It fully demonstrates the flexibility and practicality of the unified architecture of this application and realizes deep collaboration and bidirectional empowerment of the two major tasks of image generation and forgery detection.

[0047] In the detection interpretation mode, as shown in Figure 2(a), the model addresses authenticity identification and interpretability analysis of generated images, meeting the need for credible and transparent detection results in real-world scenarios. In a specific implementation, the user only needs to input the target image to be checked together with a corresponding detection text instruction (e.g., "Is this image real or fake?" or "Point out the forgery traces in this image"), and the model starts the detection process. The image to be checked is input in parallel to the understanding encoder and the generation encoder, while the text instruction is converted into semantic features by the shared text encoder. The three types of features (detection features, generation latent representation, and text features) are input together to the co-occurring multimodal self-attention layer, which fuses the cross-modal information and uses both the image distribution knowledge modeled by the generation model and the semantic information of the text instruction to assist the detection decision. After fusion, the model outputs a binary real/fake classification result through the classifier of the detection branch and, at the same time, generates a corresponding natural language explanation through the text head. This explanation describes the core basis for the judgment, including but not limited to visual cues such as generation artifacts, inconsistent lighting intensity, unbalanced object proportions, and semantic or logical contradictions in the image. This makes the detection decision traceable and verifiable, improving the credibility and interpretability of the detection results and addressing the weakness of traditional detection models, which "only judge the result and do not provide explanation".
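The fusion step can be sketched as attention in which the detection features act as queries and the combined detection, text, and generation tokens act as keys and values, as the claims describe. The sketch below is a simplification under stated assumptions: it uses a single head, omits the learned query/key/value projections, and forms the joint features by concatenating the token lists, none of which is specified in this form by the source.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_modal_attention(det_tokens, text_tokens, gen_tokens):
    """Single-head scaled dot-product attention.

    Queries: detection tokens. Keys and values: the concatenation of detection,
    text, and generation-latent tokens (concatenation is an assumption).
    All tokens are lists of floats with the same dimension d.
    """
    joint = det_tokens + text_tokens + gen_tokens
    d = len(det_tokens[0])
    fused = []
    for q in det_tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in joint]
        weights = softmax(scores)
        fused.append([sum(w * v[j] for w, v in zip(weights, joint)) for j in range(d)])
    return fused

# Toy example: one detection token, one text token, one generation token.
fused = cross_modal_attention([[0.0, 0.0]], [[3.0, 0.0]], [[0.0, 3.0]])
print(fused)
```

The output has one fused vector per detection token, so it can be passed on to the classifier and text head in place of the original detection features.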

[0048] In the perceptual generation mode, as shown in Figure 2(b), the model performs high-quality, highly realistic conditional image generation, covering a range of text-driven image synthesis needs. In a specific implementation, the user inputs a clear text description (e.g., "winter desktop wallpaper, snowy forest, warm-toned lighting, high-definition texture"). The shared text encoder encodes this description into structured text semantic features. The model then combines the image generation prior of the generation encoder, inputting the text semantic features and the latent representation into the co-occurring multimodal self-attention layer to fuse text semantics with image generation logic, which keeps the generated image semantically consistent with the text description. Because the generator undergoes detector-guided feature alignment optimization in the second stage, its generation process not only follows the natural distribution of image data but also actively avoids the forgery feature subspace to which the detector is sensitive. As a result, the generated image has high visual fidelity and is statistically closer to real images, making it harder for specialized forgery detection tools to identify than the output of traditional generation models.

[0049] Through the collaborative operation of the two working modes described above, the embodiments of this application achieve collaborative optimization and bidirectional iterative improvement between image generation and forgery detection: the detection model provides authenticity supervision for the generation model, guiding it to generate more realistic and more detection-resistant images; the generation model injects underlying distribution knowledge into the detection model, improving its accuracy and robustness. Together they form a self-iterating, self-optimizing generation-detection closed-loop system. This provides effective technical support for building a safe, reliable, and interpretable generative AI ecosystem, and demonstrates the core advantages and application value of this application's unified generation-understanding framework.

[0050] Example 2: This invention provides a unified image generation and forgery detection device based on a generation-understanding framework. Figure 4 is a schematic flowchart of the unified image generation and forgery detection device based on a generation-understanding framework provided in an embodiment of the present invention. As shown in Figure 4, the device includes: an architecture building module 100, used to build a unified multimodal Transformer architecture as the backbone network, wherein the Transformer architecture integrates a shared text encoder, a detection branch, and a generation branch, and the detection branch and the generation branch achieve unified representation and interaction between the generation task and the detection task by sharing an intermediate Transformer layer; a knowledge transfer module 200, used to take the original image as input, embed a co-existing multimodal self-attention model in the detection branch, fuse detection features, text features, and generative latent representations, achieve cross-task knowledge transfer through cross-modal multi-head attention, and train the unified multimodal Transformer architecture as a whole with a multi-task joint loss function; a feature alignment module 300, used to freeze the parameters of the detection branch after training of the unified multimodal Transformer architecture converges, obtain a frozen detector and use it as a supervision signal to train the generator with feature alignment, constrain the generator's feature distribution to be consistent with that of real images, and weaken forgery-sensitive features through a feature similarity loss; and a collaborative iteration module 400, used so that, after the two-stage optimization of unified training and feature alignment, the unified multimodal Transformer framework supports two working modes, detection interpretation and perceptual generation, and outputs the corresponding image forgery detection results and generated images, realizing collaborative optimization between image generation and forgery detection.

[0051] Regarding the apparatus in the above embodiments, the specific manner in which each module performs its operation has been described in detail in the embodiments related to the method, and will not be elaborated upon here.

[0052] Example 3: To implement the methods of the above embodiments, the present invention also provides an electronic device, which includes a memory and a processor, wherein the processor reads executable program code stored in the memory to run a program corresponding to the executable program code, so as to implement the steps of the methods described above.

[0053] Example 4: To implement the above embodiments, this application also proposes a non-transitory computer-readable storage medium having a computer program stored thereon, which, when executed by a processor, implements the method described in the foregoing embodiments.

[0054] The above description is merely a preferred embodiment of the present invention and is not intended to limit the invention. Various modifications and variations can be made to the present invention by those skilled in the art. Any modifications, equivalent substitutions, improvements, etc., made within the spirit and principles of the present invention should be included within the scope of protection of the present invention.

[0055] In the description of this specification, the references to terms such as "one embodiment," "some embodiments," "example," "specific example," or "some examples," etc., indicate that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present invention. In this specification, the illustrative expressions of the above terms do not necessarily refer to the same embodiment or example. Furthermore, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in one or more embodiments or examples. Moreover, without contradiction, those skilled in the art can combine and integrate the different embodiments or examples described in this specification, as well as the features of different embodiments or examples.

[0056] Furthermore, the terms "first" and "second" are used for descriptive purposes only and should not be construed as indicating or implying relative importance or implicitly specifying the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one of that feature. In the description of this invention, "a plurality of" means at least two, such as two, three, etc., unless otherwise explicitly specified.

Claims

1. A unified image generation and forgery detection method based on a generation-understanding framework, characterized in that it comprises: constructing a unified multimodal Transformer architecture as the backbone network, wherein the Transformer architecture integrates a shared text encoder, a detection branch, and a generation branch, and the detection branch and the generation branch achieve unified representation and interaction between the generation task and the detection task by sharing an intermediate Transformer layer; taking the original image as input, embedding a co-existing multimodal self-attention model in the detection branch to fuse detection features, text features, and generative latent representations, achieving cross-task knowledge transfer through cross-modal multi-head attention, and training the unified multimodal Transformer architecture as a whole with a multi-task joint loss function; after training of the unified multimodal Transformer architecture converges, freezing the parameters of the detection branch to obtain a frozen detector, using the frozen detector as a supervision signal to train the generator with feature alignment, constraining the generator's feature distribution to be consistent with that of real images, and weakening forgery-sensitive features through a feature similarity loss; and, after unified training and feature alignment optimization, the unified multimodal Transformer framework supporting two working modes, detection interpretation and perceptual generation, and outputting the corresponding image forgery detection results and generated images, thereby realizing collaborative optimization between image generation and forgery detection.

2. The method according to claim 1, characterized in that embedding a co-existing multimodal self-attention model in the detection branch to fuse detection features, text features, and generative latent representations, and achieving cross-task knowledge transfer through cross-modal multi-head attention, comprises: introducing a co-existing multimodal self-attention model and embedding it layer by layer in the multi-layer Transformer structure of the detection branch; performing multimodal feature fusion on the detection features extracted by the detection branch, the text features output by the shared text encoder, and the generative latent representation output by the generation branch; and, using the detection features as attention query vectors and the fused multimodal joint features as key and value vectors, performing cross-modal multi-head attention computation, so that the detection features absorb generative prior information and textual semantic information layer by layer, thereby achieving feature-level knowledge transfer between the generation and detection tasks.

3. The method according to claim 2, characterized in that training the unified multimodal Transformer architecture as a whole with a multi-task joint loss function comprises: constructing a multi-task joint loss function consisting of a weighted combination of a detection classification loss, an explanation text generation loss, and a generation flow matching loss, wherein the detection classification loss is used to optimize the accuracy of real/fake classification, the explanation text generation loss is used to optimize the rationality of the explanatory text, and the flow matching loss is used to optimize image generation quality; and performing end-to-end joint training on the shared structure, the detection branch, and the generation branch with the multi-task joint loss function, to achieve simultaneous optimization of detection and discrimination, interpretable analysis, and image generation capabilities.

4. The method according to claim 3, characterized in that freezing the parameters of the detection branch after training of the unified multimodal Transformer architecture converges, to obtain a frozen detector used as a supervision signal, comprises: after the unified multimodal Transformer architecture completes the first stage of overall training and converges, fixing and freezing all model parameters of the detection branch, so that the frozen detection branch forms a frozen detector with stable discrimination capability; and performing forward inference on real images with the frozen detector, extracting patch-level high-level abstract features from its deep Transformer structure to form a standard representation of realism in the detector feature space, and using the real-image features as prior supervision signals.

5. The method according to claim 4, characterized in that training the generator with feature alignment to constrain its feature distribution to be consistent with that of real images comprises: extracting intermediate features at a specified intermediate layer and a specified time step during the generator's denoising diffusion sampling process; mapping the generator features to a feature space that matches the frozen detector using a lightweight trainable projection network; and, using a feature cosine similarity loss as the optimization objective, constraining the feature distribution of the generator to be consistent with the feature distribution of real images in the detector.

6. The method according to claim 5, characterized in that weakening forgery-sensitive features through the feature similarity loss comprises: weighting and fusing the generator's own flow matching generation loss with the detector-guided feature alignment loss to form the overall optimization objective of the second stage; while maintaining the generator's text-semantic consistency and image generation quality, guiding the generator to learn to avoid the forgery feature subspace to which the detector is sensitive; and suppressing artifacts, texture anomalies, and statistical biases in generated images, so that they approximate real images in both visual appearance and underlying distribution.

7. The method according to claim 6, characterized in that the unified multimodal Transformer framework supporting two working modes, detection interpretation and perceptual generation, outputting the corresponding image forgery detection results and generated images, and achieving collaborative optimization and bidirectional iterative improvement, comprises: in the detection interpretation mode, receiving the image to be judged and a detection instruction as input, outputting an image authenticity classification result, and simultaneously generating interpretable text that supports the result; in the perceptual generation mode, receiving a text description as input and outputting a high-fidelity synthesized image that is highly consistent with the semantics of the text, visually realistic, and resistant to detection; and forming a closed-loop evolutionary system through the two-stage collaborative optimization, in which image generation and forgery detection promote each other and iterate, thereby achieving a synergistic improvement in image generation and forgery detection capabilities.

8. A unified image generation and forgery detection device based on a generation-understanding framework, characterized in that it comprises: an architecture building module, used to build a unified multimodal Transformer architecture as the backbone network, wherein the Transformer architecture integrates a shared text encoder, a detection branch, and a generation branch, and the detection branch and the generation branch achieve unified representation and interaction between the generation and detection tasks by sharing an intermediate Transformer layer; a knowledge transfer module, used to take the original image as input, embed a co-existing multimodal self-attention model in the detection branch, fuse detection features, text features, and generative latent representations, achieve cross-task knowledge transfer through cross-modal multi-head attention, and train the unified multimodal Transformer architecture as a whole with a multi-task joint loss function; a feature alignment module, used to freeze the parameters of the detection branch after training of the unified multimodal Transformer architecture has converged, obtain a frozen detector and use it as a supervision signal to train the generator with feature alignment, constrain the generator's feature distribution to be consistent with that of real images, and weaken forgery-sensitive features through a feature similarity loss; and a collaborative iteration module, used so that, after the two-stage optimization of unified training and feature alignment, the unified multimodal Transformer framework supports two working modes, detection interpretation and perceptual generation, and outputs the corresponding image forgery detection results and generated images, realizing collaborative optimization between image generation and forgery detection.

9. An electronic device, characterized in that it comprises a processor and a memory, wherein the processor reads executable program code stored in the memory to run a program corresponding to the executable program code, so as to implement the method as described in any one of claims 1-7.

10. A non-transitory computer-readable storage medium having a computer program stored thereon, characterized in that, when the program is executed by a processor, it implements the method as described in any one of claims 1-7.