Image detection model training method, image detection method and device

By training the image encoder and decoder, and using AI-generated image datasets for pre-training and image tampering datasets for fine-tuning, the high false alarm rate and low accuracy of AI-generated image tampering detection in existing technologies are solved, thereby improving the generalization and accuracy of the detection system.

CN122244652A · Pending · Publication Date: 2026-06-19 · TENCENT TECHNOLOGY (SHENZHEN) CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
TENCENT TECHNOLOGY (SHENZHEN) CO LTD
Filing Date
2024-12-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing image tampering detection technologies are ineffective in identifying and locating AI-generated image tampering, exhibiting problems such as high false alarm rates, low recognition accuracy, and insufficient generalization ability.

Method used

The visual and text representations of the first training sample are obtained and used to update the parameters of the image encoder and text encoder. The image encoder is pre-trained using a dataset of AI-generated images, and the tamper detection capability of the decoder is improved by fine-tuning the decoder on an image tampering dataset.

Benefits of technology

It improves the few-shot learning ability and generalization of the image detection system, and enhances the detection accuracy of AI-generated image tampering.

✦ Generated by Eureka AI based on patent content.


Abstract

This application provides an image detection model training method, an image detection method, and an apparatus, relating to the field of artificial intelligence technology. The model training method includes: acquiring a first training sample, including a first image sample and a first label; inputting the first image sample into an image encoder to obtain a first visual representation of the first image sample, and inputting it into a text encoder to obtain a text representation of the first label; updating the parameters of the image encoder and the text encoder based on the first visual representation and the text representation; acquiring a second training sample, including a second image sample and a second label; inputting the second image sample into an image encoder and a decoder to obtain a detection result of the second image sample, and updating the parameters of the decoder based on the detection result and the second label; and outputting the trained image encoder and decoder as an image detection model. This application can improve the few-shot learning ability of image detection systems and improve the generalization and accuracy of image detection.

Description

Technical Field

[0001] This application relates to the field of artificial intelligence technology, and in particular to an image detection model training method, an image detection method, and an apparatus.

Background Technology

[0002] Image tampering detection and localization is an important research area, especially in content security and forensics. Artificial intelligence (AI)-based image tampering is characterized by low tampering costs, difficulty in identification and evidence collection, and rapid online dissemination, posing a serious potential threat to personal privacy, public opinion guidance, cognitive competition, and national security.

[0003] Image tampering detection and localization involves detecting whether an image has been tampered with and segmenting the tampered region. In this field, mainstream methods employ traditional image processing algorithms and multimodal feature fusion techniques. These methods typically use manually designed traditional features for filtering or camera fingerprints for matching. Many studies also focus on deep learning approaches. Deep learning methods automatically learn tampering features from large amounts of image data. They usually take color or frequency domain features of the image and input them into a feature encoding network. The encoded features are then decoded to obtain a tampering probability map, which is statistically measured and mapped to a binary classification probability to obtain the final tampering classification probability. In mainstream deep learning methods, the training dataset usually consists of manually tampered photos (e.g., copy-move, splicing, etc.).
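The last step of the pipeline described above, in which a tampering probability map is summarized into a single image-level probability, can be sketched as follows. This is a minimal illustration under stated assumptions, not a method taken from this application: the pooling rule (mean of the highest-scoring pixels), the function names `image_level_probability` and `binarize`, and the parameters `top_fraction` and `threshold` are all hypothetical.

```python
from typing import List

ProbabilityMap = List[List[float]]  # per-pixel tampering probabilities in [0, 1]


def image_level_probability(prob_map: ProbabilityMap, top_fraction: float = 0.1) -> float:
    """Summarize a probability map as the mean of its highest-scoring pixels.

    Assumption: top-k mean pooling stands in for the unspecified
    "statistical measurement" step; max or mean pooling are alternatives.
    """
    pixels = sorted((p for row in prob_map for p in row), reverse=True)
    if not pixels:
        raise ValueError("probability map is empty")
    k = max(1, int(len(pixels) * top_fraction))
    return sum(pixels[:k]) / k


def binarize(prob_map: ProbabilityMap, threshold: float = 0.5) -> List[List[int]]:
    """Turn the probability map into a 0/1 tampered-region mask."""
    return [[1 if p >= threshold else 0 for p in row] for row in prob_map]


example_map = [[0.1, 0.2], [0.9, 0.8]]
score = image_level_probability(example_map, top_fraction=0.5)  # mean of 0.9 and 0.8
mask = binarize(example_map)
```

Pooling only the top-scoring pixels keeps a small tampered region from being averaged away by a large untouched background, which is one reason a plain mean is often a poor image-level summary.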

[0004] The development of Artificial Intelligence Generated Content (AIGC) has made image tampering exceptionally easy, posing challenges to image tampering detection and localization. How to detect tampering in AIGC-based images is a pressing issue.

Summary of the Invention

[0005] This application provides an image detection model training method, an image detection method, and an apparatus, which can improve the few-shot learning ability of the image detection system, as well as improve the generalization and accuracy of image detection.

[0006] In a first aspect, embodiments of this application provide an image detection model training method, including:

[0007] Obtain a first training sample, the first training sample including at least one first image sample and at least one first label corresponding to the at least one first image sample, the first label being used to indicate whether the first image sample is an image generated based on artificial intelligence (AI);

[0008] The first image sample is input into the image encoder to obtain the first visual representation of the first image sample, and the first label is input into the text encoder to obtain the text representation of the first label; and the parameters of the image encoder and the text encoder are updated according to the first visual representation and the text representation to obtain the trained image encoder.

[0009] Obtain a second training sample, which includes a second image sample and a second label for the second image sample. The second label is used to indicate whether the second image sample is real or a tampered area mask of the second image sample.

[0010] The second image sample is input into the trained image encoder and decoder to obtain the detection result of the second image sample, and the parameters of the decoder are updated according to the detection result and the second label to obtain the trained decoder.

[0011] The trained image encoder and the trained decoder are output as the image detection model.
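The two training stages listed above can be sketched with a deliberately tiny toy model. Everything concrete here is an assumption made for illustration: an "image" is a single float feature, the image encoder is one weight (`enc_w`), the text encoder is one scalar embedding per label text (`txt_real`, `txt_fake`), the decoder is a logistic head (`dec_w`, `dec_b`), and gradients are estimated by finite differences. The point is only the update schedule: stage one updates both encoders, stage two updates the decoder alone.

```python
import math


def stage1_loss(p, samples):
    """Cross-entropy over image-to-label-text similarities (stage one)."""
    total = 0.0
    for x, label in samples:
        v = p["enc_w"] * x  # first visual representation
        logits = {"real": v * p["txt_real"], "fake": v * p["txt_fake"]}
        m = max(logits.values())
        log_z = m + math.log(sum(math.exp(s - m) for s in logits.values()))
        total += log_z - logits[label]
    return total / len(samples)


def stage2_loss(p, samples):
    """Binary cross-entropy of the decoder output against the second label."""
    total = 0.0
    for x, tampered in samples:
        v = p["enc_w"] * x  # produced by the already-trained encoder
        prob = 1.0 / (1.0 + math.exp(-(p["dec_w"] * v + p["dec_b"])))
        prob = min(max(prob, 1e-9), 1.0 - 1e-9)
        total -= tampered * math.log(prob) + (1 - tampered) * math.log(1.0 - prob)
    return total / len(samples)


def train(p, trainable, loss_fn, samples, steps=200, lr=0.1, eps=1e-5):
    """Gradient descent that touches only the parameters named in `trainable`."""
    for _ in range(steps):
        grads = {}
        for name in trainable:
            original = p[name]
            p[name] = original + eps
            up = loss_fn(p, samples)
            p[name] = original - eps
            down = loss_fn(p, samples)
            p[name] = original  # restore before moving to the next parameter
            grads[name] = (up - down) / (2 * eps)
        for name in trainable:
            p[name] -= lr * grads[name]


params = {"enc_w": 0.5, "txt_real": 0.1, "txt_fake": -0.1, "dec_w": 0.0, "dec_b": 0.0}
first_samples = [(1.0, "real"), (0.8, "real"), (-1.0, "fake"), (-0.7, "fake")]
second_samples = [(1.0, 0), (0.9, 0), (-1.0, 1), (-0.8, 1)]  # 1 = tampered

# Stage one: update the image encoder and the text encoder.
loss1_before = stage1_loss(params, first_samples)
train(params, ["enc_w", "txt_real", "txt_fake"], stage1_loss, first_samples)
loss1_after = stage1_loss(params, first_samples)

# Stage two: update only the decoder; the trained encoder stays fixed.
enc_w_after_stage1 = params["enc_w"]
loss2_before = stage2_loss(params, second_samples)
train(params, ["dec_w", "dec_b"], stage2_loss, second_samples)
loss2_after = stage2_loss(params, second_samples)

# The "image detection model" is the trained encoder plus the trained decoder.
image_detection_model = {k: params[k] for k in ("enc_w", "dec_w", "dec_b")}
```

The text encoder's parameters are dropped from the exported model, which mirrors the steps above: only the trained image encoder and the trained decoder are output.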

[0012] In a second aspect, embodiments of this application provide an image detection method, including:

[0013] Acquire the image to be detected;

[0014] The image to be detected is input into an image detection model, and the image encoder in the image detection model is used to encode the features of the image to be detected to obtain the third visual representation of the image to be detected; and the decoding head in the image detection model is used to decode the third visual representation to obtain the detection result of the image to be detected.

[0015] The image detection model is obtained according to the method described in the first aspect.
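The detection flow above (encode the image, then decode the resulting representation) can be sketched structurally as follows. The encoder and decoding head here are placeholders with made-up behavior, and the names `toy_image_encoder`, `toy_decoding_head`, and `ImageDetectionModel.detect` are hypothetical; a real model would use the trained components from the first aspect.

```python
import math
from dataclasses import dataclass
from typing import Callable, List

Image = List[List[float]]  # grayscale pixels in [0, 1]; a stand-in for a real image tensor


def toy_image_encoder(image: Image) -> List[float]:
    """Stand-in encoder: returns [mean intensity, max intensity]."""
    pixels = [p for row in image for p in row]
    return [sum(pixels) / len(pixels), max(pixels)]


def toy_decoding_head(representation: List[float]) -> float:
    """Stand-in classification head: logistic score with fixed, made-up weights."""
    weights, bias = [1.5, -0.5], 0.0
    z = sum(w * r for w, r in zip(weights, representation)) + bias
    return 1.0 / (1.0 + math.exp(-z))


@dataclass
class ImageDetectionModel:
    image_encoder: Callable[[Image], List[float]]
    decoding_head: Callable[[List[float]], float]

    def detect(self, image: Image) -> float:
        third_visual_representation = self.image_encoder(image)  # feature encoding
        return self.decoding_head(third_visual_representation)   # decoding


model = ImageDetectionModel(toy_image_encoder, toy_decoding_head)
result = model.detect([[0.0, 1.0], [1.0, 0.0]])
```

Keeping the encoder and decoding head as separate fields makes it straightforward to swap in a different head (for example, a segmentation head instead of a classification head) without touching the encoder.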

[0016] In a third aspect, an image detection model training device is provided, comprising:

[0017] An acquisition unit is configured to acquire a first training sample, the first training sample including at least one first image sample and at least one first label corresponding to the at least one first image sample, the first label being used to indicate whether the first image sample is an image generated based on artificial intelligence (AI);

[0018] The training unit is configured to input the first image sample into an image encoder to obtain a first visual representation of the first image sample, input the first label into a text encoder to obtain a text representation of the first label, and update the parameters of the image encoder and the text encoder based on the first visual representation and the text representation to obtain the trained image encoder.

[0019] The acquisition unit is further configured to acquire a second training sample, the second training sample including a second image sample and a second label of the second image sample, the second label being used to indicate whether the second image sample is real or a tampered area mask of the second image sample;

[0020] The training unit is also used to input the second image sample into the trained image encoder and decoder to obtain the detection result of the second image sample, and to update the parameters of the decoder according to the detection result and the second label to obtain the trained decoder.

[0021] The output unit is used to output the trained image encoder and the trained decoder as the image detection model.

[0022] In a fourth aspect, an image detection device is provided, comprising:

[0023] The acquisition unit is used to acquire the image to be detected;

[0024] An image detection model, configured to receive the image to be detected as input, perform feature encoding on the image to be detected using an image encoder in the image detection model to obtain a third visual representation of the image to be detected, and decode the third visual representation using a decoding head in the image detection model to obtain a detection result of the image to be detected.

[0025] The image detection model is obtained according to the method described in the first aspect.

[0026] In a fifth aspect, embodiments of this application provide an electronic device, including: a processor and a memory, the memory being used to store a computer program, and the processor being used to call and run the computer program stored in the memory to perform the method as described in the first aspect, or the method as described in the second aspect.

[0027] In a sixth aspect, embodiments of this application provide a computer-readable storage medium including instructions that, when executed on a computer, cause the computer to perform the method as described in the first aspect or the method as described in the second aspect.

[0028] In a seventh aspect, embodiments of this application provide a computer program product including computer program instructions that cause a computer to perform the method as described in the first aspect or the method as described in the second aspect.

[0029] In an eighth aspect, embodiments of this application provide a computer program that causes a computer to perform the method as described in the first aspect or the method as described in the second aspect.

[0030] In the above technical solution, a first image sample from the first training sample is input into an image encoder to obtain a first visual representation of the first image sample, a first label from the first training sample is input into a text encoder to obtain a text representation of the first label, and the image encoder and text encoder are trained based on the first visual representation, the text representation, and the first image sample. Then, a second image sample from the second training sample is input into the trained image encoder and the decoder to obtain a second visual representation of the second image sample, and the decoder is trained based on the second visual representation and the second label of the second image sample. Finally, the trained image encoder and decoder are output as an image detection model. In this embodiment, the training process based on the first training sample uses a dataset of AI-generated images to train the image encoder, while the training process based on the second training sample fine-tunes the decoder on an image tampering dataset to improve the decoder's image tampering detection capability. Through these two stages of model training, the image encoder can be pre-trained on the dataset of AI-generated images, and the decoder can then be fine-tuned using the pre-trained image encoder and a small AI-generated image tampering dataset. This addresses the shortage of AI-generated image tampering data, improves the few-shot learning capability of the image detection system, and enhances the generalization and accuracy of image detection.

Attached Figure Description

[0031] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this application;

[0032] Figure 2 is a schematic diagram of the system architecture according to an embodiment of this application;

[0033] Figure 3 is a schematic flowchart illustrating an image detection model training method according to an embodiment of this application;

[0034] Figure 4 is a schematic diagram of a training sample according to an embodiment of this application;

[0035] Figure 5 is a schematic flowchart illustrating another image detection model training method according to an embodiment of this application;

[0036] Figure 6 is a schematic diagram of a model training architecture according to an embodiment of this application;

[0037] Figure 7 is a schematic flowchart illustrating another image detection model training method according to an embodiment of this application;

[0038] Figure 8 is a schematic diagram of the structure of a classification decoding head according to an embodiment of this application;

[0039] Figure 9 is a schematic diagram of the structure of a segmentation decoding head according to an embodiment of this application;

[0040] Figure 10 is a schematic flowchart of an image detection method according to an embodiment of this application;

[0041] Figure 11 is a schematic diagram of the framework of an image detection system according to an embodiment of this application;

[0042] Figure 12 is a schematic flowchart of an image detection model training apparatus according to an embodiment of this application;

[0043] Figure 13 is a schematic block diagram of an image detection apparatus according to an embodiment of this application;

[0044] Figure 14 is a schematic block diagram of an electronic device according to an embodiment of this application.

Detailed Implementation

[0045] The technical solutions of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention.

[0046] It should be understood that in the embodiments of this application, "B corresponding to A" means that B is associated with A. In one implementation, B can be determined based on A. However, it should also be understood that determining B based on A does not mean determining B solely based on A; B can also be determined based on A and/or other information.

[0047] In the description of this application, unless otherwise stated, "at least one" means one or more, and "multiple" means two or more. Additionally, "and/or" describes the relationship between related objects, indicating that three relationships can exist. For example, A and/or B can mean: A alone, A and B simultaneously, or B alone, where A and B can be singular or plural. The character "/" generally indicates that the preceding and following related objects are in an "or" relationship. "At least one of the following" or similar expressions refer to any combination of these items, including any combination of single or plural items. For example, at least one of a, b, or c can mean: a, b, c, ab, ac, bc, or abc, where a, b, and c can be single or multiple.

[0048] It should also be understood that the descriptions of "first", "second", etc. appearing in the embodiments of this application are only for illustration and to distinguish the objects being described, and there is no order to them. They do not indicate any special limitation on the number of devices in the embodiments of this application, and cannot constitute any limitation on the embodiments of this application.

[0049] It should also be understood that specific features, structures, or characteristics relating to embodiments in the specification are included in at least one embodiment of this application. Furthermore, these specific features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.

[0050] Furthermore, the terms “comprising” and “having”, and any variations thereof, are intended to cover non-exclusive inclusion, such that a process, method, system, product, or server that includes a series of steps or units is not necessarily limited to those steps or units that are explicitly listed, but may include other steps or units that are not explicitly listed or that are inherent to such processes, methods, products, or devices.

[0051] Artificial Intelligence Generated Content (AIGC) is a novel content creation method. It utilizes artificial intelligence technology to automatically generate various forms of content. Mainstream AIGC methods can be divided into two categories: those based on Generative Adversarial Networks (GANs) and those based on Diffusion Models (DMs). GANs include methods such as GALIP and DF-GAN, while DMs include methods such as Stable Diffusion and Midjourney. AIGC can easily and realistically modify the content of real images or create entirely new images. For example, given a mask for the area to be modified, the Stable Diffusion model can easily and realistically modify the image content. Therefore, AIGC presents challenges to image tamper detection and localization.

[0052] Due to significant drawbacks in related technologies, no mature systems or products have yet been applied to novel image tampering detection tasks based on AIGC. Firstly, existing schemes suffer from high false alarm rates and low recognition accuracy, poor segmentation of tampered regions, and a strong dependence on semantic features. As AI-generated images become increasingly visually and semantically realistic, traditional detection and localization methods cannot effectively distinguish between real and tampered images. Furthermore, these schemes lack generalization ability and adapt poorly to new AI tampering techniques. They are typically trained on manually tampered datasets and traditional image restoration methods. However, with the rapid development of AI technology and continuous improvements in image tampering techniques, existing detection and localization schemes cannot generalize to new image tampering methods.

[0053] In view of this, embodiments of this application provide an image detection model training method, an image detection method, and an apparatus, which can improve the few-shot learning ability of the image detection system, as well as improve the generalization and accuracy of image detection.

[0054] Specifically, in the method for training an image detection model, a first training sample can be obtained, including at least one first image sample and at least one first label corresponding to the at least one first image sample. The first label is used to indicate whether the first image sample is an AI-generated image. Then, the first image sample is input into an image encoder to obtain a first visual representation of the first image sample, and the first label is input into a text encoder to obtain a text representation of the first label. Based on the first visual representation and the text representation, the parameters of the image encoder and the text encoder are updated to obtain a trained image encoder. Then, a second training sample is obtained, including a second image sample and a second label of the second image sample. The second label is used to indicate whether the second image sample is real or a tampered region mask of the second image sample. Then, the second image sample is input into the trained image encoder and the decoder to obtain the detection result of the second image sample. Based on the detection result and the second label, the parameters of the decoder are updated to obtain a trained decoder. Finally, the trained image encoder and the decoder are output as an image detection model.

[0055] Therefore, in this embodiment, the first image sample from the first training sample is input into the image encoder to obtain the first visual representation of the first image sample, and the first label from the first training sample is input into the text encoder to obtain the text representation of the first label. The image encoder and text encoder are then trained based on the first visual representation, the text representation, and the first image sample. Next, the second image sample from the second training sample is input into the trained image encoder and the decoder to obtain the second visual representation of the second image sample, and the decoder is trained based on the second visual representation and the second label of the second image sample. Finally, the trained image encoder and decoder are output as an image detection model. In this embodiment, the training process based on the first training sample uses a dataset of AI-generated images to train the image encoder, while the training process based on the second training sample fine-tunes the decoder on an image tampering dataset to improve the decoder's image tampering detection capability. Through these two stages of model training, the image encoder can be pre-trained on the dataset of AI-generated images, and the decoder can then be fine-tuned using the pre-trained image encoder and a small AI-generated image tampering dataset. This addresses the shortage of AI-generated image tampering data, improves the few-shot learning capability of the image detection system, and enhances the generalization and accuracy of image detection.

[0056] In image detection methods, an image to be detected is acquired and input into an image detection model. The image encoder in this model encodes features of the image to obtain a third visual representation of the image. The decoder in the image detection model decodes this third visual representation to obtain the detection result of the image. This image detection model is trained using the aforementioned image detection model training method.

[0057] This application's embodiments can be used in products such as social media content security management, judicial verification tools, and image tampering evidence collection. Specifically, in the digital age, images, as an important medium for information transmission, play a crucial role in various fields such as personal social interaction, media publishing, and legal evidence. However, with the rapid development of image editing software and AI-based image generation technology, image tampering has become increasingly easy and difficult to detect. This application's embodiments can be applied to products for image tampering detection, classification, and location evidence collection, such as automatically detecting and labeling images on social media platforms, identifying potentially tampered content, and alerting users to the possibility of AI tampering in images of popular content. Additionally, this application's embodiments can also be applied to news reporting to verify the authenticity of images, ensuring the authenticity of published image content, preventing the spread of fake news, and mitigating potential risks caused by image tampering.

[0058] Figure 1 is a schematic diagram illustrating an application scenario of an embodiment of this application.

[0059] As shown in Figure 1, this application scenario involves server 1 and terminal device 2. Terminal device 2 can communicate with server 1 via a communication network. Server 1 can be the backend server for terminal device 2.

[0060] For example, terminal device 2 can refer to a type of device that has rich human-computer interaction methods, internet access capabilities, typically runs various operating systems, and has strong processing capabilities. Terminal device 2 can be a smartphone, tablet computer, laptop computer, desktop computer, wearable device, in-vehicle device, etc., but is not limited to these. Optionally, in this embodiment of the application, terminal device 2 is equipped with an image processing program (Application, APP), or an APP with image processing functions.

[0061] Server 1 can be a standalone physical server, a server cluster or distributed system composed of multiple physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN (Content Delivery Network), and big data and artificial intelligence platforms. A server can also become a node in a blockchain.

[0062] There may be one or more servers. When there are multiple servers, at least two servers are used to provide different services, and/or at least two servers are used to provide the same service, such as providing the same service in a load-balanced manner. This application does not limit this.

[0063] Terminal devices and servers can be connected directly or indirectly via wired or wireless communication, and this application does not limit this. This application does not limit the number of servers or terminal devices. The solution provided in this application can be implemented independently by the terminal device, independently by the server, or jointly by the terminal device and the server; this application does not limit this.

[0064] It should be understood that Figure 1 is merely an illustrative example and does not specifically limit the application scenarios of the embodiments in this application. For example, Figure 1 shows one terminal device and one server as an example, but in practice, other numbers of terminal devices and servers may be included, and this application does not limit this.

[0065] Figure 2 is a schematic diagram of a system architecture involved in an embodiment of this application.

[0066] As shown in Figure 2, the system architecture may include user equipment 101, data acquisition equipment 102, training equipment 103, execution equipment 104, database 105, and content library 106.

[0067] The data acquisition device 102 is used to read training data from the content library 106 and store the read training data in the database 105. As an example, the training data may include a first training sample and a second training sample.

[0068] The first training sample includes at least one first image sample and at least one first label corresponding to the at least one first image sample. The first label is used to indicate whether the first image sample is an image generated based on artificial intelligence (AI).

[0069] The second training sample includes a second image sample and a second label for the second image sample. The second label is used to indicate whether the second image sample is real or a mask of the tampered area of the second image sample.

[0070] Training device 103 trains a machine learning model based on training data maintained in database 105. First, in this embodiment, the image encoder is trained using a first training sample. Specifically, the first image sample is input into the image encoder to obtain a first visual representation of the first image sample, and a first label is input into the text encoder to obtain a text representation of the first label; the parameters of the image encoder and text encoder are then updated based on the first visual representation and the text representation to obtain a trained image encoder. Next, the decoder is trained using a second training sample. Specifically, the second image sample is input into the trained image encoder and the decoder to obtain a detection result for the second image sample, and the parameters of the decoder are updated based on the detection result and the second label to obtain a trained decoder. The trained image encoder and decoder can then be output as an image detection model.

[0071] Optionally, referring to Figure 2, the execution device 104 is equipped with an I/O interface 107 for data interaction with external devices, such as user equipment 101. The computing module 109 in the execution device 104 uses a trained image detection model to output image detection results for the image to be detected input from user equipment 101. Specifically, the image to be detected can be input into the image detection model, where the image encoder encodes features of the image to obtain a third visual representation of the image; the decoding head in the image detection model decodes the third visual representation to obtain the detection result of the image.

[0072] Optionally, the execution device 104 can send the corresponding image detection results to the user device 101 via the I/O interface.

[0073] Optionally, the trained machine learning model can also be deployed on user device 101. The user device can input the image to be detected into the machine learning model, and the model outputs the corresponding image detection result.

[0074] The user equipment 101 may include mobile phones, computers, smart voice interaction devices, smart home appliances, vehicle terminals, aircraft or other terminal devices, and this application embodiment does not limit this.

[0075] The execution device 104 can be a server. For example, the server can be a rack server, blade server, tower server, or cabinet server, etc. The server can be a standalone server or a server cluster composed of multiple servers, and this embodiment of the application does not limit it in this way.

[0076] In this embodiment, the execution device 104 is connected to the user equipment 101 via a network. The network can be an intranet, the Internet, the Global System for Mobile Communication (GSM), Wideband Code Division Multiple Access (WCDMA), 4G network, 5G network, Bluetooth, Wi-Fi, voice communication network, or other wireless or wired networks.

[0077] It should be noted that Figure 2 is merely a schematic diagram of a system architecture provided in this application embodiment, and the positional relationships between the devices, components, modules, etc. shown in the figure do not constitute any limitation. In some embodiments, the data acquisition device 102, user device 101, training device 103, and execution device 104 may be the same device. The database 105 may be distributed across one server or multiple servers, and the content library 106 may be distributed across one server or multiple servers.

[0078] The technical solutions of the embodiments of this application will be described in detail below through some examples. The following embodiments can be combined with each other, and the same or similar concepts or processes may not be described again in some embodiments.

[0079] Figure 3 is a schematic flowchart illustrating an image detection model training method 300 according to an embodiment of this application. The method 300 can be executed by any electronic device with data processing capabilities; for example, the electronic device can be implemented as the server or terminal device in Figure 1, or the training device 103 in Figure 2. As shown in Figure 3, the image detection model training method 300 includes steps S310 to S350.

[0080] S310, Obtain a first training sample. The first training sample includes at least one first image sample and at least one first label corresponding to the first image sample. The first label is used to indicate whether the first image sample is an image generated based on artificial intelligence (AI).

[0081] Specifically, the first training samples include both AI-generated and non-AI-generated images. Optionally, the first training samples belong to a classification dataset of AI-generated images, which can be a large number of open-source datasets.

[0082] For example, AI-generated images can be AIGC-generated images, such as images modified from real images based on AIGC, or entirely new images created based on AIGC. As an example, given a mask for the region to be modified, the Stable Diffusion model can realistically modify the image content to obtain an AIGC-modified image. For example, in Figure 4, (a) is a real image, (b) is the mask image of the given region, and (c) is the generated image, i.e., the image modified based on AIGC. Non-AI-generated images may include, for example, real images.

[0083] In one possible implementation, at least one first image sample and at least one first label corresponding to at least one first image sample can be referred to as at least one image-text pair, such as M image-text pairs, where M is a positive integer.

[0084] For example, an image-text pair can be represented as $(X_i, Y_i)$, where $X_i$ represents the i-th first image sample and $Y_i$ represents the first label corresponding to the i-th first image sample. $(X_i, Y_i)$ represents a matched image-text pair, and $(X_i, Y_j)$ with $i \neq j$ represents a mismatched image-text pair. Here, i and j are positive integers.

[0085] For example, the first label may include {"real","fake"}, where "real" indicates that the first image sample is not an AI-generated image, and "fake" indicates that the first image sample is an AI-generated image.

[0086] For example, the first label may include {0, 1}, where "1" indicates that the first image sample is not an AI-generated image, and "0" indicates that the first image sample is an AI-generated image.

[0087] S320, input the first image sample into the image encoder to obtain the first visual representation of the first image sample, input the first label into the text encoder to obtain the text representation of the first label; and update the parameters of the image encoder and the text encoder according to the first visual representation, the text representation and the first label to obtain the trained image encoder.

[0088] Specifically, the first image sample is input into the image encoder for feature encoding to obtain the first visual representation of the first image sample. The first label is input into the text encoder for feature encoding to obtain the text representation of the first label. Then, based on the first visual representation, the text representation, and the first label of the first image sample, the parameters of the image encoder and the text encoder are updated respectively to obtain the trained image encoder. This training process pre-trains the image encoder based on the first training sample, corresponding to the first stage of the model training process.

[0089] For example, the image encoder is represented as $f_\theta$ and the text encoder is represented as $g_\theta$. Then, for an image-text pair $(X_i, Y_i)$, the first visual representation obtained by the image encoder can be expressed as $I_i = f_\theta(X_i)$, and the text representation obtained by the text encoder can be expressed as $T_i = g_\theta(Y_i)$, where $Y_i \in \{\text{"real"}, \text{"fake"}\}$.

[0090] In some embodiments, referring to Figure 5, inputting the first image sample into the image encoder to obtain the first visual representation of the first image sample can be achieved through the following steps S321 to S323.

[0091] S321, input the first image sample into the noise extraction module to extract the noise features of the first image sample.

[0092] Specifically, the first image sample is input into the noise extraction module for noise extraction to obtain the noise features of the first image sample. Using these noise features suppresses semantic information and emphasizes texture features, which improves the robustness and generalization ability of the model.

[0093] For example, the first image sample is a red-green-blue (RGB) image, and the noise extraction module can extract the noise features of the input RGB image.

[0094] Optionally, the noise extraction module can be an SRM (Spatial Rich Model) filter. Specifically, the SRM filter applies a high-pass filter to the input RGB image and uses the resulting residual image as the noise features.
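The high-pass residual extraction described above can be sketched as follows. This is a minimal illustration, assuming NumPy and a single 5×5 kernel of the kind commonly found in SRM-style filter banks; the application does not state which kernels are used, and the name `high_pass_residual` and the edge padding are hypothetical choices.

```python
import numpy as np

# One 5x5 high-pass kernel often used in SRM-style filter banks (an assumption;
# the application does not specify the kernels). Its coefficients sum to zero.
SRM_KERNEL_5X5 = np.array(
    [[-1,  2,  -2,  2, -1],
     [ 2, -6,   8, -6,  2],
     [-2,  8, -12,  8, -2],
     [ 2, -6,   8, -6,  2],
     [-1,  2,  -2,  2, -1]], dtype=np.float64) / 12.0


def high_pass_residual(image, kernel=SRM_KERNEL_5X5):
    """Filter each channel of an (H, W, C) image and return the residual map."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    height, width = image.shape[:2]
    # Replicate border pixels so the output keeps the input's spatial size.
    padded = np.pad(image.astype(np.float64), ((ph, ph), (pw, pw), (0, 0)), mode="edge")
    residual = np.zeros(image.shape, dtype=np.float64)
    # Accumulate weighted, shifted copies of the image (cross-correlation).
    for dy in range(kh):
        for dx in range(kw):
            residual += kernel[dy, dx] * padded[dy:dy + height, dx:dx + width, :]
    return residual


rng = np.random.default_rng(0)
flat = np.full((8, 8, 3), 128.0)            # constant image: no texture
textured = rng.uniform(0, 255, (8, 8, 3))   # random image: strong texture
flat_noise = high_pass_residual(flat)
textured_noise = high_pass_residual(textured)
```

Because the kernel coefficients sum to zero, smooth content such as the constant image yields a residual of approximately zero, while fine texture survives the filter. This matches the stated aim of suppressing semantic content.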

[0095] Optionally, the noise extraction module can be a Bayer noise extractor or a deep noise extractor, etc. The parameters of the deep noise extractor can be optimized through training.

[0096] It is understood that different noise extractors have different perceptions of the pixels and textures of an image, and the embodiments of this application can replace different types of noise extractors according to different needs.

[0097] S322, the noise features are concatenated with the image features of the first image sample to obtain the multimodal features of the first image sample.

[0098] Specifically, the noise features obtained in step S321 can be concatenated with the original image features of the first image sample to obtain the multimodal features of the first image sample.

[0099] Optionally, the original image features of the first image sample can be RGB image features. Optionally, the multimodal features can be represented as (RGB, Noise).

[0100] S323, input the multimodal features into the image encoder to obtain the first visual representation.

[0101] Specifically, the multimodal features are input into the image encoder for feature encoding to obtain the first visual representation of the first image sample. For example, the image encoder may include, but is not limited to, a Residual Network (ResNet) model, a Vision Transformer (ViT) model, etc.

[0102] Optionally, in step S323, before inputting the multimodal features into the image encoder to obtain the first visual representation, the multimodal features can also be input into a convolutional layer (conv) to obtain enhanced multimodal features. These enhanced multimodal features can also be referred to as dimensionality-reduced multimodal features.

[0103] For example, the concatenated features from step S322 can be input into a convolutional layer to reduce the dimensionality of the input features, thereby reducing the number of parameters while retaining the most important information. For example, this convolutional layer can be implemented with a 1×1 convolutional kernel, which can operate on the channel dimension, effectively integrating cross-channel information and increasing the non-linearity of the features.
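The concatenation of step S322 and the 1×1 convolution described above can be sketched as follows. This is an illustration only, assuming NumPy and a channels-first (B, C, H, W) layout. The function names, the random stand-in weights, and the choice of 3 output channels are assumptions, not details fixed by this application.

```python
import numpy as np


def concat_rgb_noise(rgb, noise):
    """Concatenate (B, 3, H, W) RGB features and noise features along channels."""
    return np.concatenate([rgb, noise], axis=1)


def conv1x1(features, weight, bias):
    """Apply a 1x1 convolution: a per-pixel linear map over the channel axis.

    features: (B, C_in, H, W), weight: (C_out, C_in), bias: (C_out,)
    """
    out = np.einsum("oc,bchw->bohw", weight, features)
    return out + bias[None, :, None, None]


rng = np.random.default_rng(0)
rgb = rng.normal(size=(2, 3, 224, 224))
noise = rng.normal(size=(2, 3, 224, 224))    # stand-in for the noise features
multimodal = concat_rgb_noise(rgb, noise)    # (2, 6, 224, 224)

# Hypothetical weights reducing 6 channels to 3; in training these are learned.
weight = rng.normal(size=(3, 6))
bias = np.zeros(3)
enhanced = conv1x1(multimodal, weight, bias)  # (2, 3, 224, 224)
```

A 1×1 kernel mixes information only across channels at each pixel, so the spatial resolution is unchanged while the channel count drops.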

[0104] In some embodiments, the overall process of steps S321 to S323 described above can be expressed as the following formula (1):

[0105] $F = C\big(H_{SRM}(I_{RGB}),\ I_{RGB}\big) \quad (1)$

[0106] Where $F$ represents the enhanced multimodal input features, $C$ represents vector concatenation, $H_{SRM}$ represents the SRM high-pass filter, and $I_{RGB}$ represents the RGB image features, i.e., the original image features of the first image sample.

[0107] For example, referring to Figure 6, the RGB image 601 is input to the noise extraction module 602. The noise features obtained are concatenated with the features of the RGB image 601 to obtain multimodal features. The multimodal features are then input to the convolutional layer 603 for dimensionality reduction, reducing the number of parameters while retaining the most important information, resulting in enhanced multimodal features. The enhanced multimodal features are input to the image encoder 604 to obtain the first visual feature, denoted as I. The first label is input to the text encoder 605 to obtain the text feature, denoted as T.

[0108] Therefore, by introducing noise features into the features of the input image encoder, the embodiments of this application can effectively suppress the semantic information of the image, which is conducive to fully mining the pixel and texture features of the AI-generated image, thereby improving the effectiveness of image detection and localization.

[0109] In some embodiments, with continued reference to Figure 5, updating the parameters of the image encoder and the text encoder based on the first visual representation, the text representation and the first label to obtain the trained image encoder can be achieved through the following steps S324 to S326.

[0110] S324, based on the first visual representation and the text representation, obtain the similarity between at least one first image sample and at least one first label.

[0111] Optionally, the first visual features of at least one first image sample can be linearly projected and normalized, as shown in the following formula (2):

[0112] $I_e = \text{L2\_norm}(I \cdot W_I) \quad (2)$

[0113] Simultaneously, the text features of at least one first label are linearly projected and normalized, as shown in the following formula (3):

[0114] $T_e = \text{L2\_norm}(T \cdot W_T) \quad (3)$

[0115] Where L2_norm denotes L2 normalization, $W_I$ and $W_T$ are learnable projection matrices, $I$ is the first visual representation, and $T$ is the text representation.
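Formulas (2) and (3) can be illustrated with a short NumPy sketch. The feature sizes (512, 256) and the shared embedding size (128) are assumptions chosen for the example, and the projection matrices are random stand-ins for learned parameters.

```python
import numpy as np


def project_and_normalize(features, projection):
    """Linearly project (M, D) features with a (D, E) matrix, then L2-normalize rows."""
    projected = features @ projection
    norms = np.linalg.norm(projected, axis=1, keepdims=True)
    return projected / np.maximum(norms, 1e-12)  # guard against division by zero


rng = np.random.default_rng(0)
I = rng.normal(size=(4, 512))      # first visual representations (sizes assumed)
T = rng.normal(size=(4, 256))      # text representations (sizes assumed)
W_I = rng.normal(size=(512, 128))  # learnable in training; random here
W_T = rng.normal(size=(256, 128))
I_e = project_and_normalize(I, W_I)
T_e = project_and_normalize(T, W_T)
```

After this step both modalities live in the same embedding space with unit-length rows, so a plain dot product between them equals their cosine similarity.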

[0116] Specifically, based on the linear projection and normalized visual and text features, at least one similarity between at least one first image sample and at least one first label is calculated.

[0117] Optionally, at least one similarity may include the similarity between the first visual representation and the text representation of a paired image-text pair, and the similarity between the first visual representation and the text representation of an unpaired image-text pair. For example, the similarity of the matched image-text pair $(X_i, Y_i)$ and the similarity of the mismatched image-text pair $(X_i, Y_j)$, $i \neq j$, can be calculated.

[0118] Optionally, at least one similarity includes at least one first similarity between at least one first image sample and at least one first label, and at least one second similarity between at least one first label and at least one first image sample.

[0119] For example, the first similarity can be the similarity between the first visual features of the first image sample and the text features of the first label, which can be called image-text similarity. For example, it can be shown in the following formula (4):

[0120] $\text{logits1} = \text{cosine\_similarity}(I_e, T_e) \cdot t \quad (4)$

[0121] The second similarity can be the similarity between the text features of the first label and the first visual features of the first image sample, which can be called text-image similarity. For example, it can be shown in the following formula (5):

[0122] $\text{logits2} = \text{cosine\_similarity}(T_e, I_e) \cdot t \quad (5)$

[0123] Where logits1 is the first similarity, logits2 is the second similarity, cosine_similarity is the cosine similarity operation, and t is a hyperparameter.
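A small worked example of formulas (4) and (5), assuming NumPy and features that have already been L2-normalized. The helper names and the value t = 10 are illustrative assumptions.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row of x to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def similarity_logits(I_e, T_e, t):
    """Return (logits1, logits2): image-text and text-image scaled cosine similarities.

    Both inputs must already have unit-length rows, so that dot products
    equal cosine similarities.
    """
    logits1 = (I_e @ T_e.T) * t
    logits2 = (T_e @ I_e.T) * t
    return logits1, logits2


I_e = l2_normalize(np.array([[1.0, 0.0], [0.0, 2.0]]))
T_e = l2_normalize(np.array([[3.0, 0.0], [1.0, 1.0]]))
logits1, logits2 = similarity_logits(I_e, T_e, t=10.0)
```

Entry (i, j) of `logits1` scores the i-th image against the j-th label, and `logits2` is its transpose, scoring each label against each image.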

[0124] For example, referring to Figure 6, given two RGB images whose first visual features are $I_1$ and $I_2$ and whose first-label text features are $T_1$ and $T_2$, the cosine similarity between the first visual features and the text features is calculated, yielding the logits $\begin{bmatrix} I_1 \cdot T_1 & I_1 \cdot T_2 \\ I_2 \cdot T_1 & I_2 \cdot T_2 \end{bmatrix}$, whose dimension is 2×2. Here, $I_i \cdot T_j$ (i = 1, 2; j = 1, 2) represents the similarity between the i-th first image sample and the j-th first label, i.e., the image-text similarity, indicating whether each image-text pair matches. Specifically, when i = j, the image and text match; when i ≠ j, the image and text do not match.

[0125] Optionally, the cosine similarity between the text features and the first visual features can also be calculated, and the resulting logits are $\begin{bmatrix} T_1 \cdot I_1 & T_1 \cdot I_2 \\ T_2 \cdot I_1 & T_2 \cdot I_2 \end{bmatrix}$, whose dimension is also 2×2. Here, $T_i \cdot I_j$ (i = 1, 2; j = 1, 2) represents the similarity between the i-th first label and the j-th first image sample, i.e., the text-image similarity, indicating whether each text-image pair matches. Specifically, when i = j, the text and image match; when i ≠ j, the text and image do not match.

[0126] S325, a first loss function is obtained based on at least one similarity between at least one first image sample and at least one first label.

[0127] Specifically, a first loss function can be calculated using cross-entropy loss based on at least one similarity between at least one first image sample and at least one first label.

[0128] Optionally, an image contrast loss function is obtained based on at least one first similarity between the at least one first image sample and the at least one first label, and a text contrast loss function is obtained based on at least one second similarity between the at least one first label and the at least one first image sample. Then, a first loss function is obtained based on the image contrast loss function and the text contrast loss function.

[0129] For example, the image contrast loss function can be expressed as the following formula (6):

[0130] $Loss_I = \Phi(\text{logits1}, \text{labels}, \text{axis}=0) \quad (6)$

[0131] Where $Loss_I$ represents the image contrast loss, $\Phi$ represents the cross-entropy loss operation, labels indicates whether each image and text match, and axis=0 indicates that the calculation is performed by row.

[0132] For example, for the 2×2 image-text logits, the cross-entropy loss between the logits and the corresponding labels in each row can be calculated. When i = j, the image and text match, and the corresponding label is 1; when i ≠ j, the image and text do not match, and the corresponding label is 0.

[0133] Therefore, by optimizing the image contrast loss function, we can maximize the cosine similarity between the first visual features and the text features of matched image-text pairs, while minimizing the cosine similarity between the first visual features and the text features of mismatched image-text pairs.

[0134] For example, the text contrast loss function can be expressed as the following formula (7):

[0135] $Loss_T = \Phi(\text{logits2}, \text{labels}, \text{axis}=1) \quad (7)$

[0136] Where $Loss_T$ represents the text contrast loss, $\Phi$ represents the cross-entropy loss operation, labels indicates whether each text and image match, and axis=1 indicates that the calculation is performed by column.

[0137] For example, for the 2×2 text-image logits, the cross-entropy loss between the logits and the corresponding labels in each column can be calculated. Specifically, when i = j, the text and image match, and the corresponding label is 1; when i ≠ j, the text and image do not match, and the corresponding label is 0.

[0138] Therefore, by optimizing the text contrast loss function, we can maximize the cosine similarity between the text features of the matched image-text pairs and the first visual features, while minimizing the cosine similarity between the text features of the mismatched image-text pairs and the first visual features.

[0139] The first loss function, Loss, can be the sum of the image contrast loss function and the text contrast loss function. For example, it can be expressed as the following formula (8):

[0140] $Loss = Loss_I + Loss_T \quad (8)$
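The first loss function of formulas (6) to (8) can be sketched as follows. This sketch assumes a softmax cross-entropy in which each image is scored against all labels in the batch and each label against all images, with match labels of 1 for i = j and 0 otherwise. The exact axis convention of Φ is not spelled out beyond formulas (6) and (7), so the reduction used here is an assumption.

```python
import numpy as np


def log_softmax(logits, axis):
    shifted = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def cross_entropy(logits, labels):
    """Mean cross-entropy; each row of `logits` is scored against one-hot `labels`."""
    return float(-(labels * log_softmax(logits, axis=1)).sum(axis=1).mean())


def first_loss(I_e, T_e, t):
    """Loss = Loss_I + Loss_T for M matched pairs with L2-normalized features."""
    labels = np.eye(I_e.shape[0])   # 1 where i == j (match), 0 where i != j
    logits1 = (I_e @ T_e.T) * t     # image-text similarity
    logits2 = (T_e @ I_e.T) * t     # text-image similarity
    loss_i = cross_entropy(logits1, labels)
    loss_t = cross_entropy(logits2, labels)
    return loss_i + loss_t


aligned_I = np.array([[1.0, 0.0], [0.0, 1.0]])
aligned_T = np.array([[1.0, 0.0], [0.0, 1.0]])
swapped_T = aligned_T[::-1]          # every pair deliberately mismatched
loss_aligned = first_loss(aligned_I, aligned_T, t=10.0)
loss_swapped = first_loss(aligned_I, swapped_T, t=10.0)
```

The loss is close to zero when each image embedding coincides with its own label embedding and grows large when the pairs are swapped, which is the behaviour the optimization relies on.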

[0141] Therefore, by introducing a contrastive learning paradigm into the pre-training stage of the image encoder, the embodiments of this application can help improve the robustness and generalization ability of the model.

[0142] S326, Update the parameters of the image encoder and text encoder according to the first loss function to obtain the trained image encoder.

[0143] Specifically, the parameters of the image encoder and text encoder can be updated by optimizing the first loss function, and the parameter updates can be stopped when the model training stopping condition is met. For example, the model training stopping condition could be reaching the maximum number of training epochs, or the loss function fluctuating within a certain range and no longer decreasing significantly; there are no specific limitations. At this point, the trained image encoder can be obtained.
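The two example stopping conditions can be expressed as a small check. This is one possible reading only: the `window` and `tolerance` parameters are hypothetical, since the application does not quantify "fluctuating within a certain range".

```python
def should_stop(loss_history, max_epochs, window=5, tolerance=1e-3):
    """Return True when training should stop.

    Stops when `max_epochs` epochs have run, or when the loss over the last
    `window` epochs stays within `tolerance`, i.e. it is no longer decreasing
    significantly. `window` and `tolerance` are hypothetical knobs.
    """
    if len(loss_history) >= max_epochs:
        return True
    if len(loss_history) >= window:
        recent = loss_history[-window:]
        return max(recent) - min(recent) <= tolerance
    return False


still_improving = should_stop([1.0, 0.8, 0.6, 0.5, 0.4], max_epochs=100)
plateaued = should_stop([0.9, 0.3001, 0.3000, 0.3002, 0.3001, 0.3000], max_epochs=100)
hit_max = should_stop([1.0, 0.5], max_epochs=2)
```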

[0144] Optionally, when using a deep noise extractor as a noise extraction module, its parameters can be updated during the model training phase.

[0145] It should be noted that in this embodiment, the text encoder is used only during the model training phase, while only the image encoder is used during the model inference phase.

[0146] Steps S310 and S320 above constitute the first stage of model training, which pre-trains the image encoder using the first training samples. This allows for the use of a large number of open-source classification datasets of AI-generated images for pre-training. After the first stage of learning is completed, the image encoder can be frozen for the second stage of model training, i.e., steps S330 and S340 are executed below.

[0147] S330, Obtain the second training sample. The second training sample includes the second image sample and the second label of the second image sample. The second label is used to indicate whether the second image sample is real, or to represent the tampered region mask of the second image sample.

[0148] Specifically, the second training samples include real images or manipulated images. Optionally, the second training samples belong to an AI-generated image manipulation dataset. This AI-generated image manipulation dataset may include a small number of samples.

[0149] For example, the AI-generated image manipulation data can be images modified from real images based on AIGC. As an example, given a region to be modified, the Stable Diffusion model can realistically modify the image content to obtain an AIGC-modified image.

[0150] One possible scenario is that the second label indicates whether the second image sample is authentic, that is, whether the second image sample has been tampered with. The second label can include {0,1}, where "0" indicates that the image is not a real image, that is, the image has been tampered with, and "1" indicates that the image is a real image, that is, the image has not been tampered with.

[0151] Another possible scenario is that the second label represents a mask of the tampered region in the second image sample, that is, it represents the tampered area of the second image sample. For example, taking image (c) in Figure 4 as the second image sample, its corresponding second label is the region mask in image (b), that is, the region covered by the mask has been tampered with.

[0152] S340, input the second image sample into the trained image encoder and decoder to obtain the detection result of the second image sample, and update the parameters of the decoder based on the detection result and the second label to obtain the trained decoder.

[0153] Specifically, the second image sample can be input into a trained image encoder to obtain the second visual features of the second image sample. These second visual features are then input into a decoding head to obtain the detection result of the second image sample. Based on this detection result and the second label, the parameters of the decoding head are updated to obtain a trained decoding head. This decoding head can also be called a detection decoding head, used to decode the encoded features obtained by the image encoder to obtain the image detection result.

[0154] It should be understood that during the model training phase of S340, a frozen image encoder is used to encode the features of the second image sample to obtain an image representation. This image representation is then input into the decoding head, and the decoding head is fine-tuned based on the output detection results and image labels. Here, the second image sample can be a small number of image samples from an AI-generated image tampering dataset. Therefore, this embodiment utilizes the image encoder trained in the first stage to perform a second-stage fine-tuning of the decoding head, enabling the training of the image detection model using a small number of samples. This improves the system's few-shot learning capability and enhances the generalization and accuracy of image detection.
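The second-stage idea of a frozen encoder and a trainable decoding head can be illustrated with a toy NumPy sketch. Everything here is a stand-in chosen so the example runs: the "encoder" is a fixed random linear map with tanh, the "decoding head" is a logistic-regression layer, and the data are synthetic. None of these is the architecture of this application.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the frozen, pre-trained image encoder: a fixed linear map + tanh.
encoder_weight = rng.normal(size=(16, 8))
encoder_snapshot = encoder_weight.copy()


def encode(images):
    return np.tanh(images @ encoder_weight)   # never updated in stage two


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def bce(prob, target):
    eps = 1e-12
    return float(-np.mean(target * np.log(prob + eps) + (1 - target) * np.log(1 - prob + eps)))


# A small toy dataset: 32 flattened samples with synthetic 0/1 labels.
images = rng.normal(size=(32, 16))
labels = (images[:, 0] > 0).astype(np.float64)

head_w = np.zeros(8)   # decoding-head parameters: the only ones trained
head_b = 0.0
features = encode(images)                      # encoder output, computed once
initial_loss = bce(sigmoid(features @ head_w + head_b), labels)

for _ in range(200):
    prob = sigmoid(features @ head_w + head_b)
    grad = prob - labels                       # d(BCE)/d(logit)
    head_w -= 0.5 * features.T @ grad / len(labels)
    head_b -= 0.5 * grad.mean()

final_loss = bce(sigmoid(features @ head_w + head_b), labels)
```

Only `head_w` and `head_b` change during the loop. Because the encoder is fixed, its features can be computed once and reused, which keeps fine-tuning on a small dataset cheap.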

[0155] In some embodiments, referring to Figure 7, step S340 can be implemented according to the following steps S341 to S344.

[0156] S341, input the second image sample into the trained image encoder to obtain the second visual representation of the second image sample.

[0157] Specifically, the second image sample can be input into a trained image encoder for feature encoding to obtain the image representation of the second image sample, i.e., the second visual representation. For example, multimodal feature extraction can be performed on the second image sample, such as inputting it into an SRM filter to extract noise features, concatenating the noise features with the original image features, and then inputting the result into the convolutional layer for dimensionality reduction to obtain enhanced multimodal features. These enhanced multimodal features are then input into the image encoder to obtain the second visual features of the second image sample.

[0158] For example, taking the image encoder as a ResNet encoder, the ResNet encoder outputs tensors at four scales. For instance, if the size of the input second image sample is adjusted to 224×224, that is, the size of the obtained multimodal feature input is (batch_size, 3, 224, 224), after the pre-trained model of the ResNet50 encoder, four feature maps at different scales can be obtained (an example of the second visual feature), namely C2: (batch_size, 64, 56, 56), C3: (batch_size, 128, 28, 28), C4: (batch_size, 256, 14, 14), and C5: (batch_size, 512, 7, 7).
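The spatial sizes listed above follow from dividing the input size by the stride of each stage. A brief sketch, assuming the usual ResNet stage strides of 4, 8, 16 and 32 and taking the channel counts from the text; the function name is hypothetical.

```python
def pyramid_shapes(batch_size, input_size, channels=(64, 128, 256, 512), strides=(4, 8, 16, 32)):
    """Return the (B, C, H, W) shape of each stage output for a square input.

    The strides 4, 8, 16 and 32 are the usual ResNet stage strides; the channel
    counts are the ones listed in the text above.
    """
    return {
        f"C{level}": (batch_size, ch, input_size // stride, input_size // stride)
        for level, (ch, stride) in enumerate(zip(channels, strides), start=2)
    }


shapes = pyramid_shapes(batch_size=8, input_size=224)
```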

[0159] S342, input the second visual representation into the classification decoding head to obtain a first prediction result, the first prediction result including the probability that the second image sample has been tampered with.

[0160] Specifically, the second visual representation can be input into the classification decoding head for decoding to achieve image classification prediction, obtaining the first prediction result, which is the probability that the second image sample has been tampered with. The classification decoding head can be applied to the task of classifying whether an image has been tampered with.

[0161] For example, the classification decoding head may include a fully connected layer, also known as a dense layer, to predict the probability that the input second image sample has been tampered with.

[0162] As one possible approach, an average pooling layer (AdaptiveAvgPool2d) can be included before the fully connected (FC) layer. Specifically, the first feature map at the smallest scale in the second visual representation can be input into the average pooling layer to obtain a scaled-down second feature map. This second feature map is then input into the fully connected layer to map it onto a two-dimensional feature space indicating whether the image has been tampered with, thus obtaining the first prediction result.

[0163] For example, referring to Figure 8, the C5: (batch_size, 512, 7, 7) feature map (an example of the first feature map) can be input into the average pooling layer 801 to reduce the size of the C5 feature map to (512×1×1), resulting in the second feature map. Then, the second feature map is input to FC1 802, which has 512 input features and 1024 output features. Next, the ReLU activation function 803 is used to introduce non-linearity. Finally, the ReLU output features are input to FC2 804, which has 1024 input features and 2 output features, corresponding to the two categories in the binary classification task. For example, the tensor shape output by the classification decoding head can be (batch_size, 2). Therefore, this embodiment of the application can map the second feature map to a two-dimensional feature space indicating whether the image has been tampered with, obtaining the classification prediction result of the second image sample.
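The layer sequence described above (average pooling, FC1, ReLU, FC2) can be sketched in NumPy as follows. The weights are random stand-ins for trained parameters, and the final softmax that turns the two outputs into probabilities is an assumption, since the text only specifies the (batch_size, 2) output shape.

```python
import numpy as np


def classification_head(c5, w1, b1, w2, b2):
    """Global average pool -> FC1 -> ReLU -> FC2, returning (B, 2) logits."""
    pooled = c5.mean(axis=(2, 3))              # (B, 512, 7, 7) -> (B, 512)
    hidden = np.maximum(pooled @ w1 + b1, 0)   # FC1 + ReLU -> (B, 1024)
    return hidden @ w2 + b2                    # FC2 -> (B, 2)


def softmax(logits):
    shifted = logits - logits.max(axis=1, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=1, keepdims=True)


rng = np.random.default_rng(0)
c5 = rng.normal(size=(4, 512, 7, 7))
# Randomly initialized weights stand in for trained parameters.
w1, b1 = rng.normal(scale=0.02, size=(512, 1024)), np.zeros(1024)
w2, b2 = rng.normal(scale=0.02, size=(1024, 2)), np.zeros(2)
logits = classification_head(c5, w1, b1, w2, b2)
probs = softmax(logits)   # per-class probabilities for each image
```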

[0164] S343, Based on the first prediction result and the second label, a second loss function is obtained; wherein, the second label is used to indicate whether the second image sample is real.

[0165] Specifically, a second loss function can be calculated using cross-entropy loss based on the first prediction result and the second label. For example, the second label may include a {0,1} label indicating whether the image has been tampered with.

[0166] S344, update the parameters of the classification decoder head according to the second loss function to obtain the trained classification decoder head.

[0167] Specifically, the parameters of the classification decoder head can be updated by optimizing the second loss function, and the parameter updates can be stopped when the model training stopping condition is met. For example, the model training stopping condition could be reaching the maximum number of training epochs, or the loss function fluctuating within a certain range and no longer decreasing significantly; there are no specific limitations. At this point, the trained classification decoder head can be obtained.

[0168] In some embodiments, with continued reference to Figure 8, step S340 can be implemented according to the following steps S345 to S347.

[0169] S345, input the second visual representation into the segmentation decoding head to obtain the second prediction result, which includes the probability mask map of the second image sample being tampered with.

[0170] Specifically, the second visual representation can be input into the segmentation decoding head for decoding to achieve image segmentation prediction, obtaining a second prediction result, namely, a probability mask map of the second image sample being tampered with. The segmentation decoding head can be applied to image tampering region segmentation and localization tasks. The second visual representation can be obtained according to step S341.

[0171] For example, the segmentation decoding head may include a UPerNet decoder, a SegFormer structure, or other lightweight decoding heads to predict the tampered probability mask map of the input second image sample.

[0172] As one possible approach, feature maps at various scales in the second visual representation can be fused, and the fused feature maps can be upsampled to the original image size to obtain the second prediction result.

[0173] For example, referring to Figure 9, the feature maps of different sizes C2: (batch_size, 64, 56, 56), C3: (batch_size, 128, 28, 28), C4: (batch_size, 256, 14, 14), and C5: (batch_size, 512, 7, 7) can be input into the fusion module 901 for fusion. The fused feature map is then input into the upsampling module 902, which progressively upsamples it to the original image size. The final output tensor has the same spatial size as the input image, for example, (batch_size, 1, 224, 224). The corresponding output is the probability mask map indicating whether the second image sample has been tampered with, i.e., the prediction result of the tampered region segmentation of the second image sample.
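A deliberately simplified sketch of fusing the four feature maps and upsampling to the input size, assuming NumPy. It uses nearest-neighbour upsampling, channel concatenation, and a single 1×1 projection with random stand-in weights, and it upsamples in one step rather than progressively. The actual fusion module 901 and upsampling module 902 are not specified at this level of detail, so all of these choices are assumptions.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (B, C, H, W) tensor by an integer factor."""
    return x.repeat(factor, axis=2).repeat(factor, axis=3)


def segmentation_head(features, weight, bias, image_size):
    """Fuse multi-scale maps and return a (B, 1, image_size, image_size) probability mask."""
    target = features[0].shape[2]                               # largest map, e.g. 56
    resized = [upsample_nearest(f, target // f.shape[2]) for f in features]
    fused = np.concatenate(resized, axis=1)                     # (B, 960, 56, 56)
    logits = np.einsum("c,bchw->bhw", weight, fused)[:, None] + bias
    logits = upsample_nearest(logits, image_size // target)     # (B, 1, 224, 224)
    return 1.0 / (1.0 + np.exp(-logits))                        # sigmoid


rng = np.random.default_rng(0)
shapes = [(1, 64, 56, 56), (1, 128, 28, 28), (1, 256, 14, 14), (1, 512, 7, 7)]
features = [rng.normal(size=s) for s in shapes]
weight = rng.normal(scale=0.01, size=64 + 128 + 256 + 512)      # untrained stand-in
mask = segmentation_head(features, weight, bias=0.0, image_size=224)
```

Each output value lies between 0 and 1 and can be read as the predicted probability that the corresponding pixel belongs to a tampered region.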

[0174] S346, Based on the second prediction result and the second label, the third loss function is obtained; wherein, the second label is used to represent the tampered region mask of the second image sample.

[0175] Specifically, a third loss function can be calculated using cross-entropy loss based on the second prediction result and the second label. For example, the second label may include a mask of the tampered regions of the image.
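A per-pixel cross-entropy between a predicted probability mask and a tampered region mask can be sketched as below. It assumes a binary mask in which 1 marks tampered pixels and a mean reduction over all pixels; both are illustrative choices rather than details stated in this application.

```python
import numpy as np


def mask_cross_entropy(prob_mask, target_mask, eps=1e-7):
    """Mean per-pixel binary cross-entropy between a probability mask and a 0/1 mask."""
    p = np.clip(prob_mask, eps, 1.0 - eps)   # avoid log(0)
    return float(-np.mean(target_mask * np.log(p) + (1.0 - target_mask) * np.log(1.0 - p)))


target = np.zeros((1, 1, 4, 4))
target[..., 1:3, 1:3] = 1.0                   # tampered region in the centre
good_prediction = np.where(target == 1.0, 0.9, 0.1)
poor_prediction = np.full_like(target, 0.5)
good_loss = mask_cross_entropy(good_prediction, target)
poor_loss = mask_cross_entropy(poor_prediction, target)
```

A prediction that assigns high probability inside the mask and low probability outside it receives a lower loss than an uninformative prediction of 0.5 everywhere.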

[0176] S347, Update the parameters of the segmentation decoder head according to the third loss function to obtain the trained segmentation decoder head.

[0177] Specifically, the parameters of the segmentation decoder head can be updated by optimizing the third loss function, and the parameter updates can be stopped when the model training stopping condition is met. For example, the model training stopping condition could be reaching the maximum number of training epochs, or the loss function fluctuating within a certain range and no longer decreasing significantly; there are no specific limitations. At this point, the trained segmentation decoder head can be obtained.

[0178] In some embodiments, the decoding head may include the above-described classification decoding head and segmentation decoding head, thereby enabling the acquisition of the binary classification probability of image tampering and the distribution map of tampered areas.

[0179] S350, output the trained image encoder and the trained decoder as an image detection model.

[0180] Specifically, steps S330 and S340 above implement the second stage of model training, in which the decoding head is fine-tuned on the second training sample, that is, on a small AI-generated image tampering dataset. Therefore, after the first and second stages of model training, the trained image encoder and the trained decoding head can be output as an image detection model.

[0181] Therefore, in this embodiment, the first image sample in the first training sample is input into the image encoder to obtain the first visual representation of the first image sample, and the first label in the first training sample is input into the text encoder to obtain the text representation of the first label. The image encoder and the text encoder are then trained based on the first visual representation, the text representation, and the first label. Next, the second image sample in the second training sample is input into the trained image encoder and the decoder to obtain the detection result of the second image sample, and the decoder is trained based on the detection result and the second label of the second image sample. Finally, the trained image encoder and the trained decoder are output as an image detection model. In this embodiment, the training process based on the first training sample uses a dataset of AI-generated images to train the image encoder, while the training process based on the second training sample fine-tunes the decoder on an image tampering dataset to improve the decoder's image tampering detection capability. Through these two stages of model training, the image encoder can be pre-trained on the dataset of AI-generated images, and the decoder can then be fine-tuned using the pre-trained image encoder and a small AI-generated image tampering dataset. This addresses the shortage of AI-generated image tampering data, improves the few-shot learning capability of the image detection system, and enhances the generalization and accuracy of image detection.

[0182] Figure 10 is a schematic flowchart of an image detection method 1000 according to an embodiment of this application. The method 1000 can be executed by any electronic device with data processing capabilities; for example, the electronic device can be implemented as the server or terminal device in Figure 1, or as the execution device 104 in Figure 2, and this application is not limited thereto. As shown in Figure 10, the image detection method 1000 includes steps S1010 to S1020.

[0183] S1010, acquire the image to be detected.

[0184] For example, the image to be detected can include a real image or a manipulated image. The manipulated image can be an AI-generated image, such as an image modified from a real image based on AIGC, without limitation.

[0185] S1020, the image to be detected is input into the image detection model, and the image encoder in the image detection model is used to encode the features of the image to be detected to obtain the third visual representation of the image to be detected; the decoding head in the image detection model is used to decode the third visual representation to obtain the detection result of the image to be detected.

[0186] The image detection model is obtained according to the image detection model training method described above. Specifically, the model training method is described in the relevant section above. After model training is complete, all model parameters can be frozen, and the model can be deployed so that it can be used in subsequent model inference processes.

[0187] Specifically, the process of an image encoder encoding the image to be detected is similar to the process of an image encoder encoding the first image sample or the second image sample, as described above.

[0188] In some embodiments, the decoding head includes at least one of a classification decoding head and a segmentation decoding head. Accordingly, in step S1020 above, decoding the third visual representation using the decoding head in the image detection model to obtain the detection result of the image to be detected includes at least one of the following:

[0189] The classification decoding head in the image detection model is used to decode the third visual representation to obtain the first detection result of the image to be detected. The first detection result includes the probability that the image to be detected has been tampered with.

[0190] The third visual representation is decoded using the segmentation decoding head in the image detection model to obtain the second detection result of the image to be detected. The second detection result includes a probability mask map of the image to be detected being tampered with.

[0191] Specifically, the process of the classification decoding head decoding the third visual representation is similar to the process of the classification decoding head decoding the second visual representation, as described above. Similarly, the process of the segmentation decoding head decoding the third visual representation is similar to the process of the segmentation decoding head decoding the second visual representation, as described above.

[0192] Figure 11 illustrates the framework of an image detection system provided by an embodiment of this application, which can be, for example, a multimodal image tampering detection and localization system. First, multimodal feature extraction is performed. Noise features of the RGB image 1101 are extracted using a noise extraction module 1102 (e.g., an SRM filter). These noise features are then concatenated with the original features of the RGB image 1101 to obtain multimodal features. These multimodal features are then input into a convolutional layer 1103 to obtain enhanced multimodal features. Extracting noise features suppresses semantic information in the image and fully exploits pixel and texture features. Next, feature encoding is performed. The enhanced multimodal features are input into an image encoder 1104 to obtain an image representation (or visual representation). Optionally, the image encoder 1104 can be a ResNet or ViT model. Then, classification and segmentation decoding are performed. The image representation is input into a classification decoder head 1105 and a segmentation decoder head 1105, respectively, to obtain a binary classification score indicating whether the image has been tampered with and a probability distribution map of the tampered region.
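As a non-limiting illustration, the multimodal feature extraction described above can be sketched in NumPy as follows. The sketch assumes one widely used 5×5 SRM high-pass kernel, edge padding, and a 1×1 convolution with three output channels and random, untrained weights; the application does not specify the kernel, the padding, or the weights. The names `extract_noise` and `conv1x1` are hypothetical.

```python
import numpy as np

# One 5x5 high-pass kernel commonly used for SRM-style noise residuals (assumed here).
SRM_KERNEL = np.array([
    [-1,  2,  -2,  2, -1],
    [ 2, -6,   8, -6,  2],
    [-2,  8, -12,  8, -2],
    [ 2, -6,   8, -6,  2],
    [-1,  2,  -2,  2, -1],
], dtype=np.float64) / 12.0


def extract_noise(image):
    """Filter each channel of a (C, H, W) image with SRM_KERNEL, using edge padding."""
    _, h, w = image.shape
    padded = np.pad(image, ((0, 0), (2, 2), (2, 2)), mode="edge")
    noise = np.zeros(image.shape, dtype=np.float64)
    for dy in range(5):
        for dx in range(5):
            noise += SRM_KERNEL[dy, dx] * padded[:, dy:dy + h, dx:dx + w]
    return noise


def conv1x1(features, weight):
    """1x1 convolution: (C_in, H, W) features, (C_out, C_in) weight -> (C_out, H, W)."""
    return np.einsum("oc,chw->ohw", weight, features)


rng = np.random.default_rng(0)
rgb = rng.random((3, 224, 224))                     # stand-in for RGB image 1101
noise = extract_noise(rgb)                          # noise extraction module 1102
multimodal = np.concatenate([rgb, noise], axis=0)   # RGB + noise, 6 channels
weight = rng.normal(size=(3, 6))                    # illustrative, untrained weights
enhanced = conv1x1(multimodal, weight)              # convolutional layer 1103
print(multimodal.shape, enhanced.shape)
```

Because the kernel coefficients sum to zero, a flat image region produces a zero residual, which is one way to see why such a filter suppresses image content and keeps pixel-level and texture-level variation.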

[0193] In this embodiment, the image encoder 1104 can be obtained through a first-stage pre-training, and the classification decoder 1105 and segmentation decoder 1105 can be obtained through a second-stage fine-tuning. Therefore, this embodiment can fully utilize the large-scale open-source AIGC detection dataset to train the image encoder during the training stage, and then fine-tune the classification decoder and segmentation decoder on the image tampering localization dataset to improve the ability to classify and segment tampering.

[0194] For example, after completing the first and second stages of model training, the trained image detection model can be used for model inference. As an example, during model inference, the RGB image 1101 can be resized to (224, 224), then all model parameters are frozen. Noise features are extracted using the noise extraction module 1102, and the RGB image features are concatenated with the noise features. Then, a 1×1 convolutional layer is used to reduce the dimensionality of the concatenated features to (batch_size, 3, 224, 224). Four feature maps at different scales are obtained through the image encoder 1104 (e.g., a ResNet model). The smallest feature map C5 is input into the classification decoder head 1105 to obtain the binary classification probability of whether the RGB image has been tampered with. Simultaneously, the feature maps C2, C3, C4, and C5 at the four scales are input into the segmentation decoder head 1105. After feature map fusion and upsampling, a probability mask map of the tampered image region is obtained.
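The shape bookkeeping behind the four feature maps can be sketched as follows. The sketch assumes the common ResNet layout in which stages C2 to C5 have strides of 4, 8, 16 and 32 relative to the input; the application does not state these strides. The functions `block_mean` and `toy_encoder` are hypothetical stand-ins that only average-pool the input; they do not learn features and do not model channel widths.

```python
import numpy as np


def block_mean(x, stride):
    """Average-pool a (C, H, W) array over non-overlapping stride x stride blocks."""
    c, h, w = x.shape
    return x.reshape(c, h // stride, stride, w // stride, stride).mean(axis=(2, 4))


def toy_encoder(x):
    """Hypothetical stand-in for image encoder 1104: returns maps C2..C5."""
    strides = {"C2": 4, "C3": 8, "C4": 16, "C5": 32}  # assumed ResNet-style strides
    return {name: block_mean(x, s) for name, s in strides.items()}


fused_input = np.ones((3, 224, 224))   # output of the 1x1 convolution, one sample
pyramid = toy_encoder(fused_input)
for name, fmap in pyramid.items():
    print(name, fmap.shape[1:])

# Routing described in the text: C5 alone goes to the classification head,
# while all four maps go to the segmentation head.
classification_input = pyramid["C5"]
segmentation_inputs = [pyramid[k] for k in ("C2", "C3", "C4", "C5")]
```

Under these assumptions a 224×224 input yields spatial sizes of 56, 28, 14 and 7 for C2 to C5, so the segmentation head has to bring the maps to a common resolution before fusing them.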

[0195] Therefore, this application proposes an image tampering localization and detection system based on contrastive learning for AI-generated image tampering localization and detection tasks in open scenarios. The system employs a two-stage training process: a first stage of AIGC detection pre-training and a second stage of image tampering segmentation fine-tuning. It fully utilizes a large number of open-source AIGC image datasets to pre-train the model's image encoder, addressing the lack of datasets for AI-generated image tampering, improving the system's few-shot learning ability, and enhancing detection generalization and accuracy. Simultaneously, noise features are introduced into the input features to suppress semantic information and fully exploit the pixel and texture features of AI-generated images, improving the effectiveness of detection and localization. Therefore, this application not only helps verify the authenticity of images and protect information content security but also maintains a clean cyberspace environment and prevents the spread of false information. This application can be directly applied to products such as public opinion management, information content security, and image tampering detection, improving the accuracy and interpretability of these products.

[0196] The specific embodiments of this application have been described in detail above with reference to the accompanying drawings. However, this application is not limited to the specific details of the above embodiments. Within the scope of the technical concept of this application, various simple modifications can be made to the technical solutions of this application, and these simple modifications all fall within the protection scope of this application. For example, the specific technical features described in the above embodiments can be combined in any suitable manner, provided that there is no contradiction. To avoid unnecessary repetition, this application does not describe the possible combinations separately. Furthermore, different embodiments of this application can also be combined arbitrarily; as long as such combinations do not depart from the spirit of this application, they should also be regarded as content disclosed in this application.

[0197] It should also be understood that, in the various method embodiments of this application, the sequence numbers of the above processes do not imply the order of execution. The execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of this application. It should be understood that these sequence numbers can be interchanged where appropriate so that the embodiments of this application described can be implemented in a sequence other than those illustrated or described.

[0198] The method embodiments of this application have been described in detail above. The device embodiments of this application are described in detail below with reference to Figures 12 to 14.

[0199] Figure 12 is a schematic block diagram of the image detection model training device 10 according to an embodiment of this application. As shown in Figure 12, the device 10 may include an acquisition unit 11, a training unit 12, and an output unit 13.

[0200] Acquisition unit 11 is used to acquire a first training sample, the first training sample including at least one first image sample and at least one first label corresponding to the at least one first image sample, the first label being used to indicate whether the first image sample is an image generated based on artificial intelligence (AI);

[0201] Training unit 12 is configured to input the first image sample into an image encoder to obtain a first visual representation of the first image sample, input the first label into a text encoder to obtain a text representation of the first label; and update the parameters of the image encoder and the text encoder according to the first visual representation and the text representation to obtain the trained image encoder.

[0202] The acquisition unit 11 is further configured to acquire a second training sample, the second training sample including a second image sample and a second label of the second image sample, the second label being used to indicate whether the second image sample is real or to represent a tampered region mask of the second image sample;

[0203] The training unit 12 is further configured to input the second image sample into the trained image encoder and decoder to obtain the detection result of the second image sample, and update the parameters of the decoder according to the detection result and the second label to obtain the trained decoder.

[0204] Output unit 13 is used to output the trained image encoder and the trained decoder as the image detection model.

[0205] In some embodiments, the training unit 12 is used to input the first image sample into an image encoder to obtain a first visual representation of the first image sample, including:

[0206] The first image sample is input into the noise extraction module to extract the noise features of the first image sample;

[0207] The noise features are concatenated with the image features of the first image sample to obtain the multimodal features of the first image sample;

[0208] The multimodal features are input into the image encoder to obtain the first visual representation.

[0209] In some embodiments, the training unit 12 is further configured to: input the multimodal features into a convolutional layer to obtain enhanced multimodal features before inputting the multimodal features into the image encoder to obtain the first visual representation.

[0210] In some embodiments, the training unit 12 is configured to update the parameters of the image encoder and the text encoder based on the first visual representation and the text representation to obtain the trained image encoder, including:

[0211] Based on the first visual representation and the text representation, obtain the similarity between the at least one first image sample and the at least one first label;

[0212] A first loss function is obtained based on at least one similarity between the at least one first image sample and the at least one first label;

[0213] The parameters of the image encoder and the text encoder are updated according to the first loss function to obtain the trained image encoder.

[0214] In some embodiments, the training unit 12 is used to obtain a first loss function based on at least one similarity between the at least one first image sample and the at least one first label, including:

[0215] An image contrast loss function is obtained based on at least one first similarity between the at least one first image sample and the at least one first label;

[0216] A text contrast loss function is obtained based on at least one second similarity between the at least one first label and the at least one first image sample;

[0217] The first loss function is obtained based on the image contrast loss function and the text contrast loss function.
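As a non-limiting illustration, the combination of an image contrast loss and a text contrast loss can be sketched as a symmetric, CLIP-style objective. The sketch assumes cosine similarity, a temperature of 0.07, one label text paired with each image in the batch, and an equal-weight average of the two terms; none of these choices is stated in this section, and a batch in which several images share the same label text would need additional handling. The function names are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row of a (N, D) array to unit length."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def cross_entropy(logits, targets):
    """Mean cross-entropy of (N, N) logits against integer targets, computed stably."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()


def contrastive_loss(visual, text, temperature=0.07):
    """Symmetric contrastive loss over paired visual/text representations."""
    v = l2_normalize(visual)
    t = l2_normalize(text)
    sim = v @ t.T / temperature                   # sim[i, j]: image i vs. label text j
    targets = np.arange(len(v))                   # pair i matches pair i (assumption)
    image_loss = cross_entropy(sim, targets)      # first similarity: image -> text
    text_loss = cross_entropy(sim.T, targets)     # second similarity: text -> image
    return 0.5 * (image_loss + text_loss)


# Toy representations: matched pairs should score a lower loss than mismatched ones.
visual = np.eye(4)
matched = contrastive_loss(visual, np.eye(4))
mismatched = contrastive_loss(visual, np.roll(np.eye(4), 1, axis=0))
print(matched < mismatched)
```

In an actual training loop the gradient of this loss would be propagated to both the image encoder and the text encoder, which is the parameter update described above.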

[0218] In some embodiments, the training unit 12 is configured to input the second image sample into the trained image encoder and decoder to obtain the detection result of the second image sample, and to update the parameters of the decoder based on the detection result and the second label to obtain the trained decoder, including:

[0219] The second image sample is input into the trained image encoder to obtain the second visual representation of the second image sample;

[0220] The second visual representation is input into the classification decoding head to obtain a first prediction result, which includes the probability that the second image sample has been tampered with.

[0221] Based on the first prediction result and the second label, a second loss function is obtained; wherein, the second label is used to indicate whether the second image sample is real;

[0222] The parameters of the classification decoder are updated according to the second loss function to obtain the trained classification decoder.

[0223] In some embodiments, the training unit 12 is used to input the second visual representation into the classification decoding head to obtain a first prediction result, including:

[0224] The first feature map with the smallest scale in the second visual representation is input into the average pooling layer to obtain a second feature map with a reduced size.

[0225] The second feature map is input into a fully connected layer, and the second feature map is mapped to a two-dimensional feature space indicating whether the image has been tampered with, to obtain the first prediction result.
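A minimal sketch of such a classification decoding head is given below. It assumes global average pooling, a softmax over the two outputs, and the class order (real, tampered); the application only specifies an average pooling layer followed by a fully connected layer with a two-dimensional output. The weights are random and untrained, and `classification_head` is a hypothetical name.

```python
import numpy as np


def classification_head(c5, weight, bias):
    """Map the smallest-scale feature map to [p_real, p_tampered].

    c5: (C, H, W) feature map; weight: (2, C); bias: (2,).
    """
    pooled = c5.mean(axis=(1, 2))            # average pooling -> (C,)
    logits = weight @ pooled + bias          # fully connected layer -> (2,)
    exp = np.exp(logits - logits.max())      # numerically stable softmax
    return exp / exp.sum()


rng = np.random.default_rng(0)
c5 = rng.normal(size=(64, 7, 7))             # illustrative channel count
weight = rng.normal(size=(2, 64))
bias = np.zeros(2)
probs = classification_head(c5, weight, bias)
print(probs)                                 # probs[1] is read as the tampering probability
```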

[0226] In some embodiments, the training unit 12 is configured to input the second image sample into the trained image encoder and decoder to obtain the detection result of the second image sample, and to update the parameters of the decoder based on the detection result and the second label to obtain the trained decoder, including:

[0227] The second image sample is input into the trained image encoder to obtain the second visual representation of the second image sample;

[0228] The second visual representation is input into the segmentation decoding head to obtain a second prediction result, which includes a probability mask of the second image sample being tampered with.

[0229] Based on the second prediction result and the second label, a third loss function is obtained; wherein, the second label is used to represent the tampered region mask of the second image sample;

[0230] The parameters of the segmentation decoder are updated according to the third loss function to obtain the trained segmentation decoder.

[0231] In some embodiments, the training unit 12 is used to input the second visual representation into the segmentation decoding head to obtain a second prediction result, including:

[0232] The feature maps at each scale in the second visual representation are fused, and the fused feature maps are upsampled to the original image size to obtain the second prediction result.
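The fusion and upsampling step can be sketched as follows. The sketch assumes that each feature map is projected to a single channel, brought to the resolution of the largest map by nearest-neighbour upsampling, summed, upsampled again to the image size, and passed through a sigmoid. The application does not specify the fusion operator or the interpolation method (bilinear interpolation is a common alternative), and the channel counts, weights, and function names here are hypothetical.

```python
import numpy as np


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (H, W) array by an integer factor."""
    return np.repeat(np.repeat(x, factor, axis=0), factor, axis=1)


def segmentation_head(feature_maps, projections, image_size):
    """Fuse multi-scale (C, H, W) maps, largest first, into a tamper-probability mask."""
    target = feature_maps[0].shape[1]                  # resolution of the largest map
    fused = np.zeros((target, target))
    for fmap, proj in zip(feature_maps, projections):
        score = np.einsum("c,chw->hw", proj, fmap)     # per-pixel projection to 1 channel
        fused += upsample_nearest(score, target // fmap.shape[1])
    logits = upsample_nearest(fused, image_size // target)
    return 1.0 / (1.0 + np.exp(-logits))               # sigmoid -> probability mask


rng = np.random.default_rng(0)
channels, sizes = (8, 16, 32, 64), (56, 28, 14, 7)     # illustrative C2..C5 shapes
maps = [rng.normal(scale=0.1, size=(c, s, s)) for c, s in zip(channels, sizes)]
projs = [rng.normal(scale=0.1, size=c) for c in channels]
mask = segmentation_head(maps, projs, image_size=224)
print(mask.shape)
```

Each entry of `mask` is read as the probability that the corresponding pixel belongs to a tampered region.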

[0233] It should be understood that the device embodiments and the method embodiments correspond to each other, and for similar descriptions reference can be made to the method embodiments. To avoid repetition, further details are not provided here. Specifically, the image detection model training device 10 shown in Figure 12 can execute the above method embodiments, and the aforementioned and other operations and/or functions of each module in the device 10 respectively implement the corresponding processes in the above method 300. For the sake of brevity, they are not described in detail here.

[0234] Figure 13 is a schematic block diagram of the image detection device 20 according to an embodiment of this application. As shown in Figure 13, the device 20 may include an acquisition unit 21 and an image detection model 22.

[0235] Acquisition unit 21 is used to acquire the image to be detected;

[0236] Image detection model 22 is used to receive the image to be detected as input, perform feature encoding on the image to be detected using the image encoder in the image detection model 22 to obtain the third visual representation of the image to be detected, and decode the third visual representation using the decoding head in the image detection model 22 to obtain the detection result of the image to be detected.

[0237] The image detection model 22 is obtained according to the above-described image detection model training method.

[0238] In some embodiments, the image detection model is used to decode the third visual representation using a decoding head in the image detection model to obtain the detection result of the image to be detected, including at least one of the following:

[0239] The third visual representation is decoded using the classification decoding head in the image detection model to obtain a first detection result of the image to be detected, wherein the first detection result includes the probability that the image to be detected has been tampered with;

[0240] The third visual representation is decoded using the segmentation decoding head in the image detection model to obtain a second detection result of the image to be detected. The second detection result includes a probability mask map of the image to be detected being tampered with.

[0241] It should be understood that the device embodiments and the method embodiments correspond to each other, and for similar descriptions reference can be made to the method embodiments. To avoid repetition, further details are not provided here. Specifically, the image detection device 20 shown in Figure 13 can execute the above method embodiments, and the aforementioned and other operations and/or functions of each module in the device 20 respectively implement the corresponding processes in the above method 1000. For the sake of brevity, they are not described in detail here.

[0242] The apparatus of the embodiments of this application has been described above from the perspective of functional modules with reference to the accompanying drawings. It should be understood that these functional modules can be implemented in hardware, in software instructions, or in a combination of hardware and software modules. Specifically, the steps of the method embodiments in this application can be completed by integrated logic circuits in the processor's hardware and/or by software instructions. The steps of the methods disclosed in the embodiments of this application can be directly executed by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. Optionally, the software modules can reside in a storage medium that is well established in the art, such as random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory, and the processor reads information from the memory and, in conjunction with its hardware, completes the steps of the above method embodiments.

[0243] Figure 14 is a schematic block diagram of the electronic device provided in the embodiments of this application.

[0244] As shown in Figure 14, the electronic device 30 may include:

[0245] A memory 33 and a processor 32, the memory 33 being used to store a computer program 34 and to transfer the program code 34 to the processor 32. In other words, the processor 32 can retrieve and run the computer program 34 from the memory 33 to implement the method provided in the embodiments of this application.

[0246] In some embodiments, the processor 32 may call and run the computer program 34 from the memory 33 to implement the image detection model training method provided in this application embodiment, including:

[0247] Obtain a first training sample, the first training sample including at least one first image sample and at least one first label corresponding to the at least one first image sample, the first label being used to indicate whether the first image sample is an image generated based on artificial intelligence (AI);

[0248] The first image sample is input into the image encoder to obtain the first visual representation of the first image sample, and the first label is input into the text encoder to obtain the text representation of the first label; and the parameters of the image encoder and the text encoder are updated according to the first visual representation and the text representation to obtain the trained image encoder.

[0249] Obtain a second training sample, which includes a second image sample and a second label of the second image sample. The second label is used to indicate whether the second image sample is real or to represent a tampered region mask of the second image sample.

[0250] The second image sample is input into the trained image encoder and decoder to obtain the detection result of the second image sample, and the parameters of the decoder are updated according to the detection result and the second label to obtain the trained decoder.

[0251] Output the trained image encoder and the trained decoder as the image detection model.

[0252] For example, the processor 32 can be used to execute the steps in the method 300 described above according to the instructions in the computer program 34.

[0253] In some embodiments, the processor 32 may call and run the computer program 34 from the memory 33 to implement the image detection method in this application embodiment, including:

[0254] Acquire the image to be detected;

[0255] The image to be detected is input into an image detection model, and the image encoder in the image detection model is used to encode the features of the image to be detected to obtain the third visual representation of the image to be detected; and the decoding head in the image detection model is used to decode the third visual representation to obtain the detection result of the image to be detected.

[0256] The image detection model is obtained according to the above-described image detection model training method.

[0257] For example, the processor 32 can be used to execute the steps in the method 1000 described above according to the instructions in the computer program 34.

[0258] In some embodiments of this application, the processor 32 may include, but is not limited to:

[0259] General-purpose processors, digital signal processors (DSPs), application-specific integrated circuits (ASICs), field-programmable gate arrays (FPGAs), or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.

[0260] In some embodiments of this application, the memory 33 includes, but is not limited to:

[0261] Volatile memory and / or non-volatile memory. Non-volatile memory can be read-only memory (ROM), programmable read-only memory (PROM), erasable programmable read-only memory (EPROM), electrically erasable programmable read-only memory (EEPROM), or flash memory. Volatile memory can be random access memory (RAM), which is used as an external cache. By way of example, but not limitation, many forms of RAM are available, such as Static RAM (SRAM), Dynamic RAM (DRAM), Synchronous DRAM (SDRAM), Double Data Rate SDRAM (DDR SDRAM), Enhanced Synchronous DRAM (ESDRAM), Synchronous Link DRAM (SLDRAM), and Direct Rambus RAM (DR RAM).

[0262] In some embodiments of this application, the computer program 34 may be divided into one or more units, which are stored in the memory 33 and executed by the processor 32 to perform the method provided in this application. The one or more units may be a series of computer program instruction segments capable of performing a specific function, which describe the execution process of the computer program 34 in the electronic device 30.

[0263] Optionally, as shown in Figure 14, the electronic device 30 may further include:

[0264] Transceiver 33, which can be connected to processor 32 or memory 33.

[0265] The processor 32 can control the transceiver 33 to communicate with other devices; specifically, it can send information or data to other devices or receive information or data sent by other devices. The transceiver 33 may include a transmitter and a receiver. The transceiver 33 may further include antennas, and the number of antennas may be one or more. It should be understood that the various components in this electronic device are connected through a bus system, which includes, in addition to a data bus, a power bus, a control bus, and a status signal bus.

[0266] This application also provides a computer storage medium storing a computer program thereon, which, when executed by a computer, enables the computer to perform the methods of the above-described method embodiments. Alternatively, embodiments of this application also provide a computer program product containing instructions that, when executed by a computer, cause the computer to perform the methods of the above-described method embodiments.

[0267] When implemented in software, the embodiments can be implemented entirely or partially as a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this application are produced. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or another programmable device. The computer instructions can be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions can be transmitted from one website, computer, server, or data center to another by wired means (e.g., coaxial cable, optical fiber, digital subscriber line (DSL)) or wireless means (e.g., infrared, radio, microwave). The computer-readable storage medium can be any available medium that a computer can access, or a data storage device such as a server or data center that integrates one or more available media. The available medium can be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., digital video disc (DVD)), or a semiconductor medium (e.g., solid-state disk (SSD)).

[0268] It is understood that in the specific implementation of this application, when the above embodiments of this application are applied to specific products or technologies and involve user information and other related data, user permission or consent is required, and the collection, use and processing of related data must comply with the relevant laws, regulations and standards of the relevant countries and regions.

[0269] Those skilled in the art will recognize that the modules and algorithm steps of the various examples described in conjunction with the embodiments disclosed herein can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are implemented in hardware or software depends on the specific application and design constraints of the technical solution. Those skilled in the art can use different methods to implement the described functions for each specific application, but such implementation should not be considered beyond the scope of this application.

[0270] In the several embodiments provided in this application, it should be understood that the disclosed systems, apparatuses, and methods can be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative; for instance, the division of modules is only a logical functional division, and in actual implementation, there may be other division methods. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the coupling or direct coupling or communication connection shown or discussed may be through some interfaces; the indirect coupling or communication connection between apparatuses or modules may be electrical, mechanical, or other forms.

[0271] The modules described as separate components may or may not be physically separate. The components shown as modules may or may not be physical modules; that is, they may be located in one place or distributed across multiple network units. Some or all of the modules can be selected to achieve the purpose of this embodiment according to actual needs. For example, the functional modules in the various embodiments of this application may be integrated into one processing module, or each module may exist physically separately, or two or more modules may be integrated into one module.

[0272] The above are merely specific embodiments of this application, but the scope of protection of this application is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the scope of the technology disclosed in this application should be included within the scope of protection of this application. Therefore, the scope of protection of this application should be determined by the scope of the claims.

Claims

1. A method for training an image detection model, characterized in that the method comprises: obtaining a first training sample, the first training sample including at least one first image sample and at least one first label corresponding to the at least one first image sample, the first label being used to indicate whether the first image sample is an image generated based on artificial intelligence (AI); inputting the first image sample into an image encoder to obtain a first visual representation of the first image sample, and inputting the first label into a text encoder to obtain a text representation of the first label; updating the parameters of the image encoder and the text encoder based on the first visual representation and the text representation to obtain a trained image encoder; obtaining a second training sample, the second training sample including a second image sample and a second label of the second image sample, the second label being used to indicate whether the second image sample is real or to represent a tampered region mask of the second image sample; inputting the second image sample into the trained image encoder and a decoder to obtain a detection result of the second image sample, and updating the parameters of the decoder according to the detection result and the second label to obtain a trained decoder; and outputting the trained image encoder and the trained decoder as the image detection model.

2. The method according to claim 1, characterized in that inputting the first image sample into the image encoder to obtain the first visual representation of the first image sample comprises: inputting the first image sample into a noise extraction module to extract noise features of the first image sample; concatenating the noise features with image features of the first image sample to obtain multimodal features of the first image sample; and inputting the multimodal features into the image encoder to obtain the first visual representation.

3. The method according to claim 2, characterized in that, before inputting the multimodal features into the image encoder to obtain the first visual representation, the method further comprises: inputting the multimodal features into a convolutional layer to obtain enhanced multimodal features.

4. The method according to claim 1, characterized in that updating the parameters of the image encoder and the text encoder based on the first visual representation and the text representation to obtain the trained image encoder comprises: obtaining, based on the first visual representation and the text representation, the similarity between the at least one first image sample and the at least one first label; obtaining a first loss function based on at least one similarity between the at least one first image sample and the at least one first label; and updating the parameters of the image encoder and the text encoder according to the first loss function to obtain the trained image encoder.

5. The method according to claim 4, characterized in that obtaining the first loss function based on at least one similarity between the at least one first image sample and the at least one first label comprises: obtaining an image contrast loss function based on at least one first similarity between the at least one first image sample and the at least one first label; obtaining a text contrast loss function based on at least one second similarity between the at least one first label and the at least one first image sample; and obtaining the first loss function based on the image contrast loss function and the text contrast loss function.

6. The method according to any one of claims 1-5, characterized in that inputting the second image sample into the trained image encoder and the decoder to obtain the detection result of the second image sample, and updating the parameters of the decoder based on the detection result and the second label to obtain the trained decoder, comprises: inputting the second image sample into the trained image encoder to obtain a second visual representation of the second image sample; inputting the second visual representation into a classification decoding head to obtain a first prediction result, the first prediction result including a probability that the second image sample has been tampered with; obtaining a second loss function based on the first prediction result and the second label, wherein the second label is used to indicate whether the second image sample is real; and updating the parameters of the classification decoding head according to the second loss function to obtain the trained classification decoding head.
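
By way of non-limiting illustration, a second loss function computed from the predicted tampering probability and the second label may be sketched as follows. The claim does not name a specific loss; binary cross-entropy is assumed here, with the label encoded as `True` for a tampered sample and `False` for a real one. The function names are hypothetical.

```python
import math


def binary_cross_entropy(p_tampered, is_tampered, eps=1e-7):
    # Clamp the probability so that log() never receives 0.
    p = min(max(p_tampered, eps), 1.0 - eps)
    return -math.log(p) if is_tampered else -math.log(1.0 - p)


def batch_loss(probs, labels):
    return sum(binary_cross_entropy(p, y) for p, y in zip(probs, labels)) / len(probs)


good = batch_loss([0.9, 0.1], [True, False])  # confident and correct
bad = batch_loss([0.1, 0.9], [True, False])   # confident and wrong
```

Only the decoding head parameters would be updated from this loss in the fine-tuning stage described by the claim; the optimizer step itself is omitted from the sketch.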

7. The method according to claim 6, characterized in that inputting the second visual representation into the classification decoding head to obtain the first prediction result comprises: inputting a first feature map having the smallest scale in the second visual representation into an average pooling layer to obtain a second feature map of reduced size; and inputting the second feature map into a fully connected layer, which maps the second feature map to a two-dimensional feature space indicating whether the image has been tampered with, to obtain the first prediction result.
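
By way of non-limiting illustration, the classification decoding head may be sketched as follows. The sketch assumes global average pooling (one value per channel), a softmax over the two outputs of the fully connected layer, and that index 1 denotes the tampered class; none of these details is stated in the claim, and the names are hypothetical.

```python
import math


def global_average_pool(feature_map):
    """feature_map: C channels of H x W, reduced to a list of C channel means."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in feature_map]


def classify(feature_map, weights, bias):
    pooled = global_average_pool(feature_map)
    # Fully connected layer mapping C pooled values to 2 logits.
    logits = [b + sum(w * x for w, x in zip(row, pooled)) for row, b in zip(weights, bias)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    return exps[1] / sum(exps)  # probability of the assumed "tampered" class


smallest_scale = [[[1.0, 3.0], [1.0, 3.0]], [[0.0, 0.0], [0.0, 0.0]]]  # 2 channels, 2 x 2
p_tampered = classify(smallest_scale, weights=[[0.0, 0.0], [1.0, 0.0]], bias=[0.0, 0.0])
```

Here the pooled vector is `[2.0, 0.0]`, the logits are `[0.0, 2.0]`, and the resulting tampering probability is the softmax value for the second logit.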

8. The method according to any one of claims 1-5, characterized in that inputting the second image sample into the trained image encoder and the decoder to obtain the detection result of the second image sample, and updating the parameters of the decoder based on the detection result and the second label to obtain the trained decoder, comprises: inputting the second image sample into the trained image encoder to obtain a second visual representation of the second image sample; inputting the second visual representation into a segmentation decoding head to obtain a second prediction result, the second prediction result including a tampering probability mask of the second image sample; obtaining a third loss function based on the second prediction result and the second label, wherein the second label is used to represent the tampered region mask of the second image sample; and updating the parameters of the segmentation decoding head according to the third loss function to obtain the trained segmentation decoding head.
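
By way of non-limiting illustration, a third loss function comparing the predicted probability mask with the tampered region mask may be sketched as follows. The claim does not name a specific loss; a soft Dice loss is assumed here as one common choice for mask prediction, and `dice_loss` is a hypothetical name.

```python
def dice_loss(pred_mask, true_mask, eps=1e-6):
    """pred_mask: H x W probabilities in [0, 1]; true_mask: H x W values 0 or 1."""
    # Soft overlap between the predicted and ground-truth tampered regions.
    inter = sum(p * g for pr, gr in zip(pred_mask, true_mask) for p, g in zip(pr, gr))
    total = sum(map(sum, pred_mask)) + sum(map(sum, true_mask))
    return 1.0 - (2.0 * inter + eps) / (total + eps)


truth = [[1, 0], [0, 0]]
perfect = dice_loss([[1.0, 0.0], [0.0, 0.0]], truth)   # prediction equals the mask
disjoint = dice_loss([[0.0, 1.0], [0.0, 0.0]], truth)  # prediction misses the region
```

A per-pixel cross-entropy term could be used instead of, or in addition to, the Dice term; the sketch only shows that the loss is small when the predicted mask overlaps the labelled region and large when it does not.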

9. The method according to claim 8, characterized in that inputting the second visual representation into the segmentation decoding head to obtain the second prediction result comprises: fusing the feature maps at each scale in the second visual representation, and upsampling the fused feature map to the original image size to obtain the second prediction result.
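
By way of non-limiting illustration, multi-scale fusion followed by upsampling may be sketched as follows. The claim does not specify the fusion operator or the interpolation method; this sketch assumes single-channel feature maps, nearest-neighbour resizing, element-wise averaging as the fusion step, and a sigmoid to convert the result into probabilities. The function names are hypothetical.

```python
import math


def upsample_nearest(fmap, out_h, out_w):
    h, w = len(fmap), len(fmap[0])
    return [[fmap[y * h // out_h][x * w // out_w] for x in range(out_w)]
            for y in range(out_h)]


def fuse_and_upsample(feature_maps, image_h, image_w):
    """feature_maps: single-channel maps at different scales."""
    top_h = max(len(f) for f in feature_maps)
    top_w = max(len(f[0]) for f in feature_maps)
    # Align every scale to the largest feature resolution, then average (fusion).
    aligned = [upsample_nearest(f, top_h, top_w) for f in feature_maps]
    fused = [[sum(a[y][x] for a in aligned) / len(aligned) for x in range(top_w)]
             for y in range(top_h)]
    # Upsample the fused map to the original image size and squash to [0, 1].
    logits = upsample_nearest(fused, image_h, image_w)
    return [[1.0 / (1.0 + math.exp(-z)) for z in row] for row in logits]


coarse = [[4.0]]                    # 1 x 1 map
fine = [[0.0, 0.0], [0.0, 4.0]]     # 2 x 2 map
mask = fuse_and_upsample([coarse, fine], image_h=4, image_w=4)
```

The output has the same height and width as the input image, so each entry can be read as the tampering probability of the corresponding pixel.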

10. An image detection method, characterized in that it comprises: acquiring an image to be detected; and inputting the image to be detected into an image detection model, performing feature encoding on the image to be detected using an image encoder in the image detection model to obtain a third visual representation of the image to be detected, and decoding the third visual representation using a decoding head in the image detection model to obtain a detection result of the image to be detected; wherein the image detection model is obtained according to the method of any one of claims 1-9.

11. The method according to claim 10, characterized in that decoding the third visual representation using the decoding head in the image detection model to obtain the detection result of the image to be detected comprises at least one of the following: decoding the third visual representation using a classification decoding head in the image detection model to obtain a first detection result of the image to be detected, wherein the first detection result includes a probability that the image to be detected has been tampered with; and decoding the third visual representation using a segmentation decoding head in the image detection model to obtain a second detection result of the image to be detected, wherein the second detection result includes a tampering probability mask of the image to be detected.
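
By way of non-limiting illustration, the detection flow with either or both decoding heads may be sketched as follows. The `detect` function, the result keys, and the toy encoder and heads are all hypothetical stand-ins; a trained image encoder and trained decoding heads would take their place in practice.

```python
def detect(image, encoder, classification_head=None, segmentation_head=None):
    """Encode the image once, then apply whichever decoding heads are supplied."""
    representation = encoder(image)  # third visual representation
    result = {}
    if classification_head is not None:
        result["tamper_probability"] = classification_head(representation)
    if segmentation_head is not None:
        result["tamper_mask"] = segmentation_head(representation)
    return result


# Toy stand-ins so the sketch runs end to end (not trained components).
toy_encoder = lambda img: img
toy_cls_head = lambda rep: max(max(row) for row in rep)
toy_seg_head = lambda rep: [[float(v > 0.5) for v in row] for row in rep]

image = [[0.1, 0.2], [0.3, 0.9]]
both = detect(image, toy_encoder, toy_cls_head, toy_seg_head)
cls_only = detect(image, toy_encoder, classification_head=toy_cls_head)
```

Sharing one encoder pass between the two heads reflects the "at least one of" wording: image-level classification and pixel-level localization can be requested separately or together.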

12. An image detection model training device, characterized in that it comprises: an acquisition unit configured to acquire a first training sample, the first training sample including at least one first image sample and at least one first label corresponding to the at least one first image sample, the first label being used to indicate whether the first image sample is an image generated based on artificial intelligence (AI); a training unit configured to input the first image sample into an image encoder to obtain a first visual representation of the first image sample, to input the first label into a text encoder to obtain a text representation of the first label, and to update the parameters of the image encoder and the text encoder based on the first visual representation and the text representation to obtain the trained image encoder; the acquisition unit being further configured to acquire a second training sample, the second training sample including a second image sample and a second label of the second image sample, the second label being used to indicate whether the second image sample is real or to represent a tampered region mask of the second image sample; the training unit being further configured to input the second image sample into the trained image encoder and a decoder to obtain a detection result of the second image sample, and to update the parameters of the decoder according to the detection result and the second label to obtain the trained decoder; and an output unit configured to output the trained image encoder and the trained decoder as the image detection model.

13. An image detection device, characterized in that it comprises: an acquisition unit configured to acquire an image to be detected; and an image detection model configured to receive the image to be detected, perform feature encoding on the image to be detected using an image encoder in the image detection model to obtain a third visual representation of the image to be detected, and decode the third visual representation using a decoding head in the image detection model to obtain a detection result of the image to be detected; wherein the image detection model is obtained according to the method of any one of claims 1-9.

14. An electronic device, characterized in that it comprises a processor and a memory, wherein the memory stores instructions that, when executed by the processor, cause the processor to perform the method according to any one of claims 1-9, or the method according to any one of claims 10-11.

15. A computer storage medium, characterized in that it is used to store a computer program, the computer program comprising instructions for performing the method according to any one of claims 1-9, or the method according to any one of claims 10-11.

16. A computer program product, characterized in that it comprises computer program code which, when executed by an electronic device, causes the electronic device to perform the method according to any one of claims 1-9, or the method according to any one of claims 10-11.