Image category detection method, device, equipment and storage medium

By first using a style classification model to determine the authenticity of an image in AIGC image detection, and then using a corresponding image detection model for further detection, the problem of insufficient accuracy in AIGC image detection is solved, and higher detection accuracy is achieved.

CN119027730B · Active · Publication Date: 2026-06-12 · EVERSEC BEIJING TECH

Patent Information

Authority / Receiving Office: CN · China
Patent Type: Patents (China)
Current Assignee / Owner: EVERSEC BEIJING TECH
Filing Date: 2024-08-13
Publication Date: 2026-06-12

AI Technical Summary

Technical Problem

Existing technologies for detecting AIGC-generated images suffer from insufficient accuracy, particularly due to issues such as the failure of physical feature detection caused by the evolution of AIGC models, the susceptibility of digital watermarks to being cracked, and the reduced detection accuracy of deep learning models when there are large differences in sample styles.

Method used

A style classification model is used to initially determine whether an image belongs to a true or false style category. Then, an appropriate image detection model is selected based on the category for further detection, including true image detection models and false image detection models, thereby improving detection accuracy.

Benefits of technology

By adopting a classification-then-detection approach, the accuracy of AIGC image category detection is improved, solving the problem of low detection accuracy in existing technologies.


Abstract

The application relates to the technical field of image processing, and in particular provides an image category detection method, apparatus, device, and storage medium. The method comprises: obtaining an image to be detected; classifying the image to be detected using a style classification model to obtain a first target category of the image to be detected, the first target category including a real style category or a non-real style category; using the image detection model corresponding to the first target category as a target image detection model; and detecting the image to be detected using the target image detection model to obtain a second target category of the image to be detected, the second target category including a generated category or a non-generated category. Because the style category of the image to be detected is first determined using the style classification model, and the image is then sent to the image detection model corresponding to that style category to determine whether it is a generated image, the accuracy of image category detection is improved.

Description

Technical Field

[0001] This application relates to the field of image processing technology, and in particular to a method, apparatus, device, and storage medium for detecting image categories.

Background Technology

[0002] With the continuous development of artificial intelligence technology, inputting instructions into the AIGC (Artificial Intelligence Generated Content) model will allow the AIGC model to automatically create content such as text, images, audio, or video.

[0003] However, in some fields, due to issues such as authenticity and versioning, it is necessary to detect whether an image was generated by an AIGC model. The purpose of AIGC-generated image category detection is to effectively detect whether an image or video was generated by AIGC. Currently, image category detection methods can be divided into three categories according to their technical approaches. The first category distinguishes AIGC-generated images based on physical features. This type of method mainly distinguishes AIGC-generated images by looking for inconsistencies between human information and the physical world. For example, based on the physical rules of lighting and reflection in perspective, it is possible to determine whether an image was generated by AIGC. The second category adds a digital watermark to the AIGC-generated image. A digital watermark is added at the beginning of AI image generation, and subsequent detection only needs to detect this digital watermark to determine whether the image was generated by the AIGC model. The third category is a method for detecting AIGC images based on deep learning models. The detection effect of this method usually depends on the effectiveness of the network feature extraction and suitable and diverse training data.

[0004] As AIGC models continue to evolve, defects in images will gradually disappear, rendering the first type of detection method ineffective. Since a completely reliable digital watermarking technology does not yet exist, the second type is easily cracked, leading to reduced detection accuracy. The third type of detection method relies on extracted sample features; when there are significant differences in sample style, the model's detection accuracy decreases.

Summary of the Invention

[0005] To address the aforementioned technical problems, this application provides a method, apparatus, device, and storage medium for detecting image categories, thereby improving the accuracy of AIGC image category detection.

[0006] In a first aspect, this application provides an image category detection method, the method comprising: acquiring an image to be detected; classifying the image to be detected using a style classification model to obtain a first target category of the image to be detected, the first target category including a real style category or a non-real style category; using an image detection model corresponding to the first target category as a target detection model; and detecting the image to be detected using a target image detection model to obtain a second target category of the image to be detected, the second target category including a generated category or a non-generated category.

[0007] In a second aspect, this application provides an image category detection device, the device comprising: an image acquisition module for acquiring an image to be detected; a style detection module for classifying the image to be detected using a style classification model to obtain a first target category corresponding to the image to be detected, wherein the first target category includes a real style category or a non-real style category; a model selection module for using an image detection model corresponding to the first target category as a target detection model; and a generation detection module for detecting the image to be detected using the target image detection model to obtain a second target category corresponding to the image to be detected, wherein the second target category includes a generated category or a non-generated category.

[0008] In a third aspect, this application provides an electronic device comprising: one or more processors; and a storage device for storing one or more programs, wherein the one or more programs, when executed by the one or more processors, cause the one or more processors to implement the image category detection method described in the first aspect above.

[0009] In a fourth aspect, this application provides a storage medium, which may be a computer-readable storage medium storing a computer program which, when executed by a processor, implements the image category detection method described in the first aspect above.

[0010] In a fifth aspect, embodiments of this application provide a computer program product, which includes a computer program or instructions that, when executed by a processor, implement the image category detection method described in the first aspect above.

[0011] The technical solution provided in this application has the following advantages compared with the prior art:

[0012] This application provides a method, apparatus, device, and storage medium for detecting image categories. The method includes: acquiring an image to be detected; classifying the image to be detected using a style classification model to obtain a first target category, wherein the first target category includes a real style category or a non-real style category; using an image detection model corresponding to the first target category as a target detection model; and detecting the image to be detected using the target image detection model to obtain a second target category, wherein the second target category includes a generated category or a non-generated category. In this embodiment, the style category of the image to be detected is first determined using a style classification model, and the image is then input into the image detection model corresponding to that style category to determine whether it is a generated image. This solves the problem of low image detection accuracy and improves the accuracy of image category detection.

Attached Figure Description

[0013] The accompanying drawings, which are incorporated in and form part of this specification, illustrate embodiments consistent with this application and, together with the description, serve to explain the principles of this application.

[0014] To more clearly illustrate the technical solutions in the embodiments of this application or the prior art, the drawings used in the description of the embodiments or the prior art will be briefly introduced below. Obviously, for those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0015] Figure 1 is a flowchart of an image category detection method provided in an embodiment of this application;

[0016] Figure 2 is a schematic diagram of the relationships among the models provided in an embodiment of this application;

[0017] Figure 3 is a flowchart of the model training method provided in an embodiment of this application;

[0018] Figure 4 is a flowchart of another image category detection method provided in an embodiment of this application;

[0019] Figure 5 is a schematic diagram of the parameter configuration of the image feature extraction network provided in an embodiment of this application;

[0020] Figure 6 is a schematic diagram of the structure of the texture feature extraction network provided in an embodiment of this application;

[0021] Figure 7 is a schematic diagram of the structure of the image detection model provided in an embodiment of this application;

[0022] Figure 8 is a set of comparison diagrams of experimental activation maps provided in an embodiment of this application;

[0023] Figure 9 is a schematic diagram of the structure of the image category detection device provided in an embodiment of this application;

[0024] Figure 10 is a schematic diagram of the structure of an electronic device provided in an embodiment of this application.

Detailed Implementation

[0025] To better understand the above-mentioned objectives, features, and advantages of this application, the solution of this application will be further described below. It should be noted that, unless otherwise specified, the embodiments and features described in these embodiments can be combined with each other.

[0026] Many specific details are set forth in the following description in order to provide a full understanding of this application, but this application may also be implemented in other ways different from those described herein; obviously, the embodiments in the specification are only some embodiments of this application, and not all embodiments.

[0027] The term "comprising" and its variations as used herein are open-ended inclusions, meaning "including but not limited to". The term "based on" means "at least partially based on". The term "one embodiment" means "at least one embodiment"; the term "another embodiment" means "at least one additional embodiment"; the term "some embodiments" means "at least some embodiments". Definitions of other terms will be given in the description below.

[0028] It should be noted that the concepts of "first" and "second" mentioned in this application are only used to distinguish different devices, modules or units, and are not used to limit the order of functions performed by these devices, modules or units or their interdependencies.

[0029] It should be noted that the terms "a" and "a plurality of" used in this application are illustrative rather than restrictive, and those skilled in the art should understand that, unless otherwise expressly indicated in the context, they should be understood as "one or more".

[0030] The image category detection method provided in this application will be described in detail below with reference to the accompanying drawings and specific embodiments.

[0031] Figure 1 is a flowchart of an image category detection method according to an embodiment of this application. This embodiment is applicable to detecting whether an image is an AI-generated image. The method can be executed by an image category detection device, which can be implemented in software and/or hardware and can be configured in an electronic device.

[0032] As shown in Figure 1, the image category detection method provided in this embodiment mainly includes steps S101-S104.

[0033] S101. Obtain the image to be detected.

[0034] The image to be detected can be a single image. The image to be detected can be any one of four types of images: real non-generated image, real generated image, non-real generated image, and non-real non-generated image. The image category detection method provided in the embodiments of this application can identify the category of the image to be detected.

[0035] The image to be detected can also be a collection of multiple images to be detected. This collection can consist of any one of four types of images: real non-generated images, real generated images, non-real generated images, and non-real non-generated images. It can also be a collection composed of two or more of the above four types of images. The image category detection method provided in the embodiments of this application can identify the category of each image to be detected in the collection.

[0036] Real images refer to images of real objects or scenes that actually exist in the real world, while generated images refer to images synthesized using computer algorithms.

[0037] A real non-generated image is an image captured directly by a camera or other image-capturing device. For example, a photo of a running dog taken with a mobile phone is captured directly from the real world without being processed by any image generation algorithm. Therefore, it is a real non-generated image.

[0038] A real generated image is an image created by a computer algorithm based on real-world data or information. In other words, although the image is generated by a computer algorithm, its content is real, or at least likely to exist in the real world. For example, if an AIGC model is trained using a large number of real cat images as training data and, after training, generates an image that looks very much like a real cat, then this image is a real generated image.

[0039] A non-real generated image is an image generated by a computer algorithm whose content may be entirely fictional and not based on any real-world data. In other words, the image content does not reflect the real world and is entirely imagined or created. For example, a generative model might create a fantastical creature, such as a dragon or unicorn, an abstract artistic image, a virtual character for a game, or a scene from a science fiction movie. In these cases, the resulting image is a non-real generated image.

[0040] A non-real non-generated image is an image that is neither generated by an algorithm nor a depiction of any real-world scene or object. Examples include a fantasy landscape hand-drawn by an artist, or a logo created by a designer using graphic design software.

[0041] S102. Use a style classification model to classify the image to be detected to obtain the first target category of the image to be detected. The first target category includes the real style category or the non-real style category.

[0042] The style classification model is primarily used to determine whether the image to be detected is a real style image. The real style category indicates that the image to be detected is a real image, while the non-real style category indicates that the image to be detected is a non-real image.

[0043] Furthermore, this style classification model is a binary classification model. After the image to be detected is input into the style classification model, the style classification model performs classification processing on the image to be detected to determine whether the image to be detected is a real image or a non-real image.

[0044] S103. Use the image detection model corresponding to the first target category as the target detection model.

[0045] The overall image category detection system is explained first. As shown in Figure 2, the image category detection system includes a style classification model 21 and two image detection models, namely a real image detection model 22 and a non-real image detection model 23. The real image detection model 22 is used to detect whether a real image is a generated image, and the non-real image detection model 23 is used to detect whether a non-real image is a generated image.

[0046] In this system, each target category corresponds to a specific image detection model. In other words, the image detection model corresponding to the real style category is a real image detection model, and the image detection model corresponding to the non-real style category is a non-real image detection model. The selection of either the real image detection model or the non-real image detection model as the subsequent detection model is determined based on whether the image to be detected is a real image or a non-real image.

[0047] In one possible implementation, using the image detection model corresponding to the first target category as the target detection model includes: when the first target category is the real style category, selecting the real image detection model as the target detection model; and when the first target category is the non-real style category, selecting the non-real image detection model as the target detection model.

[0048] Specifically, as shown in Figure 2, both the real image detection model and the non-real image detection model are connected to the style classification model. When the style classification model determines that the image to be detected belongs to the real style category, that is, the image to be detected is a real image, the image to be detected can be directly input into the real image detection model. When the style classification model determines that the image to be detected belongs to the non-real style category, that is, the image to be detected is a non-real image, the image to be detected can be directly input into the non-real image detection model.

[0049] S104. Use the target image detection model to detect the image to be detected and obtain the second target category of the image to be detected. The second target category includes the generated category or the non-generated category.

[0050] Image detection models include real image detection models and non-real image detection models. The target image detection model can be either a real image detection model or a non-real image detection model.

[0051] Furthermore, the image detection model is also a binary classification model, used to determine whether an image to be detected is a generated image. Specifically, the real image detection model is used to detect whether a real image is a generated image, and the non-real image detection model is used to detect whether a non-real image is a generated image.

[0052] Specifically, when the image to be detected is a real image, a real image detection model is used to detect and identify the image to determine whether it is a generated image. When the image to be detected is a non-real image, a non-real image detection model is used to detect and identify the image to determine whether it is a generated image.

[0053] The image category detection system in this embodiment includes a style classification detection model, a real image detection model, and a non-real image detection model. During image category detection, the style classification model is first used to detect whether the image to be detected is a real or non-real image. Then, the image to be detected with a real style category is input into the real image detection model, which determines whether the image to be detected is a generated or non-generated image. The image to be detected with a non-real style category is input into the non-real image detection model, which determines whether the image to be detected is a generated or non-generated image. After determining the realism of the image to be detected, the system then detects whether the image to be detected is generated, thus improving the accuracy of image detection.
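The flow of S101-S104 can be sketched as a simple routing function. This is a minimal illustration, not the implementation of this application: the function name, the label strings ("real", "non_real", "generated", "non_generated"), and the stand-in models are all hypothetical, and the trained networks are replaced by plain callables.

```python
from typing import Callable, Dict, Tuple

# Hypothetical type: any callable that maps an image to a label string.
Classifier = Callable[[dict], str]


def detect_image_category(
    image: dict,
    style_model: Classifier,
    detectors: Dict[str, Classifier],
) -> Tuple[str, str]:
    """Return (first_target_category, second_target_category) for one image."""
    # S102: the style classification model outputs "real" or "non_real".
    style = style_model(image)
    # S103: pick the detection model that corresponds to the style category.
    target_model = detectors[style]
    # S104: the selected model outputs "generated" or "non_generated".
    return style, target_model(image)


# Stand-in models for illustration only; real models would be trained networks.
def stub_style_model(image: dict) -> str:
    return "real" if image.get("looks_photographic") else "non_real"


def stub_real_detector(image: dict) -> str:
    return "generated" if image.get("has_artifacts") else "non_generated"


def stub_non_real_detector(image: dict) -> str:
    return "generated" if image.get("has_artifacts") else "non_generated"


DETECTORS = {"real": stub_real_detector, "non_real": stub_non_real_detector}
```

Keeping the two detectors in a dictionary keyed by style category makes the selection in S103 a single lookup, and the same function can be applied to each image in a collection.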

[0054] Based on the above embodiments, this application provides a model training method. As shown in Figure 3, the model training method provided in this application mainly includes S201-S205.

[0055] S201. Obtain sample images, including: real non-generated sample images, real generated sample images, non-real generated sample images, and non-real non-generated sample images.

[0056] Sample images refer to the training images used for model training. These images include: real non-generated sample images, real generated sample images, non-real generated sample images, and non-real non-generated sample images. The differences between these sample images can be found in the discussion of various types of images in the images to be detected in the above embodiments, and will not be discussed further in this application embodiment.

[0057] Specifically, real non-generated images and non-real non-generated images can be obtained through open-source datasets or web crawlers. A large number of real and non-real generated images can be generated using various publicly available AIGC tools or image generation platforms. This application's embodiments do not specifically limit the method of obtaining sample images.

[0058] S202. Divide the sample images into a first sample image set, a second sample image set, and a third sample image set. The first sample image set includes real generated sample images and non-real generated sample images. The second sample image set includes real non-generated sample images and real generated sample images. The third sample image set includes non-real non-generated sample images and non-real generated sample images.

[0059] Since the image category detection system includes three models, in this embodiment of the application, all sample images are divided into different sample image sets, and each sample image set is used to train its corresponding network model.

[0060] The first sample image set includes real generated sample images and non-real generated sample images, which are used to train a style classification model. The second sample image set includes real non-generated sample images and real generated sample images, which are used to train a real image detection model. The third sample image set includes non-real non-generated sample images and non-real generated sample images, which are used to train a non-real image detection model.

[0061] After acquiring a large number of sample images in S201, each sample image is labeled to classify it. The sample image categories include: real non-generated sample image, real generated sample image, non-real generated sample image, and non-real non-generated sample image. Each sample image corresponds to only one category. Labeling can be done manually or using existing labeling programs.

[0062] Furthermore, sample images can be labeled based on their source. For example, if a sample image is generated by a generative model, it can be labeled as a generative image. If a sample image is obtained from a camera platform, it can be labeled as a real image. If a sample image is obtained from a comic platform, it can be labeled as a non-real image. After labeling the sample images based on their source, the labeling information can be refined and modified.

[0063] Specifically, the sample images can be divided into a first sample image set, a second sample image set, and a third sample image set based on the annotation information of each sample image.
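The division in S202 can be sketched as follows, assuming each sample carries two annotation fields, a style label and an origin label. The tuple layout and the label strings are hypothetical; the composition of each set follows the description of S202 as stated above.

```python
from typing import Dict, List, Tuple

# Hypothetical annotation layout: (image_id, style, origin), where
# style is "real" or "non_real" and origin is "generated" or "non_generated".
Sample = Tuple[str, str, str]


def split_samples(samples: List[Sample]) -> Dict[str, List[Sample]]:
    """Divide annotated samples into the three training sets of S202."""
    sets: Dict[str, List[Sample]] = {"first": [], "second": [], "third": []}
    for sample in samples:
        _, style, origin = sample
        # First set (style classification model): real generated and
        # non-real generated sample images, as described in S202.
        if origin == "generated":
            sets["first"].append(sample)
        # Second set (real image detection model): real non-generated
        # and real generated sample images.
        if style == "real":
            sets["second"].append(sample)
        # Third set (non-real image detection model): non-real
        # non-generated and non-real generated sample images.
        if style == "non_real":
            sets["third"].append(sample)
    return sets
```

A sample can belong to more than one set: a real generated image, for example, appears in both the first and the second set.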

[0064] S203. Train the first network model using the first sample image set to obtain the style classification model.

[0065] The first network model is the initial model for the style classification model, i.e., the style classification model that has not been trained using the first sample image set. The first network model is a binary classification network model.

[0066] For example, the first network model can be the DaViT network model. The DaViT network model utilizes both spatial attention and channel attention mechanisms, which can more effectively capture global and local information in images and has a better effect on style classification.

[0067] The DaViT network model utilizes window attention to limit the computational scope, thereby reducing the computational complexity of global self-attention. This local attention mechanism helps the model capture local features of the image. In addition to spatial attention, the DaViT network model also considers channel-level attention, which helps the model learn the correlations between different feature channels, thus strengthening the modeling of global information.
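As a rough illustration of why window attention reduces computation, the sketch below counts query-key pairs for global self-attention and for attention restricted to non-overlapping windows. The 56×56 token grid and the 7×7 window used in the usage note are hypothetical values chosen for illustration, not parameters stated in this application, and the count ignores constant factors such as the feature dimension.

```python
from typing import Tuple


def attention_pair_counts(height: int, width: int, window: int) -> Tuple[int, int]:
    """Count query-key pairs for global vs. window self-attention.

    Assumes height and width are both divisible by the window size.
    """
    tokens = height * width
    # Global self-attention: every token attends to every token.
    global_pairs = tokens * tokens
    # Window attention: tokens attend only within their own window.
    num_windows = (height // window) * (width // window)
    tokens_per_window = window * window
    window_pairs = num_windows * tokens_per_window * tokens_per_window
    return global_pairs, window_pairs
```

For a 56×56 grid with 7×7 windows, `attention_pair_counts(56, 56, 7)` returns `(9834496, 153664)`, so the windowed variant evaluates 64 times fewer pairs in this setting.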

[0068] To address the potential class imbalance issue in the sample data of the first sample image set, this embodiment modifies the original loss function of the DaViT network model.

[0069] Specifically, the total loss function of the style classification model is determined by the cross-entropy loss function and the center loss function, as shown in Equation (1).

[0070] $$L_{total} = L_{CE} + \alpha L_{center} \tag{1}$$

[0071] where $L_{total}$ represents the total loss function of the style classification model, $L_{CE}$ represents the cross-entropy loss function, $L_{center}$ represents the center loss function, and $\alpha$ represents the weight of the center loss function, which is used to enhance the intra-class consistency of images of different styles.

[0072] In classification tasks, the cross-entropy loss function is used to measure the difference between the probability distribution predicted by the style classification model and the probability distribution of the true labels. The true labels are usually represented as one-hot encoded, meaning that all positions except the true class position are 0.

[0073] Furthermore, if the model's prediction probability for the correct class is low, the cross-entropy loss function will have a high loss value. In this case, the model weights are adjusted during backpropagation to reduce the loss. When the model's prediction probability for the correct class approaches 1, the cross-entropy loss function approaches 0, indicating that the model's prediction is consistent with the true label.
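The behavior described above can be checked with the standard cross-entropy for a one-hot label, where only the term of the true class remains. This is a generic sketch of plain cross-entropy, without the sample weights or label smoothing used in this embodiment; the function name is hypothetical.

```python
import math
from typing import Sequence


def cross_entropy(probs: Sequence[float], true_index: int) -> float:
    """Cross-entropy for a single sample with a one-hot label.

    With a one-hot label, all terms except the true class vanish,
    so the loss reduces to -log(p_true).
    """
    return -math.log(probs[true_index])


# A low probability on the correct class gives a high loss ...
low_confidence_loss = cross_entropy([0.9, 0.1], true_index=1)
# ... while a probability close to 1 gives a loss close to 0.
high_confidence_loss = cross_entropy([0.001, 0.999], true_index=1)
```

Here `low_confidence_loss` is roughly 2.3, while `high_confidence_loss` is roughly 0.001.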

[0074] The center loss function first calculates the squared differences between the predicted and actual values for all samples, then sums these squared differences, and finally divides by the number of sample images N. This yields the mean squared error (MSE), a metric reflecting the model's prediction accuracy; a smaller MSE indicates better prediction performance.
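The three steps described in this paragraph (square the differences, sum them, divide by N) can be written directly. This is a sketch on scalar values for readability; the function name is hypothetical, and applying it to feature vectors would require summing over vector components as well.

```python
from typing import Sequence


def mean_squared_error(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Mean of squared differences between predicted and actual values."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    # Step 1 and 2: square each difference and sum the results.
    squared_sum = sum((p - a) ** 2 for p, a in zip(predicted, actual))
    # Step 3: divide by the number of samples N.
    return squared_sum / len(predicted)
```

For example, `mean_squared_error([1.0, 0.0], [1.0, 1.0])` evaluates to `0.5`: one sample is predicted exactly and the other is off by 1.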

[0075] Specifically, the cross-entropy loss function is expressed by formula (2);

[0076]

[0077] where $N$ represents the total number of sample images in the first sample image set; $C$ represents the number of categories in the first sample image set; $i$ represents the sample image index in the first sample image set; $j$ represents the category index in the first sample image set; $\lambda_i$ represents the weight value of the i-th sample image; $p_{ij}$ represents the probability, predicted by the style classification model, that the i-th sample image belongs to category $j$; and $\varepsilon$ represents the label smoothing coefficient.

[0078] Specifically, C represents the number of categories in the first sample image set, with an optional C value of 2. ε represents the label smoothing coefficient, used to prevent the style classification model from becoming overconfident. Specifically, to make the style classification model more robust, ε is used to smooth the probability values, reducing overfitting and improving the model's generalization ability.

[0079] $\lambda_i$ represents the weight value of the i-th sample image, used to balance the influence of different categories. $y_{ij}$ represents the one-hot encoding of the realism label of the i-th sample image with respect to the j-th category.
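Formula (2) itself is not reproduced in the text above, so the following is only a sketch of one common way to combine per-sample weights and label smoothing in a cross-entropy loss. The exact form, the function name, and the default smoothing coefficient are assumptions and should not be read as the formula of this application; the variable names mirror the symbols defined above (sample weight, predicted probability, smoothing coefficient, one-hot label).

```python
import math
from typing import Sequence


def smoothed_weighted_ce(
    probs: Sequence[Sequence[float]],  # probs[i][j]: predicted probability of category j
    labels: Sequence[int],             # labels[i]: index of the true category
    weights: Sequence[float],          # weights[i]: per-sample weight
    eps: float = 0.1,                  # label smoothing coefficient (assumed default)
) -> float:
    """Assumed form: -(1/N) * sum_i w_i * sum_j [(1-eps)*y_ij + eps/C] * log(p_ij)."""
    n = len(probs)
    total = 0.0
    for p_i, label, weight in zip(probs, labels, weights):
        c = len(p_i)
        for j, p_ij in enumerate(p_i):
            y_ij = 1.0 if j == label else 0.0
            # Smoothed target: move a share eps of the mass to a uniform distribution.
            target = (1.0 - eps) * y_ij + eps / c
            total -= weight * target * math.log(p_ij)
    return total / n
```

With `eps=0.0` and all weights equal to 1, this reduces to the ordinary cross-entropy; raising the weight of samples from an under-represented category increases their contribution to the loss.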

[0080] Specifically, the center loss function is expressed by formula (3);

[0081]

[0082] where $y_i$ represents the one-hot vector of the realism label for the i-th sample image, $x_i$ represents the feature vector of the i-th sample image before feature extraction by the feature extractor, and $f(x_i)$ represents the feature vector extracted by the feature extractor from the i-th sample image. The explanations of the other parameters can be found in the descriptions in the above embodiments.

[0083] By incorporating the cross-entropy loss function and center loss function into the DaViT network model and adjusting appropriate parameters for training, a style classification model is obtained. This model can classify the detected image into either a true style category or a non-true style category. In other words, the style classification model can determine whether the detected image belongs to the true style category or a non-true style category.

[0084] S204. Train the second network model using the second sample image set to obtain the real image detection model.

[0085] S205. The third network model is trained using the third sample image set to obtain the non-real image detection model.

[0086] The second and third network models have the same model structure. In other words, the model structure of the real image detection model and the non-real image detection model are the same; the difference lies in the different training sample images used, resulting in different model parameters. The model structures of the second and third network models can be referred to the description in the following embodiments.

[0087] The second network model is trained using real generated sample images to obtain a real image detection model; the third network model is trained using non-real, non-generated sample images to obtain a real image detection model.

[0088] In this embodiment, a binary classification network for image detection that generates realistic-style images using real-generated sample images and real-non-generated sample images is used. A binary classification network for generating images of non-real-style images is trained using non-real-style generated data and non-real-style ungenerated data. Compared to directly training a multi-classification network using four types of classification data, this training scheme has a simpler decision boundary, lower requirements for training data, and can be specifically trained for realistic-style generated images, improving the detection performance for this type of data.

[0089] In this embodiment, an image category detection system with three binary classification models is obtained. During detection, the system first determines whether the style of the image is realistic, and then feeds the image into the image detection model corresponding to that style to improve detection accuracy. In addition, the two image detection models are robust: the real image detection model also has good detection accuracy for generated images of non-realistic style, and vice versa, so that detection accuracy remains high even if the style classification is incorrect.
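The classify-then-detect flow of the three-model system can be sketched as below. The three stub functions, the label strings, and the dictionary-based image stand-in are assumptions made for illustration; in practice they would be replaced by the trained style classification model and the two trained image detection models.

```python
from typing import Callable, Dict, Tuple

# Hypothetical label encoding; this application names the categories but not their encoding.
REAL, NON_REAL = "real_style", "non_real_style"


def detect_image_category(
    image: dict,
    style_classifier: Callable[[dict], str],
    detectors: Dict[str, Callable[[dict], str]],
) -> Tuple[str, str]:
    """Return (first target category, second target category)."""
    first_category = style_classifier(image)     # style decision
    target_detector = detectors[first_category]  # model selection by style
    second_category = target_detector(image)     # generated vs. non-generated
    return first_category, second_category


# Stub models standing in for the three trained binary classifiers.
def stub_style_classifier(image: dict) -> str:
    return REAL if image["looks_photographic"] else NON_REAL


def stub_real_detector(image: dict) -> str:
    return "generated" if image["score"] > 0.5 else "non_generated"


def stub_non_real_detector(image: dict) -> str:
    return "generated" if image["score"] > 0.7 else "non_generated"


detectors = {REAL: stub_real_detector, NON_REAL: stub_non_real_detector}
result_real = detect_image_category(
    {"looks_photographic": True, "score": 0.6}, stub_style_classifier, detectors
)
result_non_real = detect_image_category(
    {"looks_photographic": False, "score": 0.6}, stub_style_classifier, detectors
)
```

Keeping the detectors in a mapping keyed by style category makes the selection step a single lookup.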

[0090] Based on the above embodiments, the image category detection method in this application is further optimized. As shown in Figure 4, the optimized image category detection method in this application mainly includes:

[0091] S301. Obtain the image to be detected.

[0092] S302. Classify the image to be detected using a style classification model to obtain a first target category of the image to be detected. The first target category includes a true style category or a non-true style category.

[0093] S303. Use the image detection model corresponding to the first target category as the target detection model.

[0094] The execution flow of S301-S303 in this embodiment is the same as that of S101-S103 in the above embodiment. For details, refer to the description in the above embodiment; they are not repeated here.

[0095] S304. Use an image feature extraction network to extract image features of different scales from the image to be detected.

[0096] In this embodiment, the image detection model employs an improved DaViT network model. Specifically, the DaViT-Tiny network is loaded using the timm library and then modified and optimized. The input image size for the DaViT-Tiny network is 224×224×3. DaViT-Tiny is a hybrid model that combines convolutional neural networks (CNN) and Transformers.

[0097] In one possible implementation, the image feature extraction network has at least two stages, each stage including a block embedding layer and multiple dual attention blocks. The block embedding layer is used to divide the input image into a series of non-overlapping small blocks; the dual attention blocks include a spatial attention layer and a channel attention layer.

[0098] As shown in Figure 5, the improved DaViT network model consists of four stages connected sequentially. Each stage includes a block embedding (patch embedding) layer and multiple dual attention blocks.

[0099] The block embedding layer divides the input image or feature map into a series of non-overlapping blocks. Each block is then flattened and transformed into a fixed-length vector. In other words, each block is transformed into a higher-dimensional feature vector to facilitate subsequent processing by the dual attention block.
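The division into non-overlapping blocks followed by flattening can be illustrated on a toy single-channel image. This is a simplified sketch: the function name is hypothetical, and the learned projection that maps each flattened block to a higher-dimensional feature vector is omitted.

```python
from typing import List


def split_into_patches(image: List[List[float]], patch: int) -> List[List[float]]:
    """Split an H x W image into non-overlapping patch x patch blocks and flatten each block."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append(
                [image[r][c]
                 for r in range(top, top + patch)
                 for c in range(left, left + patch)]
            )
    return patches


# 4 x 4 toy image whose pixel values equal their row-major index.
image = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
patches = split_into_patches(image, 2)
```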

[0100] The dual attention block consists of a spatial attention layer and a channel attention layer. The spatial attention layer focuses on which locations in the image are more important, while the channel attention layer focuses on which channels (or feature maps) of the image features are more critical. The spatial and channel attention layers can be viewed as weighting the feature maps in two different dimensions. The spatial attention layer generates a weight matrix of the same size as the input feature map to highlight important regions in the image; while the channel attention layer produces a weight vector matching the number of channels to emphasize or suppress specific feature channels. Spatial and channel attention layers are used alternately to capture both global and local features of the image.

[0101] Specifically, as shown in Figure 5, the output image size of stage 1 is 56×56, and the parameters of the block embedding layer are: kernel size 7, stride 4, padding of 3 pixels at the image edges, and number of output channels $C_1 = 96$. `vin.ze.7x7` indicates that the dual attention blocks in the first stage use a 7×7 local attention window, and $P_w = 49$ indicates that this local attention window contains 49 pixels. $N_h^1 = N_g^1 = 3$ indicates that the dual attention blocks in the first stage include 3 spatial attention layers and 3 channel attention layers; $C_h^1 = C_g^1 = 32$ indicates that each spatial attention layer and each channel attention layer has 32 output channels.

[0102] The output image size of stage 2 is 28×28, and the parameters of the block embedding layer are: kernel size 2, stride 2, padding of 0 pixels at the image edges, and number of output channels $C_2 = 192$. `vin.ze.7x7` indicates that the dual attention blocks in the second stage use a 7×7 local attention window, and $P_w = 49$ indicates that this local attention window contains 49 pixels. $N_h^2 = N_g^2 = 6$ indicates that the dual attention blocks in the second stage include 6 spatial attention layers and 6 channel attention layers; $C_h^2 = C_g^2 = 32$ indicates that each spatial attention layer and each channel attention layer has 32 output channels.

[0103] The output image size of stage 3 is 14×14, and the parameters of the block embedding layer are: kernel size 2, stride 2, padding of 0 pixels at the image edges, and number of output channels $C_3 = 384$. `vin.ze.7x7` indicates that the dual attention blocks in the third stage use a 7×7 local attention window, and $P_w = 49$ indicates that this local attention window contains 49 pixels. $N_h^3 = N_g^3 = 12$ indicates that the dual attention blocks in the third stage include 12 spatial attention layers and 12 channel attention layers; $C_h^3 = C_g^3 = 32$ indicates that each spatial attention layer and each channel attention layer has 32 output channels.

[0104] The output image size of stage 4 is 7×7, and the parameters of the block embedding layer are: kernel size 2, stride 2, padding of 0 pixels at the image edges, and number of output channels $C_4 = 768$. `vin.ze.7x7` indicates that the dual attention blocks in the fourth stage use a 7×7 local attention window, and $P_w = 49$ indicates that this local attention window contains 49 pixels. $N_h^4 = N_g^4 = 24$ indicates that the dual attention blocks in the fourth stage include 24 spatial attention layers and 24 channel attention layers; $C_h^4 = C_g^4 = 32$ indicates that each spatial attention layer and each channel attention layer has 32 output channels.

[0105] Different network parameters are used in the spatial attention layer and channel attention layer at different stages so that the stacked dual attention modules can capture the layout features and global features of the image.
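The stage output sizes follow from the standard convolution output-size formula applied to the kernel, stride, and padding of each block embedding layer. The short check below is a sketch under that assumption; the channel counts 96, 192, 384, and 768 are those of DaViT-Tiny, and the function and variable names are hypothetical.

```python
def conv_output_size(size: int, kernel: int, stride: int, pad: int) -> int:
    """Spatial output size of a convolution: floor((size + 2*pad - kernel) / stride) + 1."""
    return (size + 2 * pad - kernel) // stride + 1


# (kernel, stride, pad, output channels) of the block embedding layer in stages 1 to 4.
stage_params = [(7, 4, 3, 96), (2, 2, 0, 192), (2, 2, 0, 384), (2, 2, 0, 768)]

size = 224  # input images are 224 x 224 x 3
stage_shapes = []
for kernel, stride, pad, channels in stage_params:
    size = conv_output_size(size, kernel, stride, pad)
    stage_shapes.append((size, size, channels))
```

Under these parameters the spatial resolution is divided by 4 in the first stage and halved in each later stage, which matches the 56, 28, 14, and 7 sizes stated for the four stages.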

[0106] S305. For each scale of image features, extract texture features from the image features using a texture feature extraction network.

[0107] AIGC-generated images with a realistic style typically have distinct edges and details that simulate realism, whereas non-realistic AIGC-generated images often exhibit artistic textures and brushstroke features. Texture features therefore play a crucial role in the detection of AIGC-generated images. The dual-attention mechanism ensures that the image detection model attends to both global and local features. Building on this, a texture feature extraction network is added to extract texture features separately, and these features are finally fused with those extracted by the DaViT network to improve detection performance.

[0108] In one possible implementation, as shown in Figure 6, the texture feature extraction network includes a first convolutional layer 61, a Gram matrix 62, a second convolutional layer 63, a third convolutional layer 64, and a first average pooling layer 65 connected in sequence; the first, second, and third convolutional layers have the same convolution kernel, the first and third convolutional layers have the same number of channels, and the first and second convolutional layers have the same stride.

[0109] Specifically, the first convolutional layer 61 is a convolution with a kernel size of 3, 48 channels, and a stride of 1. The second convolutional layer 63 is a convolution with a kernel size of 3, 24 channels, and a stride of 2. The third convolutional layer 64 is a convolution with a kernel size of 3, 24 channels, and a stride of 4.
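The Gram matrix step between the first and second convolutional layers can be illustrated as follows. This is a minimal sketch: each row of `features` is assumed to hold the flattened activations of one channel, and the division by the number of spatial positions is an assumed normalization that this application does not specify.

```python
from typing import List


def gram_matrix(features: List[List[float]]) -> List[List[float]]:
    """Channel-by-channel inner products of flattened feature maps (C x C matrix)."""
    positions = len(features[0])
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) / positions for row_j in features]
        for row_i in features
    ]


# Two channels, each with four spatial positions (a flattened 2 x 2 map).
features = [
    [1.0, 2.0, 3.0, 4.0],
    [1.0, 0.0, 1.0, 0.0],
]
gram = gram_matrix(features)
```

Because the Gram matrix records how channels co-activate regardless of position, it summarizes texture rather than layout, which is why it is placed before the remaining convolutions and the pooling layer.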

[0110] S306. Use a fusion layer to fuse image features at different scales and the texture features corresponding to image features at each scale to obtain fused features.

[0111] This embodiment takes an image feature extraction network with four stages as an example. As shown in Figure 7, the image to be detected is input into the input layer 71 of the image detection model; the size of the input image is 224×224×3. After feature extraction in the first stage 72, a first feature map of 56×56×96 is obtained. The first feature map is then input into the second stage 73 for feature extraction, resulting in a second feature map of 28×28×192. The first feature map is also input into the first texture feature extraction network 76, resulting in a first texture feature of 1×1×48. The second feature map is input into the second texture feature extraction network 77, resulting in a second texture feature of 1×1×48. The first texture feature and the second texture feature are merged to obtain a third texture feature.

[0112] The second feature map is input into the third stage 74 for feature extraction, resulting in a 14×14×384 third feature map; the third feature map is input into the third texture feature extraction network 78, resulting in a 1×1×48 fourth texture feature; the third texture feature and the fourth texture feature are merged to obtain the fifth texture feature.

[0113] The third feature map is input into the fourth stage 75 for feature extraction, resulting in a 7×7×768 fourth feature map. The fourth feature map is then passed through the second average pooling layer 79 and the first fully connected (FC) layer 710 to obtain the fifth feature.

[0114] After merging the fifth feature and the fifth texture feature, and passing through the second FC layer 711 and the output layer 712, the second target category of the image to be detected is obtained.
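The order in which the features are merged can be traced with simple vectors. This sketch assumes that "merged" means concatenation along the channel dimension, which this application does not state explicitly; `FC1_WIDTH`, the output width of the first fully connected layer, is a hypothetical value, and the constant vectors are placeholders for real features.

```python
from typing import List

FC1_WIDTH = 256  # hypothetical output width of the first FC layer (not given in this application)


def concat(*vectors: List[float]) -> List[float]:
    """Concatenate feature vectors along the channel dimension."""
    merged: List[float] = []
    for vector in vectors:
        merged.extend(vector)
    return merged


first_texture = [0.1] * 48   # from the first feature map
second_texture = [0.2] * 48  # from the second feature map
fourth_texture = [0.3] * 48  # from the third feature map

third_texture = concat(first_texture, second_texture)  # first + second
fifth_texture = concat(third_texture, fourth_texture)  # third + fourth
fifth_feature = [0.0] * FC1_WIDTH                      # pooled stage-4 map after the first FC layer
fused = concat(fifth_feature, fifth_texture)           # input to the second FC layer
```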

[0115] S307. The fused features are subjected to binary classification using a fully connected layer to obtain the second target category of the image to be detected.

[0116] The fused features are fed into a fully connected layer for binary classification, and the loss function is a binary classification cross-entropy loss with label smoothing.

[0117] Following the fully connected layer is an output layer, which uses the Softmax function to convert the raw scores into the probability that the detected image belongs to the generated image class and the probability that it belongs to the non-generated image class.
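The conversion of raw scores into class probabilities by Softmax can be sketched as follows. The score values and the ordering of the two classes are assumptions made for illustration.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Convert raw scores into probabilities that sum to 1."""
    largest = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw scores: index 0 = generated class, index 1 = non-generated class.
probabilities = softmax([2.0, 0.0])
predicted_index = probabilities.index(max(probabilities))
```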

[0118] To verify the effectiveness of the network, class activation visualization was performed on a generated image for comparison. As shown in Figure 8, the left image is an image generated using an existing generative model, the middle image is the class activation map of DaViT, and the right image is the class activation map of the image category detection system provided in this application embodiment. It can be seen that, compared with DaViT, the image category detection system pays more attention to the edges and texture regions of the image due to the addition of the texture feature extraction network, which helps to better distinguish between the generated image and the real image.

[0119] Figure 9 is a schematic diagram of the structure of an image category detection device according to an embodiment of this application. As shown in Figure 9, the image category detection device 90 provided in this application embodiment mainly includes: an image acquisition module 91, a style detection module 92, a model selection module 93, and a generation detection module 94.

[0120] The system includes an image acquisition module 91 for acquiring an image to be detected; a style detection module 92 for classifying the image to be detected using a style classification model to obtain a first target category corresponding to the image to be detected, wherein the first target category includes a true style category or a non-true style category; a model selection module 93 for using an image detection model corresponding to the first target category as a target detection model; and a generation detection module 94 for detecting the image to be detected using the target image detection model to obtain a second target category corresponding to the image to be detected, wherein the second target category includes a generated category or a non-generated category.

[0121] This application provides an image category detection device, which performs the following steps: acquiring an image to be detected; classifying the image to be detected using a style classification model to obtain a first target category, wherein the first target category includes a true style category or a non-true style category; using the image detection model corresponding to the first target category as a target detection model; and detecting the image to be detected using the target image detection model to obtain a second target category, wherein the second target category includes a generated category or a non-generated category. By first determining the style category of the image to be detected using a style classification model, and then inputting the image into the image detection model corresponding to that style category to determine whether it is a generated image, the accuracy of image category detection is improved.

[0122] In one possible implementation, the system further includes a model training module for: acquiring sample images, the sample images including real non-generated sample images, real generated sample images, non-real generated sample images, and non-real non-generated sample images; dividing the sample images into a first sample image set, a second sample image set, and a third sample image set, wherein the first sample image set includes real generated sample images and non-real generated sample images, the second sample image set includes real non-generated sample images and real generated sample images, and the third sample image set includes non-real non-generated sample images and non-real generated sample images; training a first network model using the first sample image set to obtain a style classification model; training a second network model using the second sample image set to obtain a real image detection model; and training a third network model using the third sample image set to obtain a non-real image detection model.

[0123] In one possible implementation, the model selection module 93 is specifically used to select the real image detection model as the target detection model when the first target category is a real style category, and to select the non-real image detection model as the target detection model when the first target category is a non-real style category.

[0124] In one possible implementation, the total loss function of the style classification model is determined by the cross-entropy loss function and the center loss function, wherein the cross-entropy loss function is expressed by the following formula;

[0125]

[0126] Where $N$ represents the total number of sample images in the first sample image set; $C$ represents the number of categories in the first sample image set; $i$ represents the sample image index in the first sample image set; $j$ represents the category index in the first sample image set; $w_i$ represents the weight value of the i-th sample image; $p_{ij}$ is the probability, predicted by the style classification model, that the i-th sample image belongs to class $j$; $\varepsilon$ represents the label smoothing coefficient; and $y_{ij}$ is the entry for class $j$ of the one-hot vector representing the realism label of the i-th sample image;
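One common form of a weighted, label-smoothed cross-entropy that uses exactly these quantities is $-\frac{1}{N}\sum_{i}\text{weight}_i\sum_{j}\left[(1-\varepsilon)\,y_{ij}+\varepsilon/C\right]\log p_{ij}$. The sketch below implements that form as an assumption; it is not asserted to be the exact formula of this application, and the function name and the example values are hypothetical.

```python
import math
from typing import List


def smoothed_cross_entropy(
    probs: List[List[float]],   # probs[i][j]: predicted probability of class j for sample i
    labels: List[List[float]],  # labels[i][j]: one-hot realism label of sample i
    weights: List[float],       # weights[i]: weight value of sample i
    eps: float,                 # label smoothing coefficient
) -> float:
    """Weighted cross-entropy against smoothed targets (assumed form, see lead-in)."""
    n = len(probs)
    c = len(probs[0])
    total = 0.0
    for p_i, y_i, weight_i in zip(probs, labels, weights):
        for p_ij, y_ij in zip(p_i, y_i):
            target = (1.0 - eps) * y_ij + eps / c  # smoothed target for class j
            total -= weight_i * target * math.log(p_ij)
    return total / n


probs = [[0.9, 0.1], [0.2, 0.8]]
labels = [[1.0, 0.0], [0.0, 1.0]]
weights = [1.0, 1.0]
loss_plain = smoothed_cross_entropy(probs, labels, weights, eps=0.0)
loss_smoothed = smoothed_cross_entropy(probs, labels, weights, eps=0.1)
```

With the smoothing coefficient set to 0 the expression reduces to the ordinary weighted cross-entropy; a positive coefficient penalizes over-confident predictions.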

[0127] The center loss function is expressed by the following formula:

[0128]

[0129] Where $y_i$ is the one-hot vector representing the realism label of the i-th sample image; $x_i$ represents the i-th sample image before feature extraction by the feature extractor; and $f(x_i)$ represents the feature vector extracted by the feature extractor from the i-th sample image.
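A center loss is commonly written as half the sum, over all samples, of the squared distance between the extracted feature vector of a sample and the center of its class. The sketch below follows that common form as an assumption; the per-class centers and all names and values are hypothetical, since the surviving text does not define them.

```python
from typing import Dict, List


def center_loss(
    features: List[List[float]],      # features[i]: feature vector extracted from sample i
    class_ids: List[int],             # class_ids[i]: class index encoded by the one-hot label of sample i
    centers: Dict[int, List[float]],  # assumed per-class feature centers
) -> float:
    """Half the sum of squared distances between each feature and its class center."""
    total = 0.0
    for feature, class_id in zip(features, class_ids):
        center = centers[class_id]
        total += sum((a - b) ** 2 for a, b in zip(feature, center))
    return 0.5 * total


features = [[1.0, 2.0], [3.0, 4.0]]
class_ids = [0, 1]
centers = {0: [1.0, 1.0], 1: [3.0, 2.0]}
loss_center = center_loss(features, class_ids, centers)
```

Minimizing this term pulls features of the same style category together, which complements the cross-entropy term that separates the categories.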

[0130] In one possible implementation, the target image detection model includes an image feature extraction network, a texture feature extraction network, a fusion layer, and a fully connected layer; the generation detection module 94 is specifically used to extract image features of different scales in the image to be detected using the image feature extraction network; for each scale of image features, the texture feature extraction network extracts texture features from the image features; the fusion layer fuses the image features of different scales and the texture features corresponding to each scale of image features to obtain fused features; and the fully connected layer performs binary classification processing on the fused features to obtain the second target category of the image to be detected.

[0131] In one possible implementation, the texture feature extraction network includes a first convolutional layer, a Gram matrix, a second convolutional layer, a third convolutional layer, and a first average pooling layer connected in sequence; wherein the first, second, and third convolutional layers have the same convolution kernel, the first and third convolutional layers have the same number of channels, and the first and second convolutional layers have the same stride.

[0132] In one possible implementation, the image feature extraction network has at least two stages, each stage including a block embedding layer and multiple dual attention blocks. The block embedding layer is used to divide the input image into a series of non-overlapping small blocks; the dual attention blocks include a spatial attention layer and a channel attention layer.

[0133] The image category detection device provided in this application embodiment can execute the image category detection method provided in any embodiment of the present invention, and has the corresponding functional modules and beneficial effects of executing the method.

[0134] Figure 10 is a schematic diagram of the structure of an electronic device provided in this embodiment. The electronic device may include an image category detection device. As shown in Figure 10, the electronic device 1000 includes a processor 1010, a memory 1020, an input device 1030, and an output device 1040. The number of processors 1010 in the electronic device can be one or more; Figure 10 takes one processor 1010 as an example. The processor 1010, memory 1020, input device 1030, and output device 1040 in the electronic device can be connected via a bus or other means; Figure 10 takes connection via a bus as an example.

[0135] The memory 1020, as a computer-readable storage medium, can be used to store software programs, computer-executable programs, and modules, such as the program instructions / modules corresponding to the image category detection method in this embodiment of the invention. The processor 1010 executes various functional applications and data processing of the electronic device by running the software programs, instructions, and modules stored in the memory 1020, thereby implementing the image category detection method provided in this embodiment of the invention.

[0136] The memory 1020 may primarily include a program storage area and a data storage area. The program storage area may store the operating system and at least one application program required for a given function; the data storage area may store data created based on terminal usage. Furthermore, the memory 1020 may include high-speed random access memory and non-volatile memory, such as at least one disk storage device, flash memory, or other non-volatile solid-state storage device. In some instances, the memory 1020 may further include memory remotely located relative to the processor 1010, which can be connected to the electronic device via a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.

[0137] Input device 1030 can be used to receive input digital or character information, and to generate key signal inputs related to user settings and function control of electronic devices, and may include a keyboard, mouse, etc. Output device 1040 may include display devices such as a display screen.

[0138] This embodiment also provides a storage medium containing computer-executable instructions, which, when executed by a computer processor, are used to implement the image category detection method provided in this embodiment of the invention.

[0139] Of course, the computer-executable instructions provided in the embodiments of the present invention are not limited to the method operations described above, but can also perform related operations in the image category detection method provided in any embodiment of the present invention.

[0140] Based on the above description of the implementation methods, those skilled in the art can clearly understand that the present invention can be implemented using software and necessary general-purpose hardware, and of course, it can also be implemented using hardware, but in many cases the former is a better implementation method. Based on this understanding, the technical solution of the present invention, or the part that contributes to the prior art, can be embodied in the form of a software product. This computer software product can be stored in a computer-readable storage medium, such as a computer floppy disk, read-only memory (ROM), random access memory (RAM), flash memory, hard disk, or optical disk, etc., including several instructions to cause a computer device (which may be a personal computer, server, or network device, etc.) to execute the methods described in the various embodiments of the present invention.

[0141] It is worth noting that in the embodiments of the image category detection device described above, the various units and modules included are only divided according to functional logic, but are not limited to the above division, as long as the corresponding functions can be achieved; in addition, the specific names of each functional unit are only for easy differentiation and are not used to limit the scope of protection of the present invention.

[0142] It should be noted that, in this document, relational terms such as "first" and "second" are used merely to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Furthermore, the terms "comprising," "including," or any other variations thereof are intended to cover non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or apparatus. Without further limitations, an element defined by the phrase "comprising one..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that includes said element.

[0143] The above description is merely a specific embodiment of this application, enabling those skilled in the art to understand or implement this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments described herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A method of detecting an image category, characterized in that the method includes: acquiring an image to be detected; classifying the image to be detected using a style classification model to obtain a first target category of the image to be detected, wherein the first target category includes a true style category or a non-true style category; using the image detection model corresponding to the first target category as a target image detection model; and detecting the image to be detected using the target image detection model to obtain a second target category of the image to be detected, wherein the second target category includes a generated category or a non-generated category; wherein the target image detection model includes an image feature extraction network, a texture feature extraction network, a fusion layer, and a fully connected layer; wherein detecting the image to be detected using the target image detection model to obtain the second target category of the image to be detected includes: extracting image features at different scales from the image to be detected using the image feature extraction network; for each scale of image features, extracting texture features from the image features using the texture feature extraction network; fusing the image features at different scales and the texture features corresponding to the image features at each scale using the fusion layer to obtain fused features; and performing binary classification on the fused features using the fully connected layer to obtain the second target category of the image to be detected; wherein the texture feature extraction network includes a first convolutional layer, a Gram matrix, a second convolutional layer, a third convolutional layer, and a first average pooling layer connected in sequence; the first, second, and third convolutional layers have the same convolution kernel, the first and third convolutional layers have the same number of channels, and the first and second convolutional layers have the same stride; and wherein the image feature extraction network has at least two stages, each stage including a block embedding layer and multiple dual attention blocks; the block embedding layer is used to divide the input image into a series of non-overlapping small blocks; and the dual attention blocks include a spatial attention layer and a channel attention layer.

2. The method of claim 1, wherein the method also includes: acquiring sample images, the sample images including real non-generated sample images, real generated sample images, non-real generated sample images, and non-real non-generated sample images; dividing the sample images into a first sample image set, a second sample image set, and a third sample image set, wherein the first sample image set includes real generated sample images and non-real generated sample images, the second sample image set includes real non-generated sample images and real generated sample images, and the third sample image set includes non-real non-generated sample images and non-real generated sample images; training a first network model using the first sample image set to obtain the style classification model; training a second network model using the second sample image set to obtain a real image detection model; and training a third network model using the third sample image set to obtain a non-real image detection model.

3. The method of claim 2, wherein using the image detection model corresponding to the first target category as the target detection model includes: if the first target category is a real style category, selecting the real image detection model as the target detection model; and if the first target category is a non-real style category, selecting the non-real image detection model as the target detection model.

4. The method according to any one of claims 1-3, characterized in that the total loss function of the style classification model is determined by a cross-entropy loss function and a center loss function; the cross-entropy loss function is expressed by a formula in which $N$ represents the total number of sample images in the first sample image set, $C$ represents the number of categories in the first sample image set, $i$ represents the index of a sample image in the first sample image set, $j$ represents the index of a category in the first sample image set, $w_i$ represents the weight value of the i-th sample image, $p_{ij}$ represents the probability, predicted by the style classification model, that the i-th sample image belongs to class $j$, $\varepsilon$ represents the label smoothing coefficient, and $y_{ij}$ represents the entry for class $j$ of the one-hot vector representing the realism label of the i-th sample image; and the center loss function is expressed by a formula in which $y_i$ represents the one-hot vector representing the realism label of the i-th sample image, $x_i$ represents the i-th sample image before feature extraction by the feature extractor, and $f(x_i)$ represents the feature vector extracted by the feature extractor from the i-th sample image.

5. An image category detection device, characterized in that the device includes: an image acquisition module for acquiring an image to be detected; a style detection module for classifying the image to be detected using a style classification model to obtain a first target category corresponding to the image to be detected, wherein the first target category includes a true style category or a non-true style category; a model selection module for selecting the image detection model corresponding to the first target category as a target image detection model; and a generation detection module for detecting the image to be detected using the target image detection model to obtain a second target category corresponding to the image to be detected, wherein the second target category includes a generated category or a non-generated category; wherein the target image detection model includes an image feature extraction network, a texture feature extraction network, a fusion layer, and a fully connected layer; wherein detecting the image to be detected using the target image detection model to obtain the second target category of the image to be detected includes: extracting image features at different scales from the image to be detected using the image feature extraction network; for each scale of image features, extracting texture features from the image features using the texture feature extraction network; fusing the image features at different scales and the texture features corresponding to the image features at each scale using the fusion layer to obtain fused features; and performing binary classification on the fused features using the fully connected layer to obtain the second target category of the image to be detected; wherein the texture feature extraction network includes a first convolutional layer, a Gram matrix, a second convolutional layer, a third convolutional layer, and a first average pooling layer connected in sequence; the first, second, and third convolutional layers have the same convolution kernel, the first and third convolutional layers have the same number of channels, and the first and second convolutional layers have the same stride; and wherein the image feature extraction network has at least two stages, each stage including a block embedding layer and multiple dual attention blocks; the block embedding layer is used to divide the input image into a series of non-overlapping small blocks; and the dual attention blocks include a spatial attention layer and a channel attention layer.

6. An electronic device, characterized in that the device includes: one or more processors; and a storage device for storing one or more programs; wherein, when the one or more programs are executed by the one or more processors, the one or more processors implement the image category detection method as described in any one of claims 1-4.

7. A storage medium having a computer program stored thereon, characterized in that, when executed by a processor, the program implements the image category detection method as described in any one of claims 1-4.