Image detection model training and image detection method and device, and storage medium

By training an image detection model and utilizing adversarial loss and global feature analysis, the problem of inaccurate image authenticity detection is solved, achieving efficient identification of forged images and adaptability to multiple scenarios, thereby improving information security.

CN116168211B · Active · Publication Date: 2026-06-19 · ALIPAY (HANGZHOU) INFORMATION TECH CO LTD

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
ALIPAY (HANGZHOU) INFORMATION TECH CO LTD
Filing Date
2022-12-09
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing technologies are inaccurate in detecting the authenticity of images and inaccurate in extracting image features, making it difficult to effectively identify fake images generated by unknown AI face-swapping algorithms, and they lack robustness in multiple scenarios.

Method used

By training two image detection models and using real and fake images of the same original image as samples, the adversarial loss between the models is calculated to enhance the model's learning of the differences in features between real and fake images. By combining global and local feature analysis, an attention module is added to improve detection accuracy.

Benefits of technology

This improves the model's ability to detect forged images, enabling it to identify forged images more accurately, enhancing detection robustness in multiple scenarios, and protecting user information security.



Abstract

This specification discloses an image detection model training method, apparatus, and storage medium. Two image detection models are trained simultaneously using a combination of a real image and a forged image of the same original image. Based on the image features output by each model, an adversarial loss is determined between the two models. This adversarial loss is then used to train the models to learn the differences between real and forged image features. The two image detection models can not only train their own feature extraction capabilities separately but also compare them, learning the differences between real and forged image features in a single iteration. This allows the models to output forged image features that are significantly different from those of the real image when faced with a forged image, aiding in subsequent model judgments.

Description

Technical Field

[0001] The embodiments in this specification relate to the field of computer artificial intelligence technology, and in particular to an image detection model training and image detection method, apparatus, and storage medium.

Background Technology

[0002] With the continuous development of facial recognition and image processing technologies in recent years, methods for face-swapping in images or videos using artificial intelligence (AI) have emerged. In today's internet environment, users' facial information is often linked to a large amount of private information, such as identity information, platform account information, and payment information. Therefore, forged images created by altering or replacing faces in images and videos using AI face-swapping technology pose a significant risk to network information security. Consequently, the ability to accurately determine the authenticity of faces in images or videos has become a crucial factor in enhancing network system security.

Summary of the Invention

[0003] This specification provides an image detection model training method, apparatus, and storage medium, which can solve the technical problems of inaccurate image authenticity detection and inaccurate image features in related technologies.

[0004] In a first aspect, embodiments of this specification provide an image detection model training method, the method comprising:

[0005] The first sample image is input into the first image detection model, and the second sample image is input into the second image detection model. The first sample image and the second sample image are real images or fake images of the same original image.

[0006] Based on the first image features output by the first image detection model for the first sample image, and the second image features output by the second image detection model for the second sample image, the adversarial loss between the first image detection model and the second image detection model is determined.

[0007] The first image detection model and the second image detection model are trained based on the adversarial loss.

[0008] In a second aspect, embodiments of this specification provide an image detection method, the method comprising:

[0009] Obtain the image to be detected and input the image to be detected into the image detection model;

[0010] Based on the output data of the image detection model, it is determined whether the image to be detected is a real image or a fake image;

[0011] The image detection model is either the first image detection model or the second image detection model as described in any one of claims 1 to 7.

[0012] In a third aspect, embodiments of this specification provide an image detection model training apparatus, the apparatus comprising:

[0013] The sample input module is used to input a first sample image into a first image detection model and a second sample image into a second image detection model, wherein the first sample image and the second sample image are real images or fake images of the same original image;

[0014] The loss calculation module is used to determine the adversarial loss between the first image detection model and the second image detection model based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image.

[0015] The model training module is used to train the first image detection model and the second image detection model based on the adversarial loss.

[0016] In a fourth aspect, embodiments of this specification provide an image detection apparatus, which includes:

[0017] An image input module is used to acquire an image to be detected and input the image to be detected into an image detection model.

[0018] The authenticity detection module is used to determine whether the image to be detected is a real image or a fake image based on the output data of the image detection model.

[0019] The image detection model is either the first image detection model or the second image detection model described in any of the embodiments of the above specification.

[0020] In a fifth aspect, embodiments of this specification provide a computer program product containing instructions that, when run on a computer or processor, cause the computer or processor to perform the steps of the method described above.

[0021] In a sixth aspect, embodiments of this specification provide a computer storage medium storing a plurality of instructions adapted to be loaded by a processor and to execute the steps of the method described above.

[0022] In a seventh aspect, embodiments of this specification provide a terminal including a memory, a processor, and a computer program stored in the memory and executable on the processor, the computer program being adapted to be loaded by the processor and to execute the steps of the method described above.

[0023] The beneficial effects of the technical solutions provided in some embodiments of this specification include at least the following:

[0024] This specification provides an image detection model training method. A first sample image is input into a first image detection model, and a second sample image is input into a second image detection model. The first and second sample images are either real or forged images of the same original image. Based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image, an adversarial loss is determined between the first and second image detection models. Based on this adversarial loss, both the first and second image detection models are trained. Since the first and second image detection models are trained synchronously using sample images from the same original image, they can not only train their own feature extraction capabilities but also compare them. In one iteration, they learn the differences between real and forged image features, enabling the models to output forged features that are significantly different from real image features when faced with forged images. This improves the model's ability to detect forged images more accurately.

Attached Figure Description

[0025] To more clearly illustrate the technical solutions in the embodiments or prior art of this specification, the drawings used in the description of the embodiments or prior art will be briefly introduced below. Obviously, the drawings described below are only some embodiments of this specification. For those skilled in the art, other drawings can be obtained based on these drawings without creative effort.

[0026] Figure 1 is an exemplary system architecture diagram of an image detection model training method provided in the embodiments of this specification;

[0027] Figure 2 is a flowchart illustrating an image detection model training method provided in an embodiment of this specification;

[0028] Figure 3 is a flowchart illustrating an image detection model training method provided in an embodiment of this specification;

[0029] Figure 4 is a schematic diagram of the algorithm flow of an image detection model provided in the embodiments of this specification;

[0030] Figure 5 is a schematic flowchart of an image detection method provided in an embodiment of this specification;

[0031] Figure 6 is a structural block diagram of an image detection model training device provided in the embodiments of this specification;

[0032] Figure 7 is a structural block diagram of an image detection device provided in an embodiment of this specification;

[0033] Figure 8 is a schematic diagram of the structure of a terminal provided in an embodiment of this specification.

Detailed Implementation

[0034] To make the features and advantages of the embodiments of this specification more apparent and understandable, the technical solutions of the embodiments of this specification will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of this specification, and not all embodiments. Based on the embodiments in this specification, all other embodiments obtained by those skilled in the art without creative effort are within the scope of protection of the embodiments of this specification.

[0035] In the following description, when referring to the accompanying drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The embodiments described in the following exemplary embodiments do not represent all embodiments consistent with those described in this specification. Rather, they are merely examples of apparatuses and methods consistent with some aspects of the embodiments described in this specification as detailed in the appended claims.

[0036] With the continuous development of facial recognition and image processing technologies in recent years, methods for face-swapping in images or videos using artificial intelligence (AI) have emerged. Leveraging the widespread reach of online platforms, AI face-swapping has become a popular entertainment activity, spawning numerous AI face-swapping apps and lowering the barrier to entry. However, in today's internet environment, users' facial information is often linked to a large amount of private information, such as identity information, platform account information, and payment information. Therefore, forged images created by altering or replacing images and videos using AI face-swapping technology pose a significant risk to user information security. Furthermore, AI face-swapping carries high public opinion risks for both facial recognition and cybersecurity systems. It has also led to some individuals using AI face-swapping technology for illicit profit. Therefore, accurately determining whether a face is fake through video and image technology has become a crucial aspect of security capabilities in various systems.

[0037] Currently, there are two main approaches to detecting face forgery in images. One approach is based on reverse inference of features. Since AI face-swapped images are themselves generated by deep convolutional neural networks, and the basic network structures of different methods are highly similar, the features of the corresponding AI face-swapping network can be inferred in reverse from the face-swapped images output by some known basic AI face-swapping networks. This allows the detection network to learn to detect face-swapping through this reverse inference. However, this approach adapts poorly to unknown and newly emerging AI face-swapping algorithms. In real-world scenarios, the accuracy of image detection drops significantly when faced with a large number of images from unknown AI face-swapping algorithms.

[0038] In addition to the above, there are face forgery detection methods based on local region texture analysis. Because AI face-swapping involves the fusion of features from two faces, and different faces have relatively obvious differences in facial shape, discrepancies often remain in the facial contour area after AI face-swapping. This method judges the authenticity of an image by detecting these discrepancies in the facial contour area, which is equivalent to using only local features for judgment. It does not consider changes in lighting, background, and other image variations in real-world scenarios, lacks global robustness, and is difficult to achieve universality across multiple scenarios.

[0039] Therefore, this specification provides an image detection model training method. By combining a real image and a fake image of the same original image, two image detection models are trained simultaneously. Based on the image features output by the two models, the adversarial loss between the two models is determined. The model is then trained based on the adversarial loss to learn the differences between real image features and fake image features, thereby solving the aforementioned technical problems of inaccurate authenticity detection and inaccurate image feature extraction.

[0040] Please refer to Figure 1, which is an exemplary system architecture diagram of an image detection model training method provided in the embodiments of this specification.

[0041] As shown in Figure 1, the system architecture may include a terminal 101, a network 102, and a server 103. The network 102 serves as the medium for providing a communication link between the terminal 101 and the server 103. The network 102 may include various types of wired or wireless communication links, such as wired communication links including fiber optic cables, twisted-pair cables, or coaxial cables, and wireless communication links including Bluetooth communication links, Wireless-Fidelity (Wi-Fi) communication links, or microwave communication links, etc.

[0042] Terminal 101 can interact with server 103 via network 102 to receive messages from or send messages to server 103. Alternatively, terminal 101 can interact with server 103 via network 102 to receive messages or data sent to server 103 by other users. Terminal 101 can be hardware or software. When terminal 101 is hardware, it can be various electronic devices, including but not limited to smartwatches, smartphones, tablets, laptops, and desktop computers. When terminal 101 is software, it can be installed in the aforementioned electronic devices and can be implemented as multiple software programs or software modules (e.g., to provide distributed services) or as a single software program or software module; no specific limitation is made here.

[0043] In the embodiments of this specification, in terminal 101, a first sample image is first input into a first image detection model, and a second sample image is first input into a second image detection model, wherein the first sample image and the second sample image are real images or forged images of the same original image; further, terminal 101 determines the adversarial loss between the first image detection model and the second image detection model based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image; finally, terminal 101 trains the first image detection model and the second image detection model based on the adversarial loss.

[0044] Server 103 can be a business server providing various services. It should be noted that server 103 can be either hardware or software. When server 103 is hardware, it can be implemented as a distributed server cluster consisting of multiple servers, or as a single server. When server 103 is software, it can be implemented as multiple software programs or software modules (e.g., used to provide distributed services), or as a single software program or software module; no specific limitations are made here.

[0045] Alternatively, the system architecture may not include server 103. In other words, server 103 may be an optional device in the embodiments of this specification. That is, the method provided in the embodiments of this specification can be applied to a system structure that only includes terminal 101. The embodiments of this specification do not limit this.

[0046] It should be understood that the number of terminals, networks, and servers shown in Figure 1 is only illustrative; there can be any number of terminals, networks, and servers depending on the implementation requirements.

[0047] Please refer to Figure 2, which is a flowchart illustrating an image detection model training method provided in an embodiment of this specification. The execution entity in this embodiment can be a terminal executing image detection model training, a processor within the terminal executing the image detection model training method, or an image detection model training service within the terminal executing the image detection model training method. For ease of description, the following example uses a processor within the terminal as the execution entity to illustrate the specific execution process of the image detection model training method.

[0048] As shown in Figure 2, the image detection model training method can include at least:

[0049] S202. Input the first sample image into the first image detection model and input the second sample image into the second image detection model. The first sample image and the second sample image are real images or fake images of the same original image.

[0050] Optionally, due to the widespread application of AI face-swapping technology, spoofed face images have become easier to obtain. This poses a security threat to systems that rely on real facial images to provide services and information, and also threatens users' personal information security. Therefore, it is necessary to accurately detect the authenticity of facial images to avoid losses caused by face spoofing attacks. With the development of machine learning technology, the method of training neural networks through deep learning to fit network models to complete target tasks has been widely used. Similarly, neural network models can also be used for target detection after training. However, current models are mainly trained on face-swapped images generated by known face-swapping algorithms to reverse-engineer known face-swapping algorithms, but they cannot accurately detect face spoofing images using unknown face-swapping algorithms. In real-world scenarios, it cannot be guaranteed that all face spoofing attack images come from known face-swapping algorithms, and face-swapping algorithms are updated and iterated rapidly. The applicability of models may gradually weaken, and the cost of maintenance and updates is high. At the same time, the accuracy of detecting spoofed images cannot meet security requirements.

[0051] Optionally, when performing image detection tasks, the image detection model first obtains the feature representation of the image, predicts the probability of the image belonging to each type based on the image features, and finally determines the most probable type of the image based on the probability corresponding to each type. Therefore, the model's feature extraction capability directly affects the accuracy of the image detection results. Thus, by improving the model's feature extraction capability for forged images, the model's ability to identify and detect forged face images can be improved.
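The three stages described above (feature representation, per-type probability, most probable type) can be sketched as follows. This is a minimal illustration, not the model of the specification: the linear scoring head, the softmax normalization, the three-dimensional feature vector, and the weight values are all assumptions chosen only to make the flow concrete.

```python
import math

def softmax(scores):
    """Convert raw class scores into probabilities that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def classify(features, weights, labels=("real", "fake")):
    """Score each type with a linear head, then pick the most probable type."""
    scores = [sum(w * f for w, f in zip(row, features)) for row in weights]
    probs = softmax(scores)
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# Hypothetical feature vector and hand-picked head weights (assumptions).
features = [0.9, -0.2, 0.4]
weights = [[1.0, 0.0, 0.5],    # row scoring the "real" type
           [-1.0, 0.5, 0.0]]   # row scoring the "fake" type
label, probs = classify(features, weights)
```

Because the predicted type depends entirely on the feature vector, any improvement in how well the features separate real from forged images carries directly into the final decision.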

[0052] Furthermore, considering common model training methods that typically input single sample forged images into the model, the model can only learn the feature patterns of the forged image itself during iterations, without being able to learn the features of the real image. Therefore, the model lacks accurate perception of the differences and invariants between the forged and real images, resulting in limited feature extraction capabilities and potentially inaccurate detection results. To improve the image detection model's ability to identify and detect forged face images, the real and forged image features of the same original image can be used as a comparison. Simultaneously, the loss function of the image detection model can be constructed based on these two features, allowing the model to learn the differences between them and enhancing its ability to distinguish between real and forged images.

[0053] Specifically, to acquire two image features simultaneously for comparison, two image detection models consisting of two symmetrical sets of feature extraction structures can be trained together on multiple sample image combinations, each made up of two images. Each sample image combination includes a first sample image and a second sample image, where the first and second sample images are either real or forged images of the same original image. During training, the first sample image is input into the first image detection model, and the second sample image is input into the second image detection model.

[0054] It is important to note that, to ensure the effectiveness of training, the initial parameters of the first and second image detection models are the same, and they are trained synchronously in each subsequent iteration. Furthermore, the type combination of the first and second sample images in the sample image combination is a random permutation of real and fake images. That is, the type combination of the first and second sample images may have four cases: "real + real", "real + fake", "fake + real", and "fake + fake". This diverse combination of sample images allows the image detection model to simultaneously learn the consistency between features of the same type of images and the differences between features of different types of images, thereby better identifying face spoofing attacks and having better robustness.
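The pairing scheme and the shared starting point can be sketched as below. The function name `make_pair`, the string placeholders standing in for images, the 1/0 label encoding, and the parameter dictionary are hypothetical; the specification does not state how pairs are sampled beyond a random permutation of real and fake.

```python
import copy
import random

def make_pair(real_image, fake_image, rng):
    """Draw one (first, second) sample pair derived from the same original image.

    Each slot is independently real or fake, which yields the four
    combinations "real + real", "real + fake", "fake + real", "fake + fake".
    Labels use 1 for real and 0 for fake (an assumed encoding).
    """
    first_is_real = rng.random() < 0.5
    second_is_real = rng.random() < 0.5
    first = (real_image if first_is_real else fake_image, int(first_is_real))
    second = (real_image if second_is_real else fake_image, int(second_is_real))
    return first, second

rng = random.Random(0)
pairs = [make_pair("orig_001_real", "orig_001_fake", rng) for _ in range(200)]
combos = {(first[1], second[1]) for first, second in pairs}

# Both models start from the same parameters; deepcopy keeps them as
# independent objects so each can be updated separately afterwards.
initial_params = {"w": [0.1, -0.3], "b": 0.0}
model_a_params = copy.deepcopy(initial_params)
model_b_params = copy.deepcopy(initial_params)
```

Over enough draws, `combos` covers all four type combinations, so the models see both same-type and different-type pairs.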

[0055] S204. Based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image, determine the adversarial loss between the first image detection model and the second image detection model.

[0056] Optionally, the first sample image and the second sample image are respectively input into the first image detection model and the second image detection model to obtain the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image. Based on the first image features and the second image features, the first image detection model and the second image detection model can be trained so that the two models can learn the feature differences between real images and fake images based on the consistency and differences between the two image features, thereby enhancing the accuracy of the model when extracting image features.

[0057] Furthermore, after obtaining the first and second image features, these features can be used to train two image detection models. Specifically, the image detection models need to be trained based on a loss function to learn the consistency between images of the same type and the differences between images of different types. Therefore, an adversarial loss function can be used for training. The adversarial loss function is a loss function that reflects the differences between the first and second image features. The distance between the first and second image features reflects their differences. Therefore, the adversarial loss value can be calculated by calculating the distance between the first and second image features. In other words, the adversarial loss between the first and second image detection models can be determined based on the first and second image features.
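One possible distance-based form of such a loss is sketched below. The paragraph above only states that the loss is computed from the distance between the two image features; the Euclidean distance, the contrastive-style margin, and the name `adversarial_loss` are assumptions made for illustration and are not taken from the specification.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two feature vectors of equal length."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def adversarial_loss(feat_1, feat_2, same_type, margin=1.0):
    """Distance-based loss between the two models' features.

    same_type=True  -> penalize distance (pull features together).
    same_type=False -> penalize only distances below the margin (push apart).
    The margin form is an assumption, not taken from the specification.
    """
    d = euclidean_distance(feat_1, feat_2)
    if same_type:
        return d ** 2
    return max(0.0, margin - d) ** 2

real_feat = [0.0, 0.0]
fake_feat = [0.3, 0.4]  # distance 0.5 from real_feat
loss_same = adversarial_loss(real_feat, fake_feat, same_type=True)
loss_diff = adversarial_loss(real_feat, fake_feat, same_type=False)
```

Under this assumed form, a "real + fake" pair whose features are already farther apart than the margin contributes no loss, while a same-type pair is penalized for any separation.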

[0058] Optionally, since AI face-swapping involves feature fusion between two faces, a model that judges only on the basis of local features of the facial contour lacks global robustness and cannot be applied to various complex scenarios. Therefore, in the embodiments of this specification, the global features of the image are extracted directly. That is, the first image feature is the global image feature of the first sample image, and the second image feature is the global image feature of the second sample image. Extracting image features in this way takes the overall relationships within the image into account and enhances the robustness of the image features. On this basis, an attention module can also be added to the image detection model. This allows the model, while extracting global features, to pay more attention to areas where real faces and face-swapped faces differ greatly, such as facial contours. Combining global and local features for analysis reduces the complexity of the final classification prediction.
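The specification does not describe the internal form of the attention module. As one hedged illustration of the idea, the sketch below pools per-region features into a single global vector using softmax attention weights, so that a region with a higher score (standing in for the facial contour) contributes more. The function `attention_pool`, the region features, and the scores are all hypothetical.

```python
import math

def attention_pool(region_features, region_scores):
    """Pool per-region features into one global vector using softmax weights."""
    peak = max(region_scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in region_scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(region_features[0])
    pooled = [sum(w * feat[i] for w, feat in zip(weights, region_features))
              for i in range(dim)]
    return pooled, weights

# Three hypothetical regions; the second stands in for the facial contour
# and receives the highest attention score.
regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
scores = [0.0, 2.0, 0.0]
pooled, weights = attention_pool(regions, scores)
```

Every region still contributes to the pooled vector, which keeps the feature global, while the weighting shifts emphasis toward the regions most likely to reveal a forgery.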

[0059] S206. Based on adversarial loss, train the first image detection model and the second image detection model.

[0060] Optionally, after obtaining the adversarial loss based on the first and second image features, the first and second image detection models are trained according to the adversarial loss. The relevant parameters in the first and second image detection models are adjusted so that the two models learn the differences between real and fake images based on the differences shown in the loss function during the iteration process, expand the feature extraction distance between real and fake images, and more clearly distinguish between real and fake images. This enables the model to output fake features that are significantly different from the features of real images when faced with fake images, thereby improving the model's ability to detect fake images and more accurately detecting fake images.

[0061] This specification provides an image detection model training method. A first sample image is input into a first image detection model, and a second sample image is input into a second image detection model. The first and second sample images are either real or forged images of the same original image. Based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image, an adversarial loss is determined between the first and second image detection models. The first and second image detection models are trained based on the adversarial loss. Since the first and second image detection models are trained synchronously using sample images of the same original image, they can not only train their own feature extraction capabilities but also compare them, learning the differences between real and forged image features in a single iteration. This allows the model to output forged features that are significantly different from real image features when faced with a forged image, thereby improving the model's ability to detect forged images more accurately.

[0062] Please refer to Figure 3, which is a flowchart illustrating an image detection model training method provided in an embodiment of this specification.

[0063] As shown in Figure 3, the image detection model training method can include at least:

[0064] S302. Input the first sample image into the first image detection model and input the second sample image into the second image detection model. The first sample image and the second sample image are real images or fake images of the same original image.

[0065] For details regarding step S302, please refer to the description in step S202; it will not be repeated here.

[0066] S304. Based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image, determine the first prediction loss of the first image detection model based on the first image features and the first sample image, and determine the second prediction loss of the second image detection model based on the second image features and the second sample image.

[0067] Optionally, for a single image detection model, the sample image features output for a sample image can be used for model training. Based on the predicted type obtained from the sample image features and the standard predicted type corresponding to the sample image, the loss value of the cross-entropy loss (softmax loss) function is calculated to constrain the difference between the predicted classification label and the standard classification label when the model performs image detection. For example, when the image detection model performs a binary classification task: detecting whether the image is a real image or a fake image, if the sample image is a real image and the corresponding classification label is (real, fake), the standard classification label is (1,0). However, if the image detection model outputs a predicted classification label of (0.7,0.3) for this sample image, it means that the model believes there is a 70% probability that it is a real image. There is a gap between this and the 100% probability of a real image. Therefore, the model will calculate the loss value based on the predicted classification label, the standard classification label, and the cross-entropy loss function to constrain the direction of model parameter tuning until the model converges.
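The example in the paragraph above can be worked through numerically. With the standard classification label (1, 0) and the predicted classification label (0.7, 0.3), the cross-entropy loss reduces to -ln(0.7), roughly 0.357. The helper name `cross_entropy` and the small `eps` clamp that guards against log(0) are assumptions of this sketch.

```python
import math

def cross_entropy(standard, predicted, eps=1e-12):
    """Cross-entropy between a one-hot standard label and predicted probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(standard, predicted))

standard_label = (1, 0)        # (real, fake) for a real sample image
predicted_label = (0.7, 0.3)   # model output from the paragraph above
loss = cross_entropy(standard_label, predicted_label)   # equals -ln(0.7)
perfect = cross_entropy(standard_label, (1.0, 0.0))     # no gap, so zero loss
```

The loss shrinks toward zero as the predicted probability of the correct type approaches 100%, which is the gap that parameter tuning is meant to close.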

[0068] Therefore, in the embodiments of this specification, in the calculation of adversarial loss, the first image detection model and the second image detection model can also calculate cross-entropy loss based on the output image features to constrain their own independent face forgery recognition capabilities. Please refer to... Figure 4 , Figure 4 This is a schematic diagram of the algorithm flow of an image detection model provided in an embodiment of this specification. Figure 4 As shown in (A), the first sample image is input into the first image detection model, and the second sample image is input into the second image detection model; the first image detection model outputs the first image feature, and the second image detection model outputs the second image feature; the adversarial loss is calculated based on the first image feature and the second image feature.

[0069] For details on calculating the adversarial loss, please refer to Figure 4(B). In Figure 4(B), a first prediction loss of the first image detection model is determined based on the first image features and the first sample image, and a second prediction loss of the second image detection model is determined based on the second image features and the second sample image.

[0070] Specifically, the task corresponding to the first prediction loss and the second prediction loss is a binary classification task to determine whether the input image is a "real image" or a "fake image". The calculation of the first prediction loss and the second prediction loss adopts the same cross-entropy loss calculation method. That is, for the first prediction loss, firstly, the first prediction value output by the first image detection model for the type of the first sample image is determined according to the first image features, and then the first prediction loss of the first image detection model is determined according to the first standard value and the first prediction value corresponding to the type of the first sample image. For the second prediction loss, firstly, the second prediction value output by the second image detection model for the type of the second sample image is determined according to the second image features, and then the second prediction loss of the second image detection model is determined according to the second standard value and the second prediction value corresponding to the type of the second sample image.

[0071] It should be noted that the prediction value output by the model for the image type can specifically be the predicted probability values of the current image being (real, fake). For example, when the first sample image is a real image, the first predicted value might be (0.8, 0.2), while the first standard value for the first sample image is (1, 0). In this way, the independent image detection capability of each model is trained from the perspective of the standard image labels: the first image detection model based on the first image features, and the second image detection model based on the second image features.

[0072] S306. Based on the first image features and the second image features, determine the contrast loss between the first image detection model and the second image detection model.

[0073] Optionally, as can be seen from the above embodiments, if the first sample image and the second sample image are real or fake images of the same original image, then the first image features and the second image features can be used as a comparison. When the first sample image and the second sample image are of the same type, such as both being real images or both being face-swapped images, the comparison between the first image features and the second image features allows the model to learn the consistency between images of the same type. When the first sample image and the second sample image are of different types, such as one being a real image and the other a face-swapped image, the comparison between the first image features and the second image features allows the model to learn the differences between images of different types. After training in this way, the difference between the real image features and the fake image features output by the model can be increased, making the prediction results obtained based on image features more accurate.

[0074] Optionally, as shown in Figure 4(B), in the adversarial loss, the contrast loss between the first image detection model and the second image detection model can be determined based on the first image features and the second image features. The specific value of the contrast loss is calculated based on the contrastive loss function, which is used to determine whether the paired features extracted by the group network model (i.e., the first image detection model and the second image detection model) are of the same type. In the calculation of the contrast loss, the distance between the first image features and the second image features is quantified by calculating the cosine similarity between them, that is, by calculating the similarity between the first image features and the second image features; the contrast loss between the first image detection model and the second image detection model is then determined based on this similarity. This enhances the difference between the features extracted by the model for different types of images.
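A minimal sketch of the similarity calculation described above is given below, assuming the image features are plain one-dimensional vectors. The function names and the definition of distance as one minus the similarity are assumptions made for illustration; the specification only states that cosine similarity is used to quantify the distance.

```python
import math


def cosine_similarity(first_features, second_features):
    """Cosine of the angle between two feature vectors (range -1 to 1)."""
    dot = sum(a * b for a, b in zip(first_features, second_features))
    norm_first = math.sqrt(sum(a * a for a in first_features))
    norm_second = math.sqrt(sum(b * b for b in second_features))
    return dot / (norm_first * norm_second)


def cosine_distance(first_features, second_features):
    """Assumed distance measure: 0 for identical directions, 2 for opposite ones."""
    return 1.0 - cosine_similarity(first_features, second_features)


# Toy features: parallel vectors are maximally similar, orthogonal ones are not.
same_direction = cosine_similarity([1.0, 2.0], [2.0, 4.0])
orthogonal = cosine_similarity([1.0, 0.0], [0.0, 1.0])
```

Because cosine similarity depends only on the direction of the vectors, it compares the two feature vectors independently of their magnitudes.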

[0075] S308. Based on the first prediction loss, the second prediction loss, and the contrast loss, determine the adversarial loss between the first image detection model and the second image detection model.

[0076] Optionally, based on the first image features and the second image features, a first prediction loss of the first image detection model is obtained, and a second prediction loss of the second image detection model is obtained. Simultaneously, based on the first image features and the second image features, it is determined whether the paired features are of the same type. The feature extraction distance of the group network model (i.e., the first image detection model and the second image detection model) for different types of images is increased according to the cosine similarity of the first image features and the second image features. This makes the image features output by the model compact among images of the same type and dispersed among images of different types, thereby obtaining more accurate detection results.

[0077] S3010. The first image features and the second image features are fused to obtain fused image features. The classification loss between the first image detection model and the second image detection model is determined based on the fused image features, the first sample image, and the second sample image.

[0078] Optionally, in addition to training the model to learn the differences between image features of real and fake images through adversarial loss, the model can be further enhanced to learn the differences between real and fake images based on the combination type of paired sample images. For details, please refer to Figure 4(C). In Figure 4(C), the first image features and the second image features are fused to obtain fused image features. Based on the fused image features, the combination type prediction value output by the first image detection model and the second image detection model for the sample image combination composed of the first sample image and the second sample image is determined. Based on the combination type standard value and the combination type prediction value of the sample image combination, the classification loss between the first image detection model and the second image detection model is determined.

[0079] Optionally, when calculating the classification loss, the model's predicted value is compared with the true value. In this case, the cross-entropy loss function can also be used for specific calculation. The task is to determine the type of the combination of sample images as "the first sample image is a real image and the second sample image is a real image", "the first sample image is a real image and the second sample image is a fake image", "the first sample image is a fake image and the second sample image is a real image", or "the first sample image is a fake image and the second sample image is a fake image". Based on the gap between the predicted classification and the true standard classification, the model can further learn the differences between the image features of different types of images.

[0080] For example, with labels in the order (real image, fake image), the standard value for a real image is (1,0), and the standard value for a fake image is (0,1). Then, when the first sample image in the sample image combination is a real image and the second sample image is a fake image, the combination type standard value of the sample image combination is ((1,0), (0,1)). Based on the fused image features obtained by fusing and concatenating the first image features and the second image features, the model may predict a combination type prediction value of, for example, ((0.7,0.3), (0.2,0.8)) for the sample image combination. The combination type standard value and the combination type prediction value can then be used to calculate the classification loss value between the model's predicted classification and the standard classification.
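The example above can be worked through numerically as follows. This is a sketch under stated assumptions: fusion is modeled as simple concatenation, and the classification loss is taken as the sum of the cross-entropies of the two (real, fake) slots. The function names `fuse` and `combination_loss` are hypothetical, and the specification does not fix the exact form of the calculation.

```python
import math


def fuse(first_features, second_features):
    """Fuse two feature vectors by concatenation (splicing)."""
    return list(first_features) + list(second_features)


def combination_loss(standard, predicted):
    """Sum of cross-entropies over the two (real, fake) slots of a sample pair."""
    total = 0.0
    for standard_slot, predicted_slot in zip(standard, predicted):
        for t, p in zip(standard_slot, predicted_slot):
            if t > 0:  # only the true class contributes for one-hot labels
                total -= t * math.log(p)
    return total


fused = fuse([0.1, 0.9], [0.8, 0.2])       # toy feature vectors
standard = ((1, 0), (0, 1))                # first image real, second image fake
predicted = ((0.7, 0.3), (0.2, 0.8))       # values from the example above

loss = combination_loss(standard, predicted)
print(round(loss, 4))  # 0.5798, i.e. -ln(0.7) - ln(0.8)
```

Under this assumed form, the result equals -ln(0.7 × 0.8), which is the cross-entropy of a four-class task in which the "real, fake" combination receives probability 0.56.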

[0081] Optionally, in order to further increase the inter-class distance between real image features and fake image features, a center loss term can be added to the loss function on top of the cross-entropy loss. The center loss reduces the distance between image features output for images of the same type, making the image features output by the model more compact among images of the same type. This in turn makes the features of real images and fake images easier to separate and better constrains the model training process.
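The following sketch illustrates one common form of center loss, namely half the mean squared distance between each feature vector and the center of its class. The fixed `centers` dictionary and the averaging over the batch are assumptions made for illustration; in practice the centers are typically learned or updated during training, and the specification does not prescribe a particular formula.

```python
def center_loss(features, labels, centers):
    """Half the mean squared distance from each feature to its class center."""
    total = 0.0
    for feature, label in zip(features, labels):
        center = centers[label]
        total += sum((f - c) ** 2 for f, c in zip(feature, center))
    return 0.5 * total / len(features)


# Hypothetical class centers for the two image types.
centers = {"real": [1.0, 0.0], "fake": [0.0, 0.0]}

features = [[1.0, 0.0], [0.0, 1.0]]   # toy features output by the model
labels = ["real", "fake"]

loss = center_loss(features, labels, centers)
print(loss)  # 0.25
```

The first feature sits exactly on its center and contributes nothing; the second is at squared distance 1 from the "fake" center, giving 0.5 × 1 / 2 = 0.25.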

[0082] S3012. Based on adversarial loss and classification loss, train the first image detection model and the second image detection model.

[0083] Optionally, a first image detection model and a second image detection model are trained based on adversarial loss and classification loss. The prediction loss in the adversarial loss and classification loss is calculated using cross-entropy loss. Therefore, the training of the model is based on the difference between the predicted label and the real label to constrain the model. The first prediction loss and the second prediction loss correspond to the binary classification task where the sample image type is "real image" or "fake image", thus training the model's independent recognition ability. The classification loss corresponds to the four-class classification task where the combination of paired sample images is of type "first sample image is real image, second sample image is real image", "first sample image is real image, second sample image is fake image", "first sample image is fake image, second sample image is real image", or "first sample image is fake image, second sample image is fake image", which is equivalent to an enhancement method for the model to learn the difference features between real images and fake images.

[0084] The contrastive loss component of the adversarial loss is used for binary classification tasks that determine whether paired image combinations are of the same or different types. During training, when paired images are determined to be of the same type, the model parameters are adjusted to minimize the distance between image features of the same type. When paired images are determined to be of different types, the model parameters are adjusted based on the cosine distance between the features of the paired images. A distance threshold can be set. If the distance between the features of the paired images is greater than the threshold, the model is considered to have met the performance requirements for distinguishing between different types of features, and no further optimization is needed to save computational resources and time. If the distance between the features of the paired images is not greater than the threshold, the model parameters are adjusted to increase the distance between them to the threshold. This adjustment of model parameters helps to broaden the inter-class distance distribution of the model, increasing the accuracy of subsequent detection results.
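The threshold behavior described above can be sketched as a hinge-style function of the feature distance. The linear form of the penalty, the function name `contrast_loss`, and the default threshold of 0.5 are assumptions introduced for illustration only.

```python
def contrast_loss(distance, same_type, threshold=0.5):
    """Contrast loss for one pair, given the distance between its features.

    Same-type pairs are penalized by their distance, so training pulls them
    together. Different-type pairs are penalized only while their distance is
    not greater than the threshold, so training pushes them apart up to it.
    """
    if same_type:
        return distance
    return max(0.0, threshold - distance)


same_pair = contrast_loss(0.2, same_type=True)         # 0.2: still to be reduced
far_pair = contrast_loss(0.8, same_type=False)         # 0.0: already beyond threshold
near_pair = contrast_loss(0.3, same_type=False)        # about 0.2: pushed apart
```

A pair of different types whose distance already exceeds the threshold yields a zero loss and therefore no parameter update, which corresponds to the saving of computational resources mentioned above.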

[0085] Optionally, when the adversarial loss and the classification loss are used for model training, the model is adjusted once per iteration using all loss values. In the specific calculation, the loss values can be added directly or added according to different weight ratios. How the loss values are combined can be set according to actual needs based on the application requirements of the model; the embodiments of this specification do not impose specific limitations on this.
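A minimal sketch of combining the loss values is shown below. Summing the two prediction losses and the contrast loss to form the adversarial loss is an assumption (the specification states only that the adversarial loss is determined based on these three values), and the weight parameters `w_adv` and `w_cls` are hypothetical names.

```python
def adversarial_loss(first_prediction_loss, second_prediction_loss, contrast_loss):
    """Assumed combination: plain sum of the three component losses."""
    return first_prediction_loss + second_prediction_loss + contrast_loss


def total_loss(adversarial, classification, w_adv=1.0, w_cls=1.0):
    """Direct addition when both weights are 1.0, weighted addition otherwise."""
    return w_adv * adversarial + w_cls * classification


adv = adversarial_loss(0.36, 0.22, 0.10)                # toy loss values
direct = total_loss(adv, 0.58)                          # added directly
weighted = total_loss(adv, 0.58, w_adv=1.0, w_cls=0.5)  # added by weight ratio
```

In either case a single scalar is obtained per iteration, so all loss values contribute to one parameter adjustment.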

[0086] In the embodiments of this specification, an image detection model training method is provided. A first prediction loss is calculated for a first image detection model using first image features, and a second prediction loss is calculated for a second image detection model using second image features, to train the independent face forgery recognition capabilities of the first and second image detection models respectively. A contrast loss is calculated using the first and second image features to train paired image detection models, based on the differences between different types of features and the consistency between similar features, so that the image features output by the paired image detection models are compactly distributed among images of the same class and dispersed among images of different classes. Furthermore, the first and second image features are fused, and a classification loss is obtained based on the fused image features to strengthen the model's learning of the differences between real and forged images. Through the design of these three losses, the model not only trains independent image detection and recognition capabilities but also compares them to strengthen the inter-class distance of feature extraction for different classes, making the real image features and forged image features output by the model significantly different, thereby improving the model's ability to detect forged images more accurately.

[0087] Please refer to Figure 5, which is a schematic flowchart of an image detection method provided in an embodiment of this specification.

[0088] As shown in Figure 5, an image detection method may include at least:

[0089] S502. Obtain the image to be detected and input the image to be detected into the image detection model;

[0090] Optionally, in any of the above embodiments, only one of the network models, namely the converged first image detection model or the second image detection model, needs to be used in a real-world scenario. In this case, the converged image detection model increases the inter-class distance and reduces the intra-class distance when extracting features of different classes by learning the differences between real image features and fake image features. Therefore, after obtaining the image to be detected in the scene, inputting the image to be detected into the image detection model can determine a more accurate detection result based on the image detection model.

[0091] Optionally, in a given scenario, the image detection model can also be used for face forgery recognition in a video to be detected. Specifically, image detection can be performed on each frame of the video to be detected. Therefore, the specific form of the target to be detected is not limited in the embodiments of this specification. Furthermore, the practical application of the image detection model is not limited to face forgery images; the model can also be trained using corresponding sample images of other types of synthetic forgery images to obtain image detection models applicable to the detection of those types of forgery images. This specification does not limit this aspect.
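Per-frame detection of a video can be sketched as follows. The callable `detect_frame` is a hypothetical stand-in for the trained image detection model, and reporting the proportion of frames judged to be fake is an assumption; the specification does not state how per-frame results are aggregated.

```python
def detect_video(frames, detect_frame):
    """Run a per-frame detector over a video and summarize the results.

    `detect_frame` is a hypothetical callable mapping one frame to the string
    "real" or "fake"; it stands in for the converged image detection model.
    """
    labels = [detect_frame(frame) for frame in frames]
    fake_count = sum(1 for label in labels if label == "fake")
    fake_ratio = fake_count / len(labels) if labels else 0.0
    return labels, fake_ratio


def toy_detector(frame):
    """Toy stand-in: frames are plain numbers, values above 0.5 count as fake."""
    return "fake" if frame > 0.5 else "real"


labels, fake_ratio = detect_video([0.1, 0.9, 0.7, 0.2], toy_detector)
```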

[0092] S504. Based on the output data of the image detection model, determine whether the image to be detected is a real image or a fake image.

[0093] Optionally, in the image detection model, image features of the image to be detected are obtained. There are obvious differences between the real image features extracted by the image detection model and the fake image features. Therefore, the image detection model is more likely to obtain accurate image detection and classification results based on the image features. Based on this, it can be directly determined whether the image to be detected is a real image or a fake image.
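As a sketch of step S504, the decision can be made by comparing the two output probabilities. This assumes that the output data of the model is a (real, fake) probability pair as in the training examples above; the function name `classify` and the rule that a tie is reported as "real" are illustrative assumptions.

```python
def classify(output_probs):
    """Map a (p_real, p_fake) pair to the label with the larger probability."""
    p_real, p_fake = output_probs
    return "real" if p_real >= p_fake else "fake"


first_result = classify((0.8, 0.2))
second_result = classify((0.3, 0.7))
```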

[0094] This specification provides an image detection method that involves acquiring an image to be detected and inputting it into an image detection model. Based on the output data of the image detection model, the method determines whether the image to be detected is a real image or a forged image. The image detection model is either the first image detection model or the second image detection model described in any of the above embodiments. Because the features of a real image and the features of a forged image output by the converged image detection model are different, the two types of features output by the image detection model are easier to distinguish and classify. Therefore, the image detection model can obtain a more accurate image detection classification result, improving the user experience and protecting information security in various scenarios.

[0095] Please refer to Figure 6, which is a structural block diagram of an image detection model training device provided in an embodiment of this specification. As shown in Figure 6, the image detection model training device 600 includes:

[0096] The sample input module 610 is used to input a first sample image into a first image detection model and to input a second sample image into a second image detection model. The first sample image and the second sample image are real images or fake images of the same original image.

[0097] The loss calculation module 620 is used to determine the adversarial loss between the first image detection model and the second image detection model based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image.

[0098] Model training module 630 is used to train a first image detection model and a second image detection model based on adversarial loss.

[0099] Optionally, the loss calculation module 620 is further configured to determine a first prediction loss of the first image detection model based on the first image features and the first sample image, and to determine a second prediction loss of the second image detection model based on the second image features and the second sample image; to determine a contrast loss between the first image detection model and the second image detection model based on the first image features and the second image features; and to determine an adversarial loss between the first image detection model and the second image detection model based on the first prediction loss, the second prediction loss, and the contrast loss.

[0100] Optionally, the loss calculation module 620 is further configured to determine a first predicted value output by the first image detection model for the type of the first sample image based on the first image features, and to determine a first prediction loss of the first image detection model based on the first standard value and the first predicted value corresponding to the type of the first sample image; and to determine a second predicted value output by the second image detection model for the type of the second sample image based on the second image features, and to determine a second prediction loss of the second image detection model based on the second standard value and the second predicted value corresponding to the type of the second sample image.

[0101] Optionally, the loss calculation module 620 is also used to calculate the similarity between the first image features and the second image features; and to determine the contrast loss between the first image detection model and the second image detection model based on the similarity.

[0102] Optionally, the image detection model training device 600 further includes: a classification loss module, used to fuse the first image features and the second image features to obtain fused image features, determine the classification loss between the first image detection model and the second image detection model based on the fused image features, the first sample image and the second sample image; and train the first image detection model and the second image detection model based on the adversarial loss and the classification loss.

[0103] Optionally, the classification loss module is further configured to determine the combination type prediction value output by the first image detection model and the second image detection model for the sample image combination composed of the first sample image and the second sample image based on the fused image features; and to determine the classification loss between the first image detection model and the second image detection model based on the combination type standard value and the combination type prediction value of the sample image combination.

[0104] Optionally, the first image feature is the global image feature of the first sample image, and the second image feature is the global image feature of the second sample image.

[0105] In this embodiment, an image detection model training apparatus is provided, comprising: a sample input module for inputting a first sample image into a first image detection model and a second sample image into a second image detection model, wherein the first and second sample images are real or forged images of the same original image; a loss calculation module for determining the adversarial loss between the first and second image detection models based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image; and a model training module for training the first and second image detection models based on the adversarial loss. Since the first and second image detection models are trained synchronously using sample images of the same original image, they can not only train their own feature extraction capabilities separately but also compare themselves, learning the differences between real and forged image features in a single iteration. This allows the models to output forged features that are significantly different from real image features when faced with forged images, thereby improving the model's ability to detect forged images and more accurately detecting them.

[0106] Please refer to Figure 7, which is a structural block diagram of an image detection device provided in an embodiment of this specification. As shown in Figure 7, the image detection device 700 includes:

[0107] The image input module 710 is used to acquire the image to be detected and input the image to be detected into the image detection model.

[0108] The authenticity detection module 720 is used to determine whether the image to be detected is a real image or a fake image based on the output data of the image detection model.

[0109] The image detection model is either the first image detection model or the second image detection model in any of the embodiments described above.

[0110] In this embodiment, an image detection device is provided, comprising an image input module for acquiring an image to be detected and inputting the image to be detected into an image detection model; and an authenticity detection module for determining whether the image to be detected is a real image or a forged image based on the output data of the image detection model. The image detection model is either the first image detection model or the second image detection model in any of the above embodiments. Because the features of a real image and the features of a forged image output by the converged image detection model are different, the two types of features output by the image detection model are easier to distinguish and classify. Therefore, the image detection model can obtain a more accurate image detection classification result, improving the user experience and protecting information security in multiple scenarios.

[0111] This specification provides a computer program product containing instructions that, when run on a computer or processor, cause the computer or processor to perform the steps of any of the methods described above.

[0112] This specification also provides a computer storage medium that can store multiple instructions adapted for loading by a processor and executing the steps of any of the methods described in the above embodiments.

[0113] Please refer to Figure 8, which is a schematic diagram of the structure of a terminal provided in an embodiment of this specification. As shown in Figure 8, the terminal 800 may include: at least one terminal processor 801, at least one network interface 804, a user interface 803, a memory 805, and at least one communication bus 802.

[0114] The communication bus 802 is used to enable communication between these components.

[0115] The user interface 803 may include a display screen and a camera. Optionally, the user interface 803 may also include a standard wired interface and a wireless interface.

[0116] The network interface 804 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface).

[0117] The terminal processor 801 may include one or more processing cores. The terminal processor 801 connects to various parts within the terminal 800 using various interfaces and lines. It executes various functions and processes data by running or executing instructions, programs, code sets, or instruction sets stored in the memory 805, and by calling data stored in the memory 805. Optionally, the terminal processor 801 may be implemented using at least one of the following hardware forms: Digital Signal Processing (DSP), Field-Programmable Gate Array (FPGA), and Programmable Logic Array (PLA). The terminal processor 801 may integrate one or more of the following: Central Processing Unit (CPU), Graphics Processing Unit (GPU), and modem. The CPU primarily handles the operating system, user interface, and applications; the GPU is responsible for rendering and drawing the content to be displayed on the screen; and the modem handles wireless communication. It is understood that the modem may also be implemented as a separate chip without being integrated into the terminal processor 801.

[0118] The memory 805 may include random access memory (RAM) or read-only memory (ROM). Optionally, the memory 805 may include a non-transitory computer-readable storage medium. The memory 805 can be used to store instructions, programs, code, code sets, or instruction sets. The memory 805 may include a program storage area and a data storage area, wherein the program storage area may store instructions for implementing an operating system, instructions for at least one function (such as touch function, sound playback function, image playback function, etc.), instructions for implementing the above-described method embodiments, etc.; the data storage area may store data involved in the above-described method embodiments, etc. Optionally, the memory 805 may also be at least one storage device located remotely from the aforementioned terminal processor 801. As shown in Figure 8, the memory 805, which serves as a computer storage medium, may include an operating system, a network communication module, a user interface module, and an image detection model training and image detection program.

[0119] In the terminal 800 shown in Figure 8, the user interface 803 is mainly used to provide an input interface for the user and to obtain the user's input data; while the terminal processor 801 can be used to call the image detection model training program stored in the memory 805 and specifically perform the following operations:

[0120] The first sample image is input into the first image detection model, and the second sample image is input into the second image detection model. The first sample image and the second sample image are real images or fake images of the same original image.

[0121] Based on the first image features output by the first image detection model for the first sample image and the second image features output by the second image detection model for the second sample image, determine the adversarial loss between the first image detection model and the second image detection model;

[0122] The first image detection model and the second image detection model are trained based on adversarial loss.

[0123] In some embodiments, when the terminal processor 801 performs the process of determining the adversarial loss between the first image detection model and the second image detection model, it specifically performs the following steps: determining a first prediction loss of the first image detection model based on the first image features and the first sample image, and determining a second prediction loss of the second image detection model based on the second image features and the second sample image; determining a contrast loss between the first image detection model and the second image detection model based on the first image features and the second image features; and determining an adversarial loss between the first image detection model and the second image detection model based on the first prediction loss, the second prediction loss, and the contrast loss.

[0124] In some embodiments, when determining the first prediction loss of the first image detection model based on the first image features and the first sample image, and determining the second prediction loss of the second image detection model based on the second image features and the second sample image, the terminal processor 801 specifically performs the following steps: determining the first predicted value output by the first image detection model for the type of the first sample image based on the first image features, and determining the first prediction loss of the first image detection model based on the first standard value and the first predicted value corresponding to the type of the first sample image; determining the second predicted value output by the second image detection model for the type of the second sample image based on the second image features, and determining the second prediction loss of the second image detection model based on the second standard value and the second predicted value corresponding to the type of the second sample image.

[0125] In some embodiments, when determining the contrast loss between the first image detection model and the second image detection model based on the first image features and the second image features, the terminal processor 801 specifically performs the following steps: calculating the similarity between the first image features and the second image features; and determining the contrast loss between the first image detection model and the second image detection model based on the similarity.

[0126] In some embodiments, after determining the adversarial loss between the first image detection model and the second image detection model, the terminal processor 801 further performs the following steps: fusing the first image features and the second image features to obtain fused image features, and determining the classification loss between the first image detection model and the second image detection model based on the fused image features, the first sample image, and the second sample image; when training the first image detection model and the second image detection model based on the adversarial loss, the terminal processor 801 further performs the following steps: training the first image detection model and the second image detection model based on the adversarial loss and the classification loss.

[0127] In some embodiments, when the terminal processor 801 determines the classification loss between the first image detection model and the second image detection model based on the fused image features, the first sample image, and the second sample image, it specifically performs the following steps: determining the combination type prediction value output by the first image detection model and the second image detection model for the sample image combination composed of the first sample image and the second sample image based on the fused image features; and determining the classification loss between the first image detection model and the second image detection model based on the combination type standard value and the combination type prediction value of the sample image combination.

[0128] In some embodiments, the first image feature is the global image feature of the first sample image, and the second image feature is the global image feature of the second sample image.
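As a non-limiting illustration, one common way to reduce a spatial feature map to a single global descriptor is global average pooling. The specification does not state how the global image feature is obtained, so the pooling operation, the H x W x C layout, and the function name below are assumptions of this sketch.

```python
def global_average_pool(feature_map):
    """Average an H x W x C feature map (nested lists) into a length-C vector."""
    height = len(feature_map)
    width = len(feature_map[0])
    channels = len(feature_map[0][0])
    totals = [0.0] * channels
    for row in feature_map:
        for cell in row:
            for c, value in enumerate(cell):
                totals[c] += value
    # Each channel is averaged over all spatial positions.
    return [t / (height * width) for t in totals]
```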

[0129] Furthermore, in the terminal 800 shown in Figure 8, the user interface 803 is mainly used to provide an input interface for the user and to acquire user input data, while the terminal processor 801 can also be used to call the image detection program stored in the memory 805 and specifically perform the following operations:

[0130] Acquire the image to be detected and input it into the image detection model;

[0131] Based on the output data of the image detection model, it is determined whether the image to be detected is a real image or a fake image; wherein, the image detection model is either the first image detection model or the second image detection model in any of the embodiments of the above specification.
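As a non-limiting illustration, the detection step could be sketched as follows. The form of the model output is not specified here, so this sketch assumes that the trained model is a callable returning a score in [0, 1] that is higher for fake images; the `threshold` parameter and the function name are likewise hypothetical.

```python
from typing import Callable, Sequence


def detect_image(model: Callable[[Sequence[float]], float],
                 image: Sequence[float],
                 threshold: float = 0.5) -> str:
    """Classify an image as "real" or "fake" from the model's output score.

    model: either trained image detection model, wrapped as a callable
    that maps an image to a fake score (an assumption of this sketch).
    """
    fake_score = model(image)
    return "fake" if fake_score >= threshold else "real"
```

Because either of the two trained models may be used for detection, the model is passed in as a parameter rather than fixed inside the function.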

[0132] In the several embodiments provided in this specification, it should be understood that the disclosed apparatus and methods can be implemented in other ways. The apparatus embodiments described above are merely illustrative: the division of modules is only a logical functional division, and other divisions are possible in actual implementation. For example, multiple modules or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or modules, and may be electrical, mechanical, or of another form.

[0133] The modules described as separate components may or may not be physically separate. Similarly, the components shown as modules may or may not be physical modules; they may be located in one place or distributed across multiple network modules. Some or all of the modules can be selected to achieve the purpose of this embodiment, depending on actual needs.

[0134] In the above embodiments, implementation can be achieved, in whole or in part, through software, hardware, firmware, or any combination thereof. When implemented in software, it can be implemented, in whole or in part, as a computer program product. The computer program product includes one or more computer instructions. When these computer program instructions are loaded and executed on a computer, all or part of the processes or functions described in the embodiments of this specification are generated. The computer can be a general-purpose computer, a special-purpose computer, a computer network, or other programmable device. The computer instructions can be stored in or transmitted through a computer-readable storage medium. The computer instructions can be transmitted from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium accessible to a computer, or a data storage device such as a server or data center that integrates one or more available media. The aforementioned available media can be magnetic media (e.g., floppy disks, hard disks, magnetic tapes), optical media (e.g., Digital Versatile Discs (DVDs)), or semiconductor media (e.g., Solid State Disks (SSDs)).

[0135] It should be noted that, for the sake of simplicity, the foregoing method embodiments are all described as a series of actions. However, those skilled in the art should understand that the embodiments in this specification are not limited to the described order of actions, because according to the embodiments in this specification, some steps can be performed in other orders or simultaneously. Furthermore, those skilled in the art should also understand that the embodiments described in this specification are all preferred embodiments, and the actions and modules involved are not necessarily essential to the embodiments in this specification.

[0136] In the above embodiments, the descriptions of each embodiment have different focuses. For parts not described in detail in a certain embodiment, please refer to the relevant descriptions of other embodiments.

[0137] The above describes the image detection model training and image detection method, apparatus, and storage medium provided in the embodiments of this specification. Those skilled in the art may, based on the ideas of the embodiments of this specification, make changes to the specific implementation and application scope. Therefore, the content of this specification should not be construed as a limitation on the embodiments of this specification.

Claims

1. A method for training an image detection model, the method comprising: inputting a first sample image into a first image detection model, and inputting a second sample image into a second image detection model, wherein the first sample image and the second sample image are real images or fake images of the same original image; determining an adversarial loss between the first image detection model and the second image detection model based on first image features output by the first image detection model for the first sample image and second image features output by the second image detection model for the second sample image; determining a classification loss between the first image detection model and the second image detection model; and training the first image detection model and the second image detection model based on the adversarial loss and the classification loss; wherein determining the adversarial loss between the first image detection model and the second image detection model comprises: determining a first prediction loss of the first image detection model based on the first image features and the first sample image, and determining a second prediction loss of the second image detection model based on the second image features and the second sample image; determining a contrast loss between the first image detection model and the second image detection model based on the first image features and the second image features; and determining the adversarial loss between the first image detection model and the second image detection model based on the first prediction loss, the second prediction loss, and the contrast loss.

2. The method according to claim 1, wherein determining the first prediction loss of the first image detection model based on the first image features and the first sample image, and determining the second prediction loss of the second image detection model based on the second image features and the second sample image, comprises: determining, based on the first image features, a first predicted value output by the first image detection model for the type of the first sample image, and determining the first prediction loss of the first image detection model based on a first standard value corresponding to the type of the first sample image and the first predicted value; and determining, based on the second image features, a second predicted value output by the second image detection model for the type of the second sample image, and determining the second prediction loss of the second image detection model based on a second standard value corresponding to the type of the second sample image and the second predicted value.

3. The method according to claim 1, wherein determining the contrast loss between the first image detection model and the second image detection model based on the first image features and the second image features comprises: calculating a similarity between the first image features and the second image features; and determining the contrast loss between the first image detection model and the second image detection model based on the similarity.

4. The method according to claim 1, wherein determining the classification loss between the first image detection model and the second image detection model comprises: fusing the first image features and the second image features to obtain fused image features; and determining the classification loss between the first image detection model and the second image detection model based on the fused image features, the first sample image, and the second sample image.

5. The method according to claim 4, wherein determining the classification loss between the first image detection model and the second image detection model based on the fused image features, the first sample image, and the second sample image comprises: determining, based on the fused image features, a combination type prediction value output by the first image detection model and the second image detection model for the sample image combination composed of the first sample image and the second sample image; and determining the classification loss between the first image detection model and the second image detection model based on a combination type standard value of the sample image combination and the combination type prediction value.

6. The method according to claim 1, wherein the first image feature is a global image feature of the first sample image, and the second image feature is a global image feature of the second sample image.

7. An image detection method, the method comprising: obtaining an image to be detected and inputting the image to be detected into an image detection model; and determining, based on output data of the image detection model, whether the image to be detected is a real image or a fake image; wherein the image detection model is either the first image detection model or the second image detection model as described in any one of claims 1 to 6.

8. An image detection model training device, the device comprising: a sample input module, configured to input a first sample image into a first image detection model and a second sample image into a second image detection model, wherein the first sample image and the second sample image are real images or fake images of the same original image; a loss calculation module, configured to determine an adversarial loss between the first image detection model and the second image detection model based on first image features output by the first image detection model for the first sample image and second image features output by the second image detection model for the second sample image; a classification loss module, configured to determine a classification loss between the first image detection model and the second image detection model; and a model training module, configured to train the first image detection model and the second image detection model based on the adversarial loss and the classification loss; wherein the loss calculation module is further configured to: determine a first prediction loss of the first image detection model based on the first image features and the first sample image, and determine a second prediction loss of the second image detection model based on the second image features and the second sample image; determine a contrast loss between the first image detection model and the second image detection model based on the first image features and the second image features; and determine the adversarial loss between the first image detection model and the second image detection model based on the first prediction loss, the second prediction loss, and the contrast loss.

9. An image detection apparatus, the apparatus comprising: an image input module, configured to acquire an image to be detected and input the image to be detected into an image detection model; and an authenticity detection module, configured to determine whether the image to be detected is a real image or a fake image based on output data of the image detection model; wherein the image detection model is either the first image detection model or the second image detection model as described in any one of claims 1 to 6.

10. A computer program product comprising instructions that, when run on a computer or processor, cause the computer or processor to perform the steps of the method as claimed in any one of claims 1 to 6 or 7.

11. A computer storage medium storing a plurality of instructions adapted for loading by a processor and performing the steps of the method as claimed in any one of claims 1 to 6 or 7.

12. A terminal comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the steps of the method as claimed in any one of claims 1 to 6 or 7.