A deep fake face picture detection method based on a double-flow CNN and ViT hybrid model

By using a hybrid model of dual-stream CNN and ViT, combined with improved high-frequency noise residual feature extraction and adaptive frequency domain filter, we achieved efficient detection of deepfake face images, solved the problems of feature fragmentation and insufficient cross-modal interaction, and improved detection performance.

CN120833637B · Active · Publication Date: 2026-06-26 · YUNNAN NORMAL UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
YUNNAN NORMAL UNIV
Filing Date
2025-08-19
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing deepfake detection technologies suffer from problems such as feature fragmentation, local-global imbalance, and insufficient cross-modal interaction, making it difficult to effectively detect deepfake face images.

Method used

A method based on a hybrid model of dual-stream CNN and ViT is adopted. Through an improved high-frequency noise residual feature extraction module, multi-scale feature fusion and bidirectional feature transfer mechanism, combined with an adaptive frequency domain filter and cross-attention module, the high-frequency noise and RGB features are decoupled and enhanced, and local details and global semantics are captured in a coordinated manner.

Benefits of technology

It enhances the detection performance of deepfake face images, improves the robustness and accuracy of the model, and can more effectively identify forgery traces.

✦ Generated by Eureka AI based on patent content.


Abstract

The application relates to the technical field of computer vision and pattern recognition, and particularly relates to a deep fake face picture detection method based on a double-flow CNN and ViT hybrid model. The method comprises the following steps: pre-processing real videos and fake videos to obtain original face images; obtaining high-frequency noise residual feature images of the original face images based on an improved high-frequency noise residual feature extraction module; performing first feature enhancement operations on the original face images and the high-frequency noise residual feature images based on a double-flow CNN; performing second feature enhancement operations on the original face images and the high-frequency noise residual feature images subjected to the first feature enhancement operations based on a double-flow ViT; fusing the features of the original face images and the high-frequency noise residual feature images subjected to the second feature enhancement operations, performing true-false prediction on the fused features, and outputting detection results. The method aims to solve the technical problems of feature fragmentation, local-global imbalance and insufficient cross-modal interaction in the prior art.

Description

Technical Field

[0001] This invention relates to the field of computer vision and pattern recognition technology, specifically to a method for detecting deepfake face images based on a hybrid model of dual-stream CNN and ViT.

Background Technology

[0002] With groundbreaking advances in Generative Adversarial Networks (GANs), deepfake technology has developed rapidly in the field of image and video generation. Based on GANs and their derivative architectures (such as the StyleGAN series and CycleGAN), deepfake technology has formed a technical system centered on identity replacement, facial reenactment, attribute editing, and full-face synthesis. Identity replacement achieves cross-identity facial feature replacement through 3D face modeling, feature transfer, and texture mapping. For example, the classic algorithm FaceSwap aligns facial features using keypoint localization and 3D reconstruction, while FaceShifter further introduces an occlusion awareness module, significantly improving the realism of face swapping in complex scenes. Facial reenactment, based on dynamic expression transfer and spatiotemporal consistency constraints, achieves cross-subject mapping of the target person's expression and posture. A typical example is the real-time reenactment framework proposed by Thies et al., based on 3D dense feature point tracking and reflection transformation. Attribute editing, through a dual-space GAN architecture and a Transformer feature decoupling mechanism, achieves fine control over facial attributes such as age, gender, and expression. Full-face synthesis, relying on latent space decoupling and progressive generation strategies, achieves unsupervised generation from random noise to high-resolution face images. In terms of technological evolution, the progressive training strategy proposed by ProGAN addresses the mode collapse problem in high-resolution image generation through a resolution-hierarchical optimization mechanism. StyleGAN further introduces a mapping network and a style mixing module, achieving hierarchical control over the visual features of the generated image. With the increasing complexity of generative models, deepfake technology has made significant breakthroughs in visual realism, semantic consistency, and operational robustness. Its generated results have become nearly indistinguishable from real images, posing a serious challenge to personal privacy, social security, and the public opinion environment.

[0003] Early detection methods mainly relied on traditional image classification networks, such as Xception and EfficientNet, which improved model performance through depthwise separable convolution and compound scaling strategies, respectively. However, their detection robustness was insufficient when faced with compression distortion, viewpoint changes, and forgery methods not seen during training.

[0004] Current research on deepfake detection mainly revolves around the following technical approaches:

[0005] Local feature extraction based on CNN (Convolutional Neural Network): Deep convolutional networks, represented by Xception and EfficientNet, achieve preliminary detection by optimizing feature abstraction capabilities, but they are insufficient in modeling global semantic associations.

[0006] Fine-grained and attention-enhanced: The ability to locate fake artifacts is enhanced through multi-attention mechanisms (such as spatial-texture aggregation modules and operation trace extractors), but the coupling relationship between high-frequency noise and RGB features has not been fully explored.

[0007] Dual-stream heterogeneous feature fusion: Parallel networks are used to process the original RGB image and the high-frequency noise residual feature image. However, the locality limitation of CNN branches and the problem of global feature fragmentation in the traditional dual-stream architecture have not yet been solved.

[0008] Frequency domain and time domain analysis: Frequency domain filtering based on fast Fourier transform (such as rPPG signal heatmap) can capture physiological signal anomalies, but it is significantly affected by image compression and illumination interference; noise fingerprint analysis (such as camera model artifact enhancement) has limited discriminative power in high-fidelity forgery scenarios.

[0009] Global modeling of Vision Transformer (ViT): ViT achieves long-range dependency modeling through a self-attention mechanism, but it is less sensitive to local subtle artifacts (such as edge seams and texture inconsistencies) than CNN, and lacks an alignment mechanism for multimodal features.

[0010] The shortcomings of existing deepfake detection technologies can be summarized as follows:

[0011] Feature fragmentation problem: Relying solely on RGB space or high-frequency noise leads to the ineffective utilization of macro-micro feature complementarity;

[0012] Local-global imbalance: There is a modeling gap between the local inductive bias of CNN and the global attention of ViT, making it difficult to collaboratively capture cross-scale forgery traces;

[0013] Dynamic adaptation defects: Traditional multi-scale fusion methods (such as feature pyramids) lack learnability in response to frequency domain spoofing patterns and cannot adapt to changes in noise distribution;

[0014] Insufficient cross-modal interaction: Two-stream networks typically employ late-stage feature concatenation and lack a hierarchical feature transfer and attention-guided interaction mechanism.

Summary of the Invention

[0015] The purpose of this invention is to provide a deep fake face image detection method based on a hybrid model of dual-stream CNN and ViT, aiming to solve the technical problems of feature fragmentation, local-global imbalance and insufficient cross-modal interaction in existing technologies.

[0016] To achieve the above objectives, this invention provides a method for detecting deepfake face images based on a hybrid model of two-stream CNN and ViT, comprising the following steps:

[0017] S1: Preprocess real and fake videos to obtain original face images;

[0018] S2: Improve the high-frequency noise residual feature extraction module of the traditional SRM (Spatial Rich Model) filter to obtain an improved high-frequency noise residual feature extraction module. Based on the improved high-frequency noise residual feature extraction module, obtain the high-frequency noise residual features of the original face image to obtain a high-frequency noise residual feature image.

[0019] The improvements include:

[0020] Establish a cascaded structure of SRM filter bank and gradient operator;

[0021] A channel-by-channel gradient calculation strategy is adopted, in which the Sobel operator is applied independently to each feature channel;

[0022] A gradient magnitude normalization operation is introduced, and the dynamic range of high-frequency features is controlled by a differentiable boundary constraint function.

[0023] S3: Perform initial feature enhancement operations on the original face image and the high-frequency noise residual feature image based on a dual-stream CNN; wherein, the initial feature enhancement operations are, in sequence, multi-scale feature extraction, frequency domain enhancement, feature dimension alignment, and multi-scale feature fusion;

[0024] S4: Based on dual-stream ViT, a second feature enhancement operation is performed on the original face image and the high-frequency noise residual feature image after the first feature enhancement operation, respectively; wherein, the second feature enhancement operation is parallel encoding and bidirectional feature enhancement in sequence;

[0025] S5: The features of the original face image and the high-frequency noise residual feature image after the second feature enhancement operation are fused. Real / fake prediction is performed on the fused features using a multilayer perceptron classification head, and the detection results are output.

[0026] Optionally, the preprocessing includes video frame extraction, training set and test set division, face region localization, image size standardization, tensor format conversion, and normalization processing.

[0027] Optionally, obtaining the high-frequency noise residual of the original face image to obtain a high-frequency noise residual feature image includes:

[0028] The original face image is processed using a traditional SRM filter to obtain an initial high-frequency noise residual feature image.

[0029] By combining the Sobel gradient operator, gradient calculation is performed on each channel of the initial high-frequency noise residual feature image, and the region with a gradient greater than a preset threshold is extracted as the final high-frequency noise residual feature image corresponding to the original face image.

[0030] Optionally, the initial feature enhancement operation consists of multi-scale feature extraction, frequency domain enhancement, feature dimension alignment, and multi-scale feature fusion, specifically:

[0031] Multi-scale feature extraction: The original face image and the high-frequency noise residual feature image are respectively input into a dual-stream CNN branch. Each branch passes through a four-level convolutional layer ensemble module in sequence to obtain four feature maps with decreasing resolution.

[0032] Frequency domain enhancement: Learnable spatial frequency domain filters and learnable channel frequency domain filters are constructed using Fast Fourier Transform. The learnable high-frequency spatial filters and the learnable high-frequency channel filters are used to extract the feature maps of high-frequency spatial information and cross-channel high-frequency anomalies of the four feature maps of the original face image, as well as the high-frequency noise residual features of the high-frequency noise residual feature image at the corresponding scale level. At each scale level, the feature maps of high-frequency spatial information and cross-channel high-frequency anomalies of the original face image are added together and then processed using the SE attention mechanism to obtain an attention weight. The result of the addition is multiplied by the attention weight and then added to the high-frequency noise residual features of the high-frequency noise residual feature image at the current scale level to achieve fusion.

[0033] Feature Dimension Alignment: Feature mapping is performed on the features of the last three resolutions, channel expansion and spatial compression are implemented to convert them into 768-dimensional vectors; at the same time, the four-dimensional tensor format is reconstructed into a three-dimensional tensor to adapt to the feature format requirements of the ViT network.

[0034] Multi-scale feature fusion: The image features of the last three resolutions of the two branches in three-dimensional tensor format are formed into multi-scale feature blocks by cross-layer stitching.

[0035] Optionally, the construction of the learnable high-frequency spatial filter includes:

[0036] First, the input image tensor is transformed from the spatial domain to the frequency domain using a two-dimensional fast Fourier transform, and the zero frequency point in the frequency domain is transferred to the center of the spectrum.

[0037] Secondly, a learnable parameter is defined as the standard deviation of the Gaussian function, which is used to construct a low-pass mask for the shape of the two-dimensional Gaussian function.

[0038] Then, the low-pass mask in the shape of a two-dimensional Gaussian function is inverted into a high-pass mask, with a center value of 0 and a perimeter value of 1. The frequency domain image obtained by the two-dimensional fast Fourier transform is then multiplied element-wise with the high-pass mask.

[0039] Finally, the frequency domain signal after element-wise multiplication is converted back to the image spatial domain using the inverse Fourier transform. The real part of the complex result is obtained after the inverse Fourier transform and the ReLU activation function is applied. The final output is a feature map containing only high-frequency spatial information.
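The four construction steps above can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions rather than the patented implementation: the function name `high_freq_spatial` is hypothetical, and `sigma` is passed as a plain number here, whereas the method treats it as a learnable parameter inside a trainable network.

```python
import numpy as np

def high_freq_spatial(x, sigma):
    """Sketch of the high-frequency spatial filter.

    x:     real array of shape (B, C, H, W)
    sigma: Gaussian standard deviation in frequency bins
           (learnable in the described method, a constant here)
    """
    _, _, H, W = x.shape
    # Step 1: 2-D FFT over the spatial axes, zero frequency moved to the centre.
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(-2, -1)), axes=(-2, -1))
    # Step 2: Gaussian low-pass mask centred on the zero-frequency bin.
    u = np.arange(H)[:, None] - H // 2
    v = np.arange(W)[None, :] - W // 2
    low_pass = np.exp(-(u ** 2 + v ** 2) / (2.0 * sigma ** 2))
    # Step 3: invert to a high-pass mask (0 at the centre, towards 1 outwards).
    high_pass = 1.0 - low_pass
    filtered = spec * high_pass
    # Step 4: inverse FFT, keep the real part, apply ReLU.
    out = np.fft.ifft2(np.fft.ifftshift(filtered, axes=(-2, -1)), axes=(-2, -1))
    return np.maximum(out.real, 0.0)
```

Because the mask is exactly 0 at the spectrum centre, a constant (purely low-frequency) input is mapped to zero, which is the behaviour the high-pass design is meant to produce.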

[0040] Optionally, the construction of the learnable high-frequency channel filter includes:

[0041] First, the channel dimension of the image tensor is processed using a one-dimensional fast Fourier transform, and the zero frequency point of the channel dimension is transferred to the center.

[0042] Secondly, a learnable parameter is defined to construct a low-pass mask with a one-dimensional Gaussian function shape.

[0043] Then, the low-pass mask with the shape of a one-dimensional Gaussian function is inverted into a high-pass mask, and the channel frequency domain signal obtained by the one-dimensional fast Fourier transform is multiplied element by element with the high-pass mask.

[0044] Finally, the inverse Fourier transform is used to convert the channel frequency domain signal after element-wise multiplication back to the original channel representation. The real part of the complex result after the inverse Fourier transform is obtained and the ReLU activation function is applied. The final output is a feature map containing cross-channel high-frequency anomalies.
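The channel-dimension variant differs from the spatial filter only in that the transform runs along the channel axis. A minimal NumPy sketch, assuming a hypothetical function name `high_freq_channel` and a fixed `sigma` in place of the learnable parameter:

```python
import numpy as np

def high_freq_channel(x, sigma):
    """Sketch of the high-frequency channel filter.

    x:     real array of shape (B, C, H, W)
    sigma: Gaussian standard deviation in channel-frequency bins
           (learnable in the described method, a constant here)
    """
    C = x.shape[1]
    # Step 1: 1-D FFT along the channel axis, zero frequency moved to the centre.
    spec = np.fft.fftshift(np.fft.fft(x, axis=1), axes=1)
    # Step 2: 1-D Gaussian low-pass mask over channel frequencies.
    k = np.arange(C) - C // 2
    low_pass = np.exp(-(k ** 2) / (2.0 * sigma ** 2))
    # Step 3: invert to a high-pass mask and broadcast over (B, C, H, W).
    high_pass = (1.0 - low_pass)[None, :, None, None]
    filtered = spec * high_pass
    # Step 4: inverse FFT along channels, keep the real part, apply ReLU.
    out = np.fft.ifft(np.fft.ifftshift(filtered, axes=1), axis=1)
    return np.maximum(out.real, 0.0)
```

A feature map whose channels are all identical has no cross-channel variation, so this sketch returns zeros for it; only channel-to-channel differences survive the filter.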

[0045] Optionally, the secondary feature enhancement operation consists of parallel encoding and bidirectional feature enhancement, specifically:

[0046] Parallel encoding: First, the original face image and the high-frequency noise residual feature image after the first feature enhancement operation are each segmented into image block sequences. Then, class tokens and relative position encodings are added to the sequences, and the encoded image block sequences are fed into 12 Transformer Blocks for processing.

[0047] Bidirectional feature enhancement: The 12 Transformer Blocks are divided into three groups: [1, 2, 3, 4], [5, 6, 7, 8], and [9, 10, 11, 12]. Before the encoded image block sequence is fed into the Transformer Blocks of each group, ViT features are used as queries and CNN multi-scale convolutional features as key-value pairs, and a multi-head attention mechanism enhances the transfer of CNN local features to ViT global features. After all Transformer Blocks in the group have been processed, CNN multi-scale convolutional features are used as queries and ViT features as key-value pairs, and a multi-head attention mechanism enhances the transfer of ViT global features to CNN local features.
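The group-wise schedule described above can be sketched as follows for a single branch. This is a simplified illustration, not the patented network: the learned query/key/value projections are omitted, the residual addition is an assumption, and the names `cross_attention`, `run_groups` and the `block` callback (standing in for one Transformer Block) are hypothetical.

```python
import numpy as np

def softmax(a, axis=-1):
    a = a - a.max(axis=axis, keepdims=True)
    e = np.exp(a)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, num_heads=4):
    """Multi-head attention with `query` tokens attending to `context` tokens.

    query: (B, Nq, D), context: (B, Nk, D). Learned projections are omitted
    and a residual connection is assumed, both for brevity.
    """
    B, Nq, D = query.shape
    Nk = context.shape[1]
    d = D // num_heads
    q = query.reshape(B, Nq, num_heads, d).transpose(0, 2, 1, 3)
    k = context.reshape(B, Nk, num_heads, d).transpose(0, 2, 1, 3)
    attn = softmax(q @ k.transpose(0, 1, 3, 2) / np.sqrt(d))  # (B, heads, Nq, Nk)
    out = (attn @ k).transpose(0, 2, 1, 3).reshape(B, Nq, D)   # values = keys here
    return query + out

def run_groups(vit_tokens, cnn_tokens, block,
               groups=((1, 2, 3, 4), (5, 6, 7, 8), (9, 10, 11, 12))):
    """Apply the bidirectional enhancement around each group of blocks."""
    for group in groups:
        # Before the group: ViT queries, CNN key-values (CNN -> ViT transfer).
        vit_tokens = cross_attention(vit_tokens, cnn_tokens)
        for layer in group:
            vit_tokens = block(vit_tokens, layer)
        # After the group: CNN queries, ViT key-values (ViT -> CNN transfer).
        cnn_tokens = cross_attention(cnn_tokens, vit_tokens)
    return vit_tokens, cnn_tokens
```

Passing the block as a callback keeps the sketch focused on the ordering of the two transfer directions around each group, which is the part the text specifies.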

[0048] Optionally, S5 specifically includes:

[0049] S51: The original face image and the high-frequency noise residual feature image after the first feature enhancement operation are each processed by 12 Transformer Blocks. Before the 1st, 5th and 9th blocks, each branch performs transfer enhancement from its CNN local features to its ViT global features. After the 4th, 8th and 12th blocks, each branch performs transfer enhancement from its ViT global features to its CNN local features. After the 5th and 9th blocks, an embedded cross-attention module establishes a feature interaction channel between the original face image branch and the high-frequency noise residual feature image branch.

[0050] S52: After obtaining the features of the two branches after processing by 12 Transform Blocks, the features of the two branches are first concatenated to form joint features. Then, a fully connected layer is used to perform preliminary fusion of the concatenated features. Next, the preliminary fused features are sent to the channel attention module for weighted processing to obtain the final fused features.

[0051] S53: A multilayer perceptron classification head performs real / fake prediction based on the final fused features and outputs the final deepfake face image detection result.
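Steps S52 and S53 can be sketched as a small NumPy function. Several details are assumptions made only for illustration: each branch is taken to contribute one feature vector per sample, the channel attention is modeled as a squeeze-and-excitation style gate, the classification head has one hidden layer and two outputs, and the function name and all keys of the `params` dictionary are hypothetical.

```python
import numpy as np

def fuse_and_classify(rgb_feat, noise_feat, params):
    """rgb_feat, noise_feat: (B, D) branch features; returns (B, 2) probabilities."""
    # S52: concatenate the two branches into joint features.
    joint = np.concatenate([rgb_feat, noise_feat], axis=-1)           # (B, 2D)
    # S52: fully connected layer for preliminary fusion.
    fused = joint @ params["w_fuse"] + params["b_fuse"]               # (B, D)
    # S52: channel attention (assumed bottleneck + sigmoid gate).
    hidden = np.maximum(fused @ params["w_down"], 0.0)
    gate = 1.0 / (1.0 + np.exp(-(hidden @ params["w_up"])))
    final = fused * gate
    # S53: multilayer perceptron head producing real / fake scores.
    h = np.maximum(final @ params["w1"] + params["b1"], 0.0)
    logits = h @ params["w2"] + params["b2"]                          # (B, 2)
    logits = logits - logits.max(axis=-1, keepdims=True)
    probs = np.exp(logits)
    return probs / probs.sum(axis=-1, keepdims=True)
```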

[0052] The beneficial effects of this invention are as follows: This invention constructs a "filter-gradient" two-stage residual extraction module, realizing the decoupling and enhancement of high-frequency noise and RGB features; it designs an adaptive frequency domain filter bank to dynamically optimize multi-scale feature fusion in Fourier space; it introduces a bidirectional feature transfer mechanism of CNN and ViT to achieve synergistic enhancement of local details and global semantics; and it embeds a hierarchical cross-attention module to establish the correlation reasoning of spatial-frequency domain features in the deep layers of the Transformer, thereby enhancing the model's detection performance on deepfake face images.

Attached Figure Description

[0053] Figure 1 is a flowchart of the steps of the present invention;

[0054] Figure 2 shows the first layer of the multi-scale convolutional feature extraction module of the present invention, with a resolution of 56×56;

[0055] Figure 3 shows the second, third and fourth layers of the multi-scale convolutional feature extraction module of the present invention, with resolutions of 28×28, 14×14 and 7×7, respectively;

[0056] Figure 4 shows the spatial-frequency domain joint feature extraction module of the present invention;

[0057] Figure 5 shows the dual-modal cross-attention mechanism module of the present invention;

[0058] Figure 6 shows the dual-modal feature fusion module of the present invention;

[0059] Figure 7 shows the overall network structure of the deepfake face image detection model of the present invention;

[0060] Figure 8 shows the confusion matrix of the deepfake face image detection model of the present invention on various deepfake methods.

Detailed Implementation

[0061] The present invention will be further described below with reference to the accompanying drawings. The exemplary embodiments can be implemented in many forms, but the scope of protection of the present invention is not limited to this embodiment. Unless otherwise specified, the embodiments of the present invention and the features thereof can be combined with each other. Therefore, the present invention sets forth many details to facilitate a good understanding of the implementation of the method by those skilled in the art.

[0062] Example 1: As shown in Figure 1, which is a flowchart of the deepfake face image detection method based on a hybrid model of dual-stream CNN and ViT, the method comprises the following steps:

[0063] S1: Preprocess real and fake videos to obtain original face images.

[0064] Optionally, video frame extraction and dataset partitioning are performed. For each real video and its corresponding fake video, 50 video frames are extracted, and all extracted video frames are divided into training set and test set according to the proportion. The storage location information of the images and the real / fake labels are stored in a CSV file according to the dataset partitioning, so that the data can be loaded during subsequent training and testing.

[0065] Optionally, face region localization is performed. During the training phase, since the extracted video frames are full screenshots of the video, after loading the images according to the CSV file of the segmented dataset, a face detection algorithm (Dlib library) is used to accurately identify and extract the face regions in the images to obtain face region images for each video frame, providing key information for subsequent analysis.

[0066] Optionally, image standardization and normalization processing is performed. The extracted face region images are standardized to a fixed size of 224×224 pixels, converted to tensor format, and normalized to make the data meet the training and testing requirements of the model.
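The frame extraction and dataset partitioning part of S1 can be sketched with the Python standard library. The split ratio, the directory layout of the frame paths, the label encoding (0 for real, 1 for fake), and the choice to split at the video level are all assumptions made for illustration; the source only states that 50 frames per video are extracted, divided proportionally, and recorded in CSV files. The function name `write_split_csv` is hypothetical.

```python
import csv
import random

def write_split_csv(videos, train_csv, test_csv,
                    frames_per_video=50, train_ratio=0.8, seed=0):
    """videos: iterable of (video_id, label) pairs, label 0 = real, 1 = fake.

    Writes one CSV per subset with an image path and a label on each row.
    """
    rng = random.Random(seed)
    videos = list(videos)
    rng.shuffle(videos)
    cut = int(len(videos) * train_ratio)  # assumed ratio, not given in the source
    for path, subset in ((train_csv, videos[:cut]), (test_csv, videos[cut:])):
        with open(path, "w", newline="") as f:
            writer = csv.writer(f)
            writer.writerow(["image_path", "label"])
            for video_id, label in subset:
                for i in range(frames_per_video):
                    # Hypothetical storage layout for the extracted frames.
                    writer.writerow([f"frames/{video_id}/{i:03d}.png", label])
```

Splitting by video rather than by frame keeps frames of the same clip out of both subsets at once, which is one common way to avoid leakage between training and testing.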

[0067] S2: Improve the high-frequency noise residual feature extraction module of the traditional SRM filter to obtain an improved high-frequency noise residual feature extraction module. Based on the improved high-frequency noise residual feature extraction module, obtain the high-frequency noise residual features of the original face image to obtain a high-frequency noise residual feature image.

[0068] Optionally, the improved high-frequency noise residual feature extraction module has been improved based on the traditional SRM filter as follows:

[0069] Improvement 1: A two-stage feature enhancement mechanism is proposed. By cascading the SRM filter bank (first-level feature extraction) and the gradient operator (second-level feature enhancement), a "filter-gradient" two-stage feature enhancement mechanism is constructed. Compared with the traditional single-stage SRM method, this design can capture high-frequency forgery traces across scales more precisely.

[0070] Improvement 2: Channel-independent gradient analysis is proposed, employing a channel-by-channel gradient calculation strategy to process each feature channel independently using the Sobel operator. This preserves the differentiated characteristic responses between different filter channels and avoids the high-frequency feature confusion problem caused by traditional multi-channel mixed processing.

[0071] Improvement 3: A dynamic range constraint mechanism is proposed, introducing gradient magnitude normalization and using a differentiable boundary constraint function (clamp(0,1)) to achieve dynamic range control of high-frequency features. This suppresses extreme noise interference while preserving the interpretability of the features.

[0072] Furthermore, in the SRM filtering stage, the method for extracting high-frequency noise residual feature images uses three conventional SRM filters with fixed weights to obtain high-frequency features and edge information from the image, specifically:

[0073] The first filter weight, filter1, is used to detect changes in the horizontal gradient, reflecting the intensity of horizontal texture and edges. The formula is as follows:

[0074] (1)

[0075] The second filter weight, filter2, is used to detect changes in the vertical gradient, reflecting the intensity of vertical texture and edges. The formula is as follows:

[0076] (2)

[0077] The third filter weight, filter3, is used to detect second-order horizontal texture features and fine-line noise. The formula is as follows:

[0078] (3)

[0079] Stacked filters, the formula is as follows:

[0080] (4)

[0081] Furthermore, during image processing, the original image $x$ is convolved with the stacked filter set $W$ in a two-dimensional convolution operation. Specifically, each channel of the filter is multiplied element-wise (i.e., point-wise) with the corresponding channel of the input image, and the resulting products are then summed to generate a single-channel output. In this convolution process, the filter slides within a spatial domain of H (image height) × W (image width) to traverse the entire input image and ultimately generate a complete output feature map. This convolution operation can be expressed by the following mathematical formula:

[0082] $$y_{b,c,h,w} = \sum_{i}\sum_{u}\sum_{v} x_{b,i,h+u,w+v}\, W_{c,i,u,v}$$ (5)

[0083] Where $b$ indexes the b-th sample, $c$ indexes the output channel, $(h,w)$ represents the position coordinates (row and column) on the output feature map, $i$ indexes the input channel, and $(u,v)$ represents the position index inside the filter. $x_{b,i,h+u,w+v}$ represents the pixel value of the input image at position $(h+u, w+v)$ in the b-th sample and i-th channel. $W_{c,i,u,v}$ represents the weight value of the convolution kernel at output channel $c$, input channel $i$, and internal position $(u,v)$ of the filter.
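The summation over input channels and filter positions described above can be checked with a direct NumPy transcription. The function name `conv2d_valid` is hypothetical, no padding is applied, and the filter weights are supplied by the caller; the actual SRM weights are not reproduced here.

```python
import numpy as np

def conv2d_valid(x, weight):
    """Direct form of the convolution sum, without padding.

    x:      (B, I, H, W) input images
    weight: (C, I, K, K) stacked filters
    returns (B, C, H-K+1, W-K+1)
    """
    B, I, H, W = x.shape
    C, _, K, _ = weight.shape
    out = np.zeros((B, C, H - K + 1, W - K + 1))
    for u in range(K):
        for v in range(K):
            # Input window shifted by (u, v): x[b, i, h+u, w+v].
            patch = x[:, :, u:u + H - K + 1, v:v + W - K + 1]
            # Sum over input channels i for every output channel c.
            out += np.einsum("bihw,ci->bchw", patch, weight[:, :, u, v])
    return out
```

With an all-ones 3-channel input and all-ones 3×3 filters, every output value is 3 × 3 × 3 = 27, which gives a quick sanity check of the index pattern.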

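[0084] Furthermore, in the gradient enhancement stage, Sobel gradient analysis is performed independently on each feature channel $y_c$ output by the SRM filtering stage, specifically as follows:

[0085] The horizontal gradient $G_x^{(c)}$ is computed as follows:

[0086] $$G_x^{(c)} = \begin{bmatrix} -1 & 0 & 1 \\ -2 & 0 & 2 \\ -1 & 0 & 1 \end{bmatrix} * y_c$$ (6)

[0087] The vertical gradient $G_y^{(c)}$ is computed as follows:

[0088] $$G_y^{(c)} = \begin{bmatrix} -1 & -2 & -1 \\ 0 & 0 & 0 \\ 1 & 2 & 1 \end{bmatrix} * y_c$$ (7)

[0089] The gradient magnitude $G^{(c)}$ is obtained as follows:

[0090] $$G^{(c)} = \sqrt{\left(G_x^{(c)}\right)^2 + \left(G_y^{(c)}\right)^2}$$ (8)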

[0091] Furthermore, the multi-channel features are synthesized and normalized to produce the final output, which serves as the high-frequency noise residual feature. With B denoting the size of each batch of training data fed into the model during training, the expression for the high-frequency noise residual feature is:

[0092] (9)

[0093] where clamp is a PyTorch function used to limit values to a specified range.
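The gradient enhancement stage can be sketched in NumPy as below. It is an illustration under assumptions: the standard 3×3 Sobel kernels and zero padding are assumed, `np.clip` stands in for the PyTorch `clamp` named in the text, any further normalization is left out because the source does not spell it out, and the names `filter2d_same` and `gradient_enhance` are hypothetical.

```python
import numpy as np

SOBEL_X = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
SOBEL_Y = SOBEL_X.T

def filter2d_same(img, kernel):
    """Slide a 3x3 kernel over one 2-D channel with zero padding."""
    H, W = img.shape
    padded = np.pad(img, 1)
    out = np.zeros((H, W))
    for u in range(3):
        for v in range(3):
            out += kernel[u, v] * padded[u:u + H, v:v + W]
    return out

def gradient_enhance(residual):
    """residual: (B, C, H, W) SRM output; returns clamped gradient magnitude."""
    out = np.empty_like(residual, dtype=float)
    for b in range(residual.shape[0]):
        for c in range(residual.shape[1]):
            # Each channel is processed on its own (channel-independent analysis).
            gx = filter2d_same(residual[b, c], SOBEL_X)
            gy = filter2d_same(residual[b, c], SOBEL_Y)
            out[b, c] = np.sqrt(gx ** 2 + gy ** 2)
    # Dynamic range constraint, the NumPy analogue of clamp(0, 1).
    return np.clip(out, 0.0, 1.0)
```

Processing each channel in its own loop iteration mirrors the channel-by-channel strategy of Improvement 2: responses from different SRM filters are never mixed before the magnitude is taken.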

[0094] S3: Perform initial feature enhancement operations on the original face image and the high-frequency noise residual feature image based on a dual-stream CNN; wherein, the initial feature enhancement operations are, in sequence, multi-scale feature extraction, frequency domain enhancement, feature dimension alignment and multi-scale feature fusion.

[0095] It's important to understand that deepfake images often exhibit inconsistencies or noise anomalies in high-frequency regions or specific channel frequencies. To enhance the model's ability to model subtle tampering traces in deepfake images, this step constructs a spatial-frequency joint enhancement prior module. This module guides the network to capture more subtle frequency and texture residual features in the forged regions by fusing a learnable frequency domain filtering mechanism with a SE attention mechanism. Two learnable frequency domain filtering modules are used to suppress low frequencies in the spatial and channel dimensions of the image, respectively, to obtain more discernible forgery traces. This assists the backbone network in extracting richer and more sensitive forgery clues from the original image. The main structure of this module is designed as follows:

[0096] As shown in Figures 2 and 3, the multi-scale convolutional feature extraction module performs multi-scale feature extraction. The original face image and the high-frequency noise residual feature image are respectively input into the dual-stream CNN branches. Each branch sequentially passes through a four-level convolutional layer ensemble module to gradually extract multi-scale features of 56×56 pixels (64 dimensions), 28×28 pixels (128 dimensions), 14×14 pixels (256 dimensions), and 7×7 pixels (512 dimensions).

[0097] The spatial-frequency domain joint feature extraction module performs frequency domain enhancement, specifically by constructing a learnable spatial-frequency domain filter and a learnable channel-frequency domain filter using Fast Fourier Transform. The learnable high-frequency spatial filter and the learnable high-frequency channel filter are used to extract feature maps of high-frequency spatial information and cross-channel high-frequency anomalies from the four feature maps of the original face image, as well as high-frequency noise residual features at the corresponding scale level of the high-frequency noise residual feature image. At each scale level, the feature map of high-frequency spatial information of the original face image and the feature map of cross-channel high-frequency anomalies are added together, and then processed using the SE attention mechanism to obtain an attention weight. The result of the addition is multiplied by the attention weight, and then added to the high-frequency noise residual features at the current scale level of the high-frequency noise residual feature image to achieve fusion.

[0098] As shown in Figure 4, in the spatial-frequency domain joint feature extraction module, low-frequency components are suppressed by the Gaussian masks in the learnable spatial frequency domain filter and the learnable channel frequency domain filter, which preserves the more sensitive high-frequency forgery features. By targeting the frequency components that are sensitive to deep forgery, these features adaptively enhance the representation so that subtle tampering traces can be captured. Furthermore, the SE attention mechanism adaptively controls the influence of the frequency domain enhancement features on the branch features of the high-frequency noise residual feature image, highlighting the tampered region.
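The per-scale fusion described in the frequency domain enhancement step (add the two high-frequency maps, weight the sum with SE attention, then add the noise-branch feature) can be sketched as follows. The two-layer bottleneck form of the SE gate and the names `se_weights`, `fuse_scale`, `w_down` and `w_up` are assumptions for illustration.

```python
import numpy as np

def se_weights(feat, w_down, w_up):
    """Squeeze-and-excitation gate: (B, C, H, W) -> (B, C, 1, 1)."""
    squeezed = feat.mean(axis=(2, 3))                    # global average pooling
    hidden = np.maximum(squeezed @ w_down, 0.0)          # bottleneck + ReLU
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_up)))        # sigmoid in (0, 1)
    return gate[:, :, None, None]

def fuse_scale(hf_spatial, hf_channel, noise_feat, w_down, w_up):
    """Fuse RGB-branch frequency features with the noise-branch feature.

    All three feature maps share the shape (B, C, H, W) of one scale level.
    """
    summed = hf_spatial + hf_channel                     # add the two HF maps
    weighted = summed * se_weights(summed, w_down, w_up) # attention weighting
    return weighted + noise_feat                         # add to noise branch
```

The gate lies strictly between 0 and 1, so the frequency-enhanced features are only ever attenuated before being added to the noise branch.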

[0099] Feature Dimension Alignment: Feature mapping is performed on the features of the last three resolutions, channel expansion and spatial compression are implemented to convert them into 768-dimensional vectors; at the same time, the four-dimensional tensor format is reconstructed into a three-dimensional tensor to adapt to the feature format requirements of the ViT network.

[0100] Multi-scale feature fusion: The image features of the last three resolutions of the two branches, in three-dimensional tensor format, are combined into multi-scale feature blocks by cross-layer stitching. These blocks are then used by the subsequent Vision Transformer (ViT) network to further fuse the spatial-frequency joint multi-scale convolutional feature information.
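The dimension alignment and stitching steps can be pictured with a shape-level NumPy sketch for one branch. The linear projection to 768 dimensions, the concatenation along the token axis, and the names `to_tokens` and `multi_scale_tokens` are assumptions; the "spatial compression" mentioned in the text is not modeled because the source does not specify it.

```python
import numpy as np

def to_tokens(feat, proj):
    """Reshape a 4-D feature map into a 3-D token sequence.

    feat: (B, C, H, W), proj: (C, 768) -> (B, H*W, 768)
    """
    B, C, H, W = feat.shape
    tokens = feat.reshape(B, C, H * W).transpose(0, 2, 1)   # (B, H*W, C)
    return tokens @ proj                                    # channel expansion

def multi_scale_tokens(features, projections):
    """Stitch the token sequences of several scales along the token axis."""
    return np.concatenate(
        [to_tokens(f, p) for f, p in zip(features, projections)], axis=1
    )
```

Under these assumptions, the last three scales (28×28, 14×14 and 7×7) contribute 784 + 196 + 49 = 1029 tokens of dimension 768 per branch.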

[0101] Furthermore, the construction of the learnable spatial frequency domain filter and the learnable channel frequency domain filter in this embodiment will be described in detail:

[0102] The learnable high-frequency spatial filter (LearnableHFS) applies a two-dimensional fast Fourier transform to the input feature map and combines the resulting spectrum with a learnable Gaussian mask. This suppresses the low-frequency region at the center of the spectrum and strengthens the response to high-frequency textures.

[0103] Specifically, the two-dimensional Fast Fourier Transform (FFT2) is formulated as follows:

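[0104] $$F(u, v) = \mathrm{FFT2}(x) = \sum_{h=0}^{H-1} \sum_{w=0}^{W-1} x(h, w)\, e^{-j 2\pi \left( \frac{u h}{H} + \frac{v w}{W} \right)} \tag{10}$$

[0105] Here $x(h, w)$ is the two-dimensional feature map of one channel of the input x, with height $H$ and width $W$, and $F(u, v)$ is its spectrum. The two-dimensional fast Fourier transform is performed on the feature map of each channel of the input x, transforming it from the spatial domain to the frequency domain. Forged regions are often accompanied by local frequency anomalies, which are more pronounced in the frequency domain.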

[0106] Furthermore, a mask in the shape of a two-dimensional Gaussian function is constructed in the frequency domain: a two-dimensional Gaussian kernel centered at the center of the spectrum is generated, as shown in the following formula:

[0107] $$G(u, v) = \exp\left( -\frac{(u - u_0)^2 + (v - v_0)^2}{2\sigma^2} \right) \tag{11}$$

[0108] In the formula, $(u, v)$ represents the coordinates in the spectrum, and $(u_0, v_0)$ represents the center of the spectrum, where the lowest frequency lies. The mask takes its largest value at the low frequencies in the center and gradually decreases outwards; its inverse therefore suppresses the low frequencies at the center while retaining the high-frequency information. $\sigma$ is a learnable parameter that controls the mask size, so the network can learn a high-frequency filtering window suited to the current task.

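[0109] Furthermore, the formula for frequency suppression and enhancement is as follows:

[0110] $$F_{\mathrm{high}}(u, v) = F(u, v) \cdot \bigl( 1 - G(u, v) \bigr) \tag{12}$$

[0111] Here $F$ is the spectrum obtained by the two-dimensional fast Fourier transform and $G$ is the Gaussian mask, so $1 - G$ is the inverse Gaussian mask. Multiplying the spectrum element-wise by the inverse Gaussian mask preserves the high-frequency components of the image. After processing, low-frequency information, such as large outlines and smooth background areas, is weakened, while locally sensitive areas such as edges and textures are enhanced.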

[0112] Furthermore, the two-dimensional inverse Fourier transform (IFFT2) restores the frequency-domain enhanced data to a spatial-domain image, as follows:

[0113] $$x_{\mathrm{hfs}} = \mathrm{ReLU}\Bigl( \mathrm{Re}\bigl( \mathrm{IFFT2}(F_{\mathrm{high}}) \bigr) \Bigr) \tag{13}$$

[0114] Here $F_{\mathrm{high}}$ is the spectrum after low-frequency suppression. The IFFT2 outputs complex numbers, and $\mathrm{Re}(\cdot)$ takes the real part as the image value. The ReLU activation function removes negative values, ensuring that the image features are non-negative. The result $x_{\mathrm{hfs}}$ is an image feature map containing only high-frequency information, which is sensitive to the boundaries and residues of forged areas.
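
The four steps of the high-frequency spatial filter (FFT2, Gaussian mask, inverse masking, IFFT2 with real part and ReLU) can be illustrated with a short NumPy sketch. It is not the patented implementation: the function name `learnable_hfs` is hypothetical, and `sigma` is passed as a plain number, whereas in a trainable model it would be a learnable parameter.

```python
import numpy as np

def learnable_hfs(x, sigma):
    """High-frequency spatial filtering of x with shape [B, C, H, W].

    sigma stands in for the learnable Gaussian standard deviation.
    """
    h, w = x.shape[-2:]
    # 2D FFT of every channel, zero frequency shifted to the spectrum centre.
    spec = np.fft.fftshift(np.fft.fft2(x, axes=(-2, -1)), axes=(-2, -1))
    # Gaussian low-pass mask centred on the zero frequency, then inverted.
    u = np.arange(h)[:, None] - h // 2
    v = np.arange(w)[None, :] - w // 2
    low_pass = np.exp(-(u ** 2 + v ** 2) / (2.0 * sigma ** 2))
    high_pass = 1.0 - low_pass
    # Suppress low frequencies, go back to the spatial domain, keep Re, apply ReLU.
    spec_high = np.fft.ifftshift(spec * high_pass, axes=(-2, -1))
    restored = np.fft.ifft2(spec_high, axes=(-2, -1))
    return np.maximum(restored.real, 0.0)

flat = np.full((1, 3, 32, 32), 5.0)   # constant image: only a zero-frequency component
edges = np.zeros((1, 3, 32, 32))
edges[..., :, 16:] = 1.0              # vertical step edge
out_flat = learnable_hfs(flat, sigma=4.0)
out_edges = learnable_hfs(edges, sigma=4.0)
```

A constant image is filtered to (numerically) zero because the inverted mask is exactly 0 at the spectrum center, while the step edge leaves a positive response near the discontinuity.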

[0115] The learnable high-frequency channel filter (LearnableHFC) performs a Fourier transform along the channel dimension, i.e., an FFT along the channel axis, to model the frequency distribution across channels, in particular the cross-channel perturbations of high-frequency forged textures. Its mask is constructed as follows.

[0116] Specifically, the formula for the one-dimensional Fourier transform (FFT) is as follows:

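[0117] $$F_{\mathrm{ch}}(k) = \mathrm{FFT}(x) = \sum_{c=0}^{C-1} x(c)\, e^{-j 2\pi \frac{k c}{C}} \tag{14}$$

[0118] Here $x(c)$ is the value of the input feature at channel $c$, $C$ is the number of channels, and $F_{\mathrm{ch}}(k)$ is the channel spectrum. The one-dimensional Fourier transform is performed on the input features along the channel dimension (dim=1). When the color or texture of an image is tampered with, inconsistencies may arise between the RGB channels or between feature channels; these anomalies can be captured by obtaining the channel spectrum.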

[0119] Furthermore, a mask in the shape of a one-dimensional Gaussian function is constructed for the channel dimension: a one-dimensional Gaussian kernel centered at the center of the spectrum is generated, as shown in the following formula:

[0120] $$G_{\mathrm{ch}}(k) = \exp\left( -\frac{(k - k_0)^2}{2\sigma_{\mathrm{ch}}^2} \right) \tag{15}$$

[0121] In the formula, $\sigma_{\mathrm{ch}}$ is a learnable parameter that controls the mask size, $k$ indicates the channel index, and $k_0$ indicates the middle position, where the low-frequency channel components lie. Low-frequency components in the channel dimension are suppressed, while the high-frequency characteristics between channels are preserved.

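[0122] Furthermore, the frequency suppression and enhancement formula is as follows:

[0123] $$F_{\mathrm{ch,high}}(k) = F_{\mathrm{ch}}(k) \cdot \bigl( 1 - G_{\mathrm{ch}}(k) \bigr) \tag{16}$$

[0124] Here $F_{\mathrm{ch}}$ is the channel spectrum and $G_{\mathrm{ch}}$ is the one-dimensional Gaussian mask. Multiplying the spectrum element-wise by the inverse Gaussian mask $1 - G_{\mathrm{ch}}$ preserves the high-frequency components.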

[0125] Furthermore, the formula for restoring the channel-frequency enhanced data with the one-dimensional inverse Fourier transform (IFFT) is as follows:

[0126] $$x_{\mathrm{hfc}} = \mathrm{ReLU}\Bigl( \mathrm{Re}\bigl( \mathrm{IFFT}(F_{\mathrm{ch,high}}) \bigr) \Bigr) \tag{17}$$

[0127] As with the spatial filter, low-frequency components are suppressed; here the one-dimensional inverse Fourier transform restores the frequency-domain enhanced data $F_{\mathrm{ch,high}}$ to the channel domain. The output feature map $x_{\mathrm{hfc}}$ preserves high-frequency changes across channels, which helps the model capture forgery traces such as color artifacts and channel inconsistencies.
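
The channel filter follows the same pattern along the channel axis. The NumPy sketch below is illustrative only: `learnable_hfc` is a hypothetical name and `sigma` is a fixed number standing in for the learnable parameter.

```python
import numpy as np

def learnable_hfc(x, sigma):
    """High-frequency channel filtering of x with shape [B, C, H, W]."""
    c = x.shape[1]
    # 1D FFT along the channel axis (dim=1), zero frequency shifted to the centre.
    spec = np.fft.fftshift(np.fft.fft(x, axis=1), axes=1)
    # 1D Gaussian low-pass mask over channel frequencies, then inverted.
    k = np.arange(c) - c // 2
    low_pass = np.exp(-(k ** 2) / (2.0 * sigma ** 2))
    high_pass = (1.0 - low_pass)[None, :, None, None]
    # Mask, undo the shift, inverse FFT, keep the real part, apply ReLU.
    restored = np.fft.ifft(np.fft.ifftshift(spec * high_pass, axes=1), axis=1)
    return np.maximum(restored.real, 0.0)

rng = np.random.default_rng(0)
pattern = rng.standard_normal((1, 1, 4, 4))
same = np.tile(pattern, (1, 8, 1, 1))          # identical content in all 8 channels
signs = np.array([1.0, -1.0] * 4)[None, :, None, None]
alternating = np.ones((1, 8, 4, 4)) * signs    # channels flip sign: highest channel frequency
out_same = learnable_hfc(same, sigma=1.0)
out_alt = learnable_hfc(alternating, sigma=1.0)
```

Features that are identical across channels carry only the zero channel frequency and are removed, whereas channel-to-channel alternation passes almost unchanged (before the ReLU clips its negative half).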

[0128] The SE attention fusion module adaptively selects key information from the frequency domain enhancement features and highlights salient forged regions; its output is used to reweight the fused high-frequency features. The feature fusion formula is as follows:

[0129] $$F_{\mathrm{fuse}} = F_{n} + \mathrm{SE}(F_{s} + F_{c}) \odot (F_{s} + F_{c}) \tag{18}$$

[0130] In the formula, $F_{n}$ represents the features of the high-frequency noise residual image stream obtained by the convolutional module at the current level; $F_{s}$ represents the high-frequency enhanced features obtained by applying frequency-domain spatial filtering to $F_{r}$, the features of the original image stream obtained by the convolutional module at the current level; $F_{c}$ represents the high-frequency enhanced features obtained by applying frequency-domain channel filtering to $F_{r}$; $\mathrm{SE}(\cdot)$ denotes the attention weight produced by the SE attention mechanism and $\odot$ denotes element-wise multiplication; and $F_{\mathrm{fuse}}$ represents the features obtained by fusing the high-frequency noise residual features with the frequency domain enhancement features of the original image stream. The goal is to enhance high-frequency information only in forged regions, while keeping non-forged regions stable.
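
The fusion rule (add the two enhanced maps, derive an SE attention weight, reweight the sum, then add the noise residual features) can be sketched as below. The SE block is written in its common squeeze-and-excitation form (global average pooling, two fully connected layers, sigmoid); the reduction ratio, weight shapes and function names are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def se_weights(feat, w1, w2):
    """Squeeze-and-excitation channel weights for feat of shape [B, C, H, W]."""
    squeezed = feat.mean(axis=(2, 3))              # squeeze: [B, C]
    hidden = np.maximum(squeezed @ w1, 0.0)        # excitation, first FC + ReLU
    return sigmoid(hidden @ w2)[:, :, None, None]  # second FC + sigmoid: [B, C, 1, 1]

def fuse(f_noise, f_spatial, f_channel, w1, w2):
    """f_noise + SE(f_spatial + f_channel) * (f_spatial + f_channel)."""
    summed = f_spatial + f_channel
    return f_noise + se_weights(summed, w1, w2) * summed

rng = np.random.default_rng(0)
b, c, h, w, r = 2, 16, 8, 8, 4                     # r: assumed reduction ratio
f_noise = rng.standard_normal((b, c, h, w))
f_spatial = np.abs(rng.standard_normal((b, c, h, w)))
f_channel = np.abs(rng.standard_normal((b, c, h, w)))
w1 = rng.standard_normal((c, c // r)) * 0.1
w2 = rng.standard_normal((c // r, c)) * 0.1
fused = fuse(f_noise, f_spatial, f_channel, w1, w2)
```

Because the attention weight lies in (0, 1), the enhanced features are scaled per channel rather than replaced, and the noise residual features pass through unchanged via the final addition.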

[0131] S4: Based on dual-stream ViT, a second feature enhancement operation is performed on the original face image and the high-frequency noise residual feature image after the first feature enhancement operation, respectively; wherein, the second feature enhancement operation is parallel encoding and bidirectional feature enhancement in sequence.

[0132] After segmenting the image into multiple patches, the ViT network uses a self-attention mechanism to establish global dependencies between the patches, which allows it to efficiently capture global inconsistencies in deepfake images. The model has strong high-dimensional feature representation capabilities, enabling it to detect subtle forgery traces in deepfake images, such as frequency domain artifacts and fine texture details.

[0133] Optionally, the secondary feature enhancement operation consists of parallel encoding and bidirectional feature enhancement, specifically:

[0134] Parallel encoding: First, the original face image and the high-frequency noise residual feature image after the first feature enhancement operation are each segmented into an image block sequence. Then, category tokens and relative position codes are added to the sequences, and the parallel-encoded image block sequences are fed into the 12-layer Transform Block for processing.

[0135] Specifically, in this embodiment, the image is first segmented into a series of patches, which are then flattened and converted into a patch sequence. Each image patch is 16x16 pixels in size, resulting in 196 image patches. The image dimension changes from [Batch_Size, 3, 224, 224] to the sequence [Batch_Size, 196, 768]. To incorporate positional information and attend to the positional relationships between patches, positional encoding and a class token are added to the patch sequence, changing its dimension to [Batch_Size, 197, 768]. The processed patch sequence is then fed into a Transformer Encoder composed of multiple stacked Blocks for forgery feature extraction. The Transformer Encoder's self-attention mechanism and feedforward network (FFN) learn the complex relationships between different elements in the patch sequence, thereby extracting global information about the image. Each Block combines a multilayer perceptron (MLP) and a multi-head self-attention mechanism to model global dependencies.
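
The dimension arithmetic can be checked with a short NumPy sketch: 224 / 16 = 14 patches per side, 14 x 14 = 196 patches, and each flattened patch holds 3 x 16 x 16 = 768 values. The zero-initialised class token and additive position embedding are placeholders (assumptions), and a real ViT would normally also apply a learned linear projection to each flattened patch.

```python
import numpy as np

def patchify(images, patch=16):
    """Split [B, C, H, W] images into flattened patches [B, N, C*patch*patch]."""
    b, c, h, w = images.shape
    gh, gw = h // patch, w // patch
    x = images.reshape(b, c, gh, patch, gw, patch)
    x = x.transpose(0, 2, 4, 1, 3, 5)              # [B, gh, gw, C, patch, patch]
    return x.reshape(b, gh * gw, c * patch * patch)

rng = np.random.default_rng(0)
images = rng.standard_normal((2, 3, 224, 224)).astype(np.float32)
patches = patchify(images)                         # [2, 196, 768]

cls_token = np.zeros((2, 1, 768), dtype=np.float32)    # placeholder class token
pos_embed = np.zeros((1, 197, 768), dtype=np.float32)  # placeholder position codes
tokens = np.concatenate([cls_token, patches], axis=1) + pos_embed
print(patches.shape, tokens.shape)  # (2, 196, 768) (2, 197, 768)
```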

[0136] Bidirectional Feature Enhancement: The 12-layer Transform Block is divided into three groups: [1, 2, 3, 4], [5, 6, 7, 8], and [9, 10, 11, 12]. Before feeding the parallel-encoded image block sequence into each group's Transform Block layer, ViT features are used as the Query and CNN multi-scale convolutional features as the Key-Value pair. A multi-head attention mechanism is used to enhance the transfer of CNN local features to ViT global features. After all Transform Block layers in that group have been processed, the CNN multi-scale convolutional features are used as the Query and ViT features as the Key-Value pair again. A multi-head attention mechanism is then used to enhance the transfer of ViT global features to CNN local features. This process includes:

[0137] The Injector module enables attention injection from local detail information of convolutional features to global structural information of ViT features, improving the ability to detect forgery clues in the global modeling process by integrating information from the original image and high-frequency noise residual feature image.

[0138] The Extractor module enables the reverse fusion of ViT features into convolutional features. It uses high-level semantic information from ViT to guide the enhancement of low-level texture information in the convolutional network, thereby further improving the accuracy of forged image recognition.
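
The two directions of transfer differ only in which stream supplies the Query. The sketch below uses single-head scaled dot-product attention with a residual connection for brevity; the text specifies multi-head attention, and the residual form, the reduced feature width, and the names `injector`, `extractor` and `cross_attention` as written here are illustrative assumptions.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, context, wq, wk, wv):
    """Single-head attention: query tokens attend to context tokens."""
    q, k, v = query @ wq, context @ wk, context @ wv
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(q.shape[-1]))
    return attn @ v

def injector(vit_tokens, cnn_tokens, params):
    # ViT features are the Query; CNN multi-scale features are Key and Value.
    return vit_tokens + cross_attention(vit_tokens, cnn_tokens, *params)

def extractor(cnn_tokens, vit_tokens, params):
    # CNN multi-scale features are the Query; ViT features are Key and Value.
    return cnn_tokens + cross_attention(cnn_tokens, vit_tokens, *params)

def make_params(rng, dim):
    return tuple(rng.standard_normal((dim, dim)) * 0.05 for _ in range(3))

rng = np.random.default_rng(0)
dim = 64                                      # reduced width to keep the example light
vit = rng.standard_normal((2, 197, dim))      # class token + 196 patches
cnn = rng.standard_normal((2, 1029, dim))     # hypothetical multi-scale token count
vit_enhanced = injector(vit, cnn, make_params(rng, dim))
cnn_enhanced = extractor(cnn, vit_enhanced, make_params(rng, dim))
```

Each call returns a sequence of the same length as its Query stream, so the ViT and CNN token sets keep their own shapes while exchanging information.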

[0139] S5: The features of the original face image and the high-frequency noise residual feature image after the secondary feature enhancement operation are fused. The true and false predictions of the fused features are performed based on the multilayer perceptron classification head, and the detection results are output.

[0140] S51: The original face image and the high-frequency noise residual feature image after the first feature enhancement operation are each processed by 12 Transform Blocks. Before the processing of the 1st, 5th and 9th layers, each of the two branches performs its own transfer enhancement from CNN local features to ViT global features. After the processing of the 4th, 8th and 12th layers, the transfer enhancement from ViT global features to CNN local features is performed. After the processing of the 5th and 9th layers, a cross-attention module is embedded to establish a feature interaction channel between the original face image branch and the high-frequency noise residual feature image branch, through which the feature interaction between the two branches is realized.
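
The ordering of operations across the 12 layers can be written out as a schedule. The sketch follows the three groups [1, 2, 3, 4], [5, 6, 7, 8] and [9, 10, 11, 12] of the bidirectional feature enhancement, with injection before the first layer of each group, extraction after the last, and cross-branch attention after layers 5 and 9. The function name and the step labels are hypothetical.

```python
def layer_schedule(num_layers=12, group_size=4, cross_after=(5, 9)):
    """Return the ordered list of (operation, layer) steps for one forward pass."""
    steps = []
    for layer in range(1, num_layers + 1):
        if (layer - 1) % group_size == 0:
            steps.append(("inject", layer))           # CNN local -> ViT global
        steps.append(("block", layer))                # the Transform Block itself
        if layer % group_size == 0:
            steps.append(("extract", layer))          # ViT global -> CNN local
        if layer in cross_after:
            steps.append(("cross_attention", layer))  # interaction between branches
    return steps

schedule = layer_schedule()
for op, layer in schedule[:6]:
    print(op, layer)
```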

[0141] Specifically, in this embodiment, as shown in Figure 5, a bimodal cross-attention mechanism module is constructed, aiming to establish the association between the two input modalities (i.e., the original face image and the high-frequency noise residual feature image) through an attention mechanism. A cross-modal self-attention mechanism is introduced to capture and fuse the interaction information between the two different modalities.

[0142] In the computation, the original face image x and the high-frequency noise residual feature image y are treated as two keys. First, a fully connected layer projects these two keys into a shared low-dimensional space, and then matrix operations are used to calculate their similarity, i.e., energy. Based on this energy value, the attention weight matrix of x on y and the attention weight matrix of y on x are calculated respectively.

[0143] Next, x and y are mapped to a new feature space through a fully connected layer. Using the previously calculated attention weight matrices, the weighted output of y on x and the weighted output of x on y are calculated respectively. These weighted outputs are then weighted and summed with the original inputs to obtain the final outputs $x'$ and $y'$. The following takes the calculation of $x'$ as an example; $y'$ is calculated in the same way.

[0144] The energy calculation formula is as follows:

[0145] $$E = f_x(x)\, f_y(y)^{\top} \tag{19}$$

[0146] In the formula, $f_x(x)$ and $f_y(y)$ denote the projections of the input features x and y into the same low-dimensional space, and $E$ represents the attention energy matrix obtained by calculating the matching relationship between the two modalities x and y.

[0147] The formula for calculating the attention weight matrix is as follows:

[0148] $$A_{y \to x} = \mathrm{Softmax}\bigl( \mathrm{FC}(E) \bigr) \tag{20}$$

[0149] In the formula, $A_{y \to x}$ represents the attention weight matrix of y on x, i.e., which positions of y each position of x should attend to when x is enhanced. $\mathrm{Softmax}$ is the normalization that converts the result into a probability distribution, and $\mathrm{FC}$ is a fully connected layer that performs an additional learned transformation on the energy matrix $E$.

[0150] The weighted output calculation formula is as follows:

[0151] $$x' = x + \gamma \cdot A_{y \to x} V_y \tag{21}$$

[0152] In the formula, $V_y$ represents the Value features of modality y after projection, which are used to pass information about y to x; $A_{y \to x}$ is the attention weight matrix of y on x; and $\gamma$ is a learnable scalar parameter that controls the influence strength of the features fused from y.

[0153] $x'$ represents the enhanced x obtained by fusing information from y, and $y'$ represents the enhanced y obtained by fusing information from x. The bimodal cross-attention mechanism module performs feature interaction only after the original face image (x) and the high-frequency noise residual feature image (y) have been processed by layers 5 and 9 of the 12-layer Transform Block. The results of the feature interaction, $x'$ and $y'$, are then used as the new x and y and sent to the next Transform Block for further processing.
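
A compact NumPy sketch of the bimodal cross-attention computation (projection to a shared space, energy matrix, transformed and normalised attention weights, weighted Value features, scaled residual sum) is given below. All parameter names, shapes and the use of the transposed energy matrix for the reverse direction are assumptions made for illustration; the text does not fix them.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def bimodal_cross_attention(x, y, p):
    """x, y: [B, N, D] token sequences of the two branches."""
    fx, fy = x @ p["proj_x"], y @ p["proj_y"]          # shared low-dimensional space
    energy = fx @ fy.transpose(0, 2, 1)                # [B, N, N] matching scores
    attn_y_on_x = softmax(energy @ p["fc_xy"])         # each row: where x attends in y
    attn_x_on_y = softmax(energy.transpose(0, 2, 1) @ p["fc_yx"])
    vx, vy = x @ p["val_x"], y @ p["val_y"]            # projected Value features
    x_new = x + p["gamma_x"] * (attn_y_on_x @ vy)      # y's information passed to x
    y_new = y + p["gamma_y"] * (attn_x_on_y @ vx)      # x's information passed to y
    return x_new, y_new

rng = np.random.default_rng(0)
n, d, d_low = 197, 32, 16
params = {
    "proj_x": rng.standard_normal((d, d_low)) * 0.1,
    "proj_y": rng.standard_normal((d, d_low)) * 0.1,
    "fc_xy": rng.standard_normal((n, n)) * 0.05,
    "fc_yx": rng.standard_normal((n, n)) * 0.05,
    "val_x": rng.standard_normal((d, d)) * 0.1,
    "val_y": rng.standard_normal((d, d)) * 0.1,
    "gamma_x": 0.5,
    "gamma_y": 0.5,
}
x = rng.standard_normal((2, n, d))
y = rng.standard_normal((2, n, d))
x_new, y_new = bimodal_cross_attention(x, y, params)
```

With the scalar set to 0 the module reduces to the identity, so the strength of the cross-branch influence is controlled entirely by that single parameter.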

[0154] Understandably, this model pays particular attention to subtle differences, as the real and tampered areas of a forged image often exhibit different noise patterns. By computing the interaction information between the two modalities, the model can capture these differences and effectively identify forgery traces. By integrating information from both modalities, the model can acquire a more comprehensive feature representation, thereby enhancing its ability to discriminate forged images. When processing images with subtle forgery traces, this attention mechanism significantly improves the model's sensitivity, enabling it to more accurately identify forged content.

[0155] S52: After obtaining the features of the two branches after processing by 12 Transform Blocks, the features of the two branches are first concatenated to form joint features. Then, a fully connected layer is used to perform preliminary fusion of the concatenated features. Next, the preliminary fused features are sent to the channel attention module for weighted processing to obtain the final fused features.

[0156] Specifically, in this embodiment, a dual-modal feature fusion module is constructed to fuse the original face image data with the high-frequency noise residual feature image data, and to provide richer and more discriminative feature information for the subsequent classification task by enhancing the feature representation. The process of the feature fusion module is shown in Figure 6.

[0157] In the computational process, the original image x is first concatenated with the high-frequency noise residual feature image y along the feature dimension to form a joint feature map. Then, a fully connected layer is applied for feature fusion, transforming the concatenated feature map into a new, more compact feature representation through a linear transformation. To accelerate the model training process and ensure learning stability, batch normalization is performed on the output of the fully connected layer. Next, the ReLU activation function is introduced to map features non-linearly, enabling the network to learn more complex feature representations.

[0158] After the above processing, the fused feature map is fed into the channel attention module for further weighting. This module uses an attention mechanism to dynamically adjust the features according to the importance of each channel, highlighting key information and suppressing irrelevant features. Finally, a weighted feature representation is obtained, which retains important features while reducing redundant information, providing strong support for subsequent classification tasks. The specific calculation is as follows.

[0159] The formula for feature splicing calculation is as follows:

[0160] $$F_{\mathrm{cat}} = \mathrm{Concat}(x, y) \tag{22}$$

[0161] In the formula, x and y represent the final image features obtained after the original image and the high-frequency noise residual feature image have each passed through 12 Transform Blocks; $\mathrm{Concat}(\cdot)$ indicates that x and y are concatenated; and $F_{\mathrm{cat}}$ represents the result of concatenating x and y.

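[0162] The calculation formula for the weighted features $F_{w}$ is as follows:

[0163] $$F_{w} = \mathrm{ReLU}\Bigl( \mathrm{BN}\bigl( \mathrm{Linear}(F_{\mathrm{cat}}) \bigr) \Bigr) \tag{23}$$

[0164] In the formula, $\mathrm{Linear}(\cdot)$ represents a linear layer. After x and y are concatenated, the channel dimension is doubled; the linear layer projects the concatenated result $F_{\mathrm{cat}}$ back to the same channel dimension as x and y. $\mathrm{BN}$ and $\mathrm{ReLU}$ denote the batch normalization and the activation function applied to the output of the linear layer.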

[0165] The formula for calculating channel attention fusion is as follows:

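[0166] $$F_{\mathrm{out}} = \mathrm{CA}(F_{w}) \odot F_{w} \tag{24}$$

[0167] In the formula, $\mathrm{CA}(F_{w})$ represents the attention weights obtained through the channel attention module, $\odot$ denotes channel-wise multiplication, and $F_{\mathrm{out}}$ is the final output result.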
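
The fusion steps (concatenation, linear projection, batch normalization, ReLU, channel attention weighting) can be sketched as follows. This is a simplified illustration: the features are assumed to be one vector per image of shape [B, D], batch normalization uses batch statistics without learned scale and shift, and the channel attention module is assumed to be a small bottleneck MLP with a sigmoid, since its internals are not detailed here.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fuse_branches(x, y, w_lin, b_lin, w1, w2, eps=1e-5):
    """x, y: [B, D] final features of the two branches; returns [B, D]."""
    f_cat = np.concatenate([x, y], axis=-1)              # [B, 2D], channels doubled
    f_lin = f_cat @ w_lin + b_lin                        # back to [B, D]
    # Batch normalization with batch statistics (no learned scale/shift here).
    f_bn = (f_lin - f_lin.mean(axis=0)) / np.sqrt(f_lin.var(axis=0) + eps)
    f_w = np.maximum(f_bn, 0.0)                          # ReLU
    # Assumed channel attention: bottleneck MLP followed by a sigmoid.
    weights = sigmoid(np.maximum(f_w @ w1, 0.0) @ w2)    # [B, D], values in (0, 1)
    return weights * f_w

rng = np.random.default_rng(0)
b, d, r = 4, 32, 4                                       # r: assumed reduction ratio
x = rng.standard_normal((b, d))
y = rng.standard_normal((b, d))
w_lin = rng.standard_normal((2 * d, d)) * 0.1
b_lin = np.zeros(d)
w1 = rng.standard_normal((d, d // r)) * 0.1
w2 = rng.standard_normal((d // r, d)) * 0.1
fused = fuse_branches(x, y, w_lin, b_lin, w1, w2)
```

The output has the same width as each input branch and would be passed on to the classification head.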

[0168] S53: Employs a multilayer perceptron classification head, performs real / fake prediction based on the final fused features, and outputs the final deepfake face image detection result.

[0169] In summary, this invention employs a dual-stream collaborative architecture of the original face image and high-frequency noise residual feature image, along with a multimodal feature fusion and attention-guided structural design, aiming to uncover potential forgery traces from different frequency domains. The model constructs a framework for identifying forgery traces through multimodal feature fusion and frequency domain enhancement techniques. Its core design lies in an improved high-frequency noise residual feature extraction module, employing a two-stage enhancement strategy: first, it uses an SRM filter bank to capture cross-scale high-frequency residuals (such as edge anomalies and texture discontinuities); then, it performs channel-independent dynamic gradient analysis using the Sobel gradient operator, combined with normalization constraints to suppress noise interference, generating a 3-channel high-frequency residual feature map. At the frequency domain level, a learnable high-frequency spatial filter (LearnableHFS) and a learnable high-frequency channel filter (LearnableHFC) are designed to suppress low-frequency components in the image space and low-frequency perturbations between channels, respectively. Finally, the original image and frequency domain enhancement features are fused through an SE attention mechanism to strengthen the response of forged regions. The model employs a dual-stream ViT architecture to process the original image and the high-frequency noise residual feature image in parallel. The ViT network utilizes a self-attention mechanism to model global dependencies and achieves dynamic complementarity between CNN local details and ViT global semantics through bidirectional feature interaction modules Injector and Extractor: the former injects CNN features into ViT to enhance the modeling of forged details, while the latter uses ViT semantics to optimize CNN texture perception. 
In the cross-modal decision-making stage, the energy matrices of the original image and high-frequency modalities are calculated through cross-modal self-attention, complementary information is weighted and fused, and key features are selected using channel attention (CA module). Finally, the detection result is output through the MLP classification head. The overall network structure of the deepfake face image detection model is shown in Figure 7.

[0170] The advantages of this model lie in its high-frequency sensitivity and multimodal collaboration: two-stage residual extraction locates edges and texture anomalies, spatial-frequency joint modeling effectively suppresses interference, dual-stream ViT and bidirectional interaction take into account both global context and local forgery traces, and the cross-attention mechanism strengthens intermodal collaborative discrimination.

[0171] Based on the detailed implementation described above, the effectiveness of the technical solution of this invention is demonstrated through experiments. Comparative experiments were conducted on the four tampering methods (Deepfakes, Face2Face, FaceSwap, and NeuralTextures) of the c23 version of the FF++ dataset. In each experiment, the training set and the test set used data produced by the same tampering method. The experimental results are shown in Table 1.

[0172] Table 1. Test results of data with different tampering methods in the FF++ (c23) dataset.

[0173]

[0174] This experiment uses accuracy (ACC) to evaluate the model's detection performance. Experimental results show that the accuracy of this invention can reach over 90%, and is superior to existing mainstream models.

[0175] Furthermore, as shown in Figure 8, to visualize the detection performance of this invention, a confusion matrix was plotted based on the training and testing of the model on different forgery methods. The confusion matrix characterizes the model's ability to distinguish between positive and negative samples. From the labels of the horizontal and vertical axes of the confusion matrix and the values in its four regions, we can derive the true positives (forged images identified as forged), true negatives (real images identified as real), false positives (real images identified as forged), and false negatives (forged images identified as real). The confusion matrix shows that true positives and true negatives constitute the majority of samples, with only a very small percentage being false positives and false negatives.
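
For reference, the four confusion-matrix counts and the accuracy metric relate as in the following sketch. The label lists are made-up values for illustration only and are not experimental data; forged images are treated as the positive class, as in the description above.

```python
def confusion_counts(y_true, y_pred):
    """Labels: 1 = forged (positive), 0 = real (negative)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)  # forged identified as forged
    tn = sum(t == 0 and p == 0 for t, p in pairs)  # real identified as real
    fp = sum(t == 0 and p == 1 for t, p in pairs)  # real identified as forged
    fn = sum(t == 1 and p == 0 for t, p in pairs)  # forged identified as real
    return tp, tn, fp, fn

def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)

# Made-up labels, purely illustrative.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 1, 0, 0, 0, 1, 1, 0]
tp, tn, fp, fn = confusion_counts(y_true, y_pred)
acc = accuracy(tp, tn, fp, fn)
print(tp, tn, fp, fn, acc)  # 3 3 1 1 0.75
```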

[0176] The above description is merely an illustration of preferred embodiments of the present invention and the technical principles employed. The scope of protection of the present invention is not limited thereto. Those skilled in the art can make various corresponding changes and modifications based on the above technical solutions within the scope of the technology disclosed in the present invention, but all such changes should be covered within the scope of protection of the present invention.

Claims

1. A deepfake face detection method based on a hybrid model of two-stream CNN and ViT, characterized in that, Includes the following steps: S1: Preprocess real and fake videos to obtain original face images; S2: Improve the high-frequency noise residual feature extraction module of the traditional SRM filter to obtain an improved high-frequency noise residual feature extraction module. Based on the improved high-frequency noise residual feature extraction module, obtain the high-frequency noise residual features of the original face image to obtain a high-frequency noise residual feature image. The improvements include: Establish a cascaded structure of SRM filter bank and gradient operator; A channel-by-channel gradient calculation strategy is adopted, and the Sobel operator is processed independently for each feature channel; A gradient magnitude normalization operation is introduced, and the dynamic range of high-frequency features is controlled by a differentiable boundary constraint function. 
S3: Perform initial feature enhancement operations on the original face image and the high-frequency noise residual feature image based on a dual-stream CNN; wherein, the initial feature enhancement operations are, in sequence, multi-scale feature extraction, frequency domain enhancement, feature dimension alignment, and multi-scale feature fusion; S4: Based on dual-stream ViT, a second feature enhancement operation is performed on the original face image and the high-frequency noise residual feature image after the first feature enhancement operation, respectively; wherein, the second feature enhancement operation is parallel encoding and bidirectional feature enhancement in sequence; S5: The features of the original face image and the high-frequency noise residual feature image after the secondary feature enhancement operation are fused, and the true and false predictions of the fused features are performed based on the multilayer perceptron classification head, and the detection results are output. The secondary feature enhancement operation consists of parallel encoding and bidirectional feature enhancement, specifically: Parallel encoding: First, the original face image after the first feature enhancement operation and the high-frequency noise residual feature image are respectively segmented into image block sequences. Then, category tokens and relative position codes are added to the sequences, and the parallel encoded image block sequences are sent into 12-layer Transform Block processing. After processing by the 5th and 9th Transform Blocks, a cross-attention module is embedded to establish a feature interaction channel between the original face image branch and the high-frequency noise residual feature image branch. The feature interaction between the two branches is realized through the cross-attention module. Bidirectional Feature Enhancement: The 12-layer Transform Block is divided into three groups: [1, 2, 3, 4], [5, 6, 7, 8], and [9, 10, 11, 12]. 
Before feeding the parallel encoded image block sequence into the Transform Block layer of each group, ViT features are used as the Query and CNN multi-scale convolutional features are used as the Key-Value. Multi-head attention mechanism is used to enhance the transfer of CNN local features to ViT global features. After all Transform Block layers in the group have been processed, CNN multi-scale convolutional features are used as the Query and ViT features are used as the Key-Value. Multi-head attention mechanism is used to enhance the transfer of ViT global features to CNN local features.

2. The deep fake face detection method based on a hybrid model of two-stream CNN and ViT as described in claim 1, characterized in that, The preprocessing includes video frame extraction, training and test set division, face region localization, image size standardization, tensor format conversion, and normalization.

3. The deep fake face detection method based on a hybrid model of two-stream CNN and ViT as described in claim 1, characterized in that, The step of obtaining the high-frequency noise residual from the original face image to obtain the high-frequency noise residual feature image includes: The original face image is processed using a traditional SRM filter to obtain an initial high-frequency noise residual feature image. By combining the Sobel gradient operator, gradient calculation is performed on each channel of the initial high-frequency noise residual feature image, and the region with a gradient greater than a preset threshold is extracted as the final high-frequency noise residual feature image corresponding to the original face image.

4. The deep fake face detection method based on a hybrid model of two-stream CNN and ViT as described in claim 1, characterized in that, The initial feature enhancement operation consists of multi-scale feature extraction, frequency domain enhancement, feature dimension alignment, and multi-scale feature fusion, specifically: Multi-scale feature extraction: The original face image and the high-frequency noise residual feature image are respectively input into a dual-stream CNN branch. Each branch passes through a four-level convolutional layer ensemble module in sequence to obtain four feature maps with decreasing resolution. Frequency domain enhancement: Learnable spatial frequency domain filters and learnable channel frequency domain filters are constructed using Fast Fourier Transform. The learnable high-frequency spatial filters and the learnable high-frequency channel filters are used to extract the feature maps of high-frequency spatial information and cross-channel high-frequency anomalies of the four feature maps of the original face image, as well as the high-frequency noise residual features of the high-frequency noise residual feature image at the corresponding scale level. At each scale level, the feature maps of high-frequency spatial information and cross-channel high-frequency anomalies of the original face image are added together and then processed using the SE attention mechanism to obtain an attention weight. The result of the addition is multiplied by the attention weight and then added to the high-frequency noise residual features of the high-frequency noise residual feature image at the current scale level to achieve fusion. 
Feature Dimension Alignment: Feature mapping is performed on the features of the last three resolutions, channel expansion and spatial compression are implemented to convert them into 768-dimensional vectors; at the same time, the four-dimensional tensor format is reconstructed into a three-dimensional tensor to adapt to the feature format requirements of the ViT network. Multi-scale feature fusion: The image features of the last three resolutions of the two branches in three-dimensional tensor format are formed into multi-scale feature blocks by cross-layer stitching.

5. The deep fake face detection method based on a hybrid model of two-stream CNN and ViT as described in claim 4, characterized in that, The construction of the learnable high-frequency spatial filter includes: First, the input image tensor is transformed from the spatial domain to the frequency domain using a two-dimensional fast Fourier transform, and the zero frequency point in the frequency domain is transferred to the center of the spectrum. Secondly, a learnable parameter is defined as the standard deviation of the Gaussian function, which is used to construct a low-pass mask for the shape of the two-dimensional Gaussian function. Then, the low-pass mask in the shape of a two-dimensional Gaussian function is inverted into a high-pass mask, with the center value being 0 and the surrounding values being 1. The frequency domain image obtained by the two-dimensional fast Fourier transform is then multiplied element-wise with the high-pass mask. Finally, the frequency domain signal after element-wise multiplication is converted back to the image spatial domain using the inverse Fourier transform. The real part of the complex result is obtained after the inverse Fourier transform and the ReLU activation function is applied. The final output is a feature map containing only high-frequency spatial information.

6. The deep fake face detection method based on a hybrid model of two-stream CNN and ViT as described in claim 4, characterized in that, The construction of the learnable high-frequency channel filter includes: First, the channel dimension of the image tensor is processed using a one-dimensional fast Fourier transform, and the zero frequency point of the channel dimension is transferred to the center. Secondly, a learnable parameter is defined to construct a low-pass mask with a one-dimensional Gaussian function shape. Then, the low-pass mask with the shape of a one-dimensional Gaussian function is inverted into a high-pass mask, and the channel frequency domain signal obtained by the one-dimensional fast Fourier transform is multiplied element by element with the high-pass mask. Finally, the inverse Fourier transform is used to convert the channel frequency domain signal after element-wise multiplication back to the original channel representation. The real part of the complex result after the inverse Fourier transform is obtained and the ReLU activation function is applied. The final output is a feature map containing cross-channel high-frequency anomalies.

7. The deep fake face detection method based on a hybrid model of two-stream CNN and ViT as described in claim 1, characterized in that, Specifically, S5 is: S51: After the original face image and the high-frequency noise residual feature image that have undergone the first feature enhancement operation are processed by the 12 layers of Transform Blocks to obtain the features of the two branches after the second feature enhancement operation, the features of the two branches are first concatenated to form joint features. Then, a fully connected layer is used to perform preliminary fusion of the concatenated features. Next, the preliminary fused features are sent to the channel attention module for weighted processing to obtain the final fused features. S52: Employs a multilayer perceptron classification head, performs real / fake prediction based on the final fused features, and outputs the final deepfake face image detection result.