Deepfake detection method based on active high-frequency texture stripping and difference graph reasoning
Patent Information
- Authority / Receiving Office
- CN · China
- Patent Type
- Applications(China)
- Current Assignee / Owner
- NANJING UNIV OF INFORMATION SCI & TECH
- Filing Date
- 2026-05-11
- Publication Date
- 2026-06-12
AI Technical Summary
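[0003] However, these spatial-domain methods have a fundamental flaw: they struggle to effectively decouple an image's semantic content from tampering traces, so the model tends to overfit severely to the domain semantics of a specific training set rather than learning general forgery fingerprints.
[0032] 1. Detection robustness and accuracy: this invention does not rely on fragile single-point pixels or static frequency statistics; instead, it builds local response differences into a global graph topology, which can tolerate the loss of local artifacts;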
Figure CN122199525A_ABST
Abstract
Description
Technical Field
[0001] This invention belongs to the fields of information security and artificial intelligence technology, and specifically relates to a deepfake detection method based on active high-frequency texture stripping and difference graph reasoning, which is applied to address the threat of facial visual forgery brought about by generative artificial intelligence (AIGC).
Background Technology
[0002] The core of deepfake detection technology lies in uncovering the differences between real and fake samples across various feature dimensions. Early research primarily relied on passive detection based on spatial features, focusing on traces such as splicing artifacts, unnatural smoothness of local textures, and resolution inconsistencies in images, and using mainstream image classification networks for binary classification training. To improve the model's ability to capture subtle forgery traces, subsequent research introduced texture enhancement modules and attention mechanisms to guide the network to focus on frequently altered areas such as facial features and facial edges. Simultaneously, methods such as Self-Blended Images (SBI) attempted to improve the model's generalization performance by actively simulating forgery boundaries during the training phase.
[0003] However, these spatial domain-based methods suffer from a fundamental flaw: they struggle to effectively decouple the semantic content of an image from tampering traces, leading to severe overfitting of the model to the domain semantics of a specific training set, rather than learning a general forgery fingerprint. Furthermore, these methods heavily rely on the apparent high-frequency information of the image, making the learned synthetic boundaries inherently fragile. When faced with post-processing in real-world scenarios such as video compression, blurring, or noise addition, these spatial artifacts are easily disrupted, resulting in a significant decrease in the robustness of the detection model.
[0004] To address the limitations of spatial domain detection, detection methods based on frequency domain analysis have gradually become a research hotspot. Generative models such as Generative Adversarial Networks (GANs) introduce spectral truncation or anomalous high-frequency attenuation during upsampling, which manifests as unique spectral fingerprints in the frequency domain. Existing two-stream network architectures like F3Net attempt to utilize frequency-aware decomposition and local frequency statistics to capture the differences in frequency distribution and phase structure between real and fake faces.
[0005] Nevertheless, existing frequency domain methods primarily treat frequency components as "static statistical features," completely ignoring the dynamic stability of these features under external perturbations. With the rapid iteration of generative techniques, new generative models generate images with increasingly smooth spectral features, capable of highly simulating the spectral distribution of real faces. Meanwhile, common image and video compression algorithms inherently discard high-frequency information. This makes methods that solely rely on extracting static statistical features in the frequency domain unable to capture the inherent consistency differences between real and fake faces, and struggle to maintain stable detection performance in complex scenes and against advanced forgery techniques.
[0006] In processing video-level samples, existing techniques typically rely on spatiotemporal consistency, utilizing recurrent neural networks (RNNs) or Transformer architectures to capture unnatural jitter in fake faces between frames, or combining graph convolutional networks (GCNs) to construct graph structures from facial keypoints for information fusion across different regions. Some studies even incorporate audio, using the synchronization between lip movements and speech signals to aid in discrimination. However, extracting spatiotemporal features not only incurs high computational costs but also depends heavily on the length and quality of the video sequence. Existing graph network methods are mostly limited to simple feature fusion, ignoring the representational differences between semantic priors and texture details in the graph topology and failing to truly validate the inherent physiological consistency of the global structure. More importantly, video-level detection still rests on the discriminative power of frame-level features. If the low-level feature representation of a single frame is not robust enough, or the decoupling is not thorough enough, the effectiveness of spatiotemporal consistency detection or global topological inference will be significantly reduced.
[0007] In summary, although existing deepfake detection methods (whether in the spatial, frequency, or spatiotemporal domains) have made some progress, they are fundamentally limited to "passively mining" the "static artifacts" left by the generative model. This traditional paradigm of passively extracting static features inevitably leads to a generalization bottleneck. Faced with a new generation of forgery techniques that are excessively smooth, high-fidelity, and capable of cross-domain evolution, the limitations of these methods become increasingly apparent. Therefore, breaking away from the mindset of passive mining and exploring new detection paradigms that can actively apply perturbations and reveal the inherent consistency differences between real and forged samples under dynamic responses has become crucial to overcoming current generalization barriers and improving the robustness of deepfake detection.
Summary of the Invention
[0008] Purpose of the invention: The purpose of this invention is to address the shortcomings of existing technologies and provide a deepfake detection method based on active high-frequency texture stripping and difference graph inference, aiming to improve the generalization ability and robustness of deepfake detection in cross-domain and cross-generator scenarios.
[0009] Technical Solution: The present invention discloses a deepfake detection method based on active high-frequency texture stripping and difference graph inference. A deepfake detection network is constructed and trained to detect video frames. The deepfake detection network includes a heterogeneous feature capture module, a high-frequency texture stripping module, and a difference graph inference module. After the video frame sequence is input into the deepfake detection network, the following steps are performed:
[0010] S1, the heterogeneous feature capture module extracts multidimensional heterogeneous features of the current image frame, and then concatenates and fuses these features to obtain a comprehensive feature vector. Then, the dynamic difference between the comprehensive feature vectors of two adjacent frames is calculated using a heterogeneity score function, and the mutation keyframe $I_{in}$ is selected based on the heterogeneity scores of the image frames;
[0011] S2, for the mutation keyframe $I_{in}$, the high-frequency texture stripping module first uses a wavelet transform to reconstruct the high-frequency residual map $I_{high}$. Then, spatial domain ROI masking is performed to generate the region of interest (ROI) mask M. Finally, the original keyframe $I_{in}$, the high-frequency residual map $I_{high}$, and the ROI mask M are linearly weighted, so that high-frequency textures in the keyframe are actively stripped to generate a perturbation sample, i.e., the high-frequency stripped sample $I_{out}$;
[0012] S3, the difference graph reasoning module uses a dual-stream network branch to extract features from the original keyframe and the high-frequency stripped sample respectively. The features are subtracted and the difference features are transformed into a graph structure. Global physiological consistency topological reasoning is performed through a graph convolutional network (GCN) to finally output the real/fake classification result.
[0013] Furthermore, the multidimensional heterogeneous features include Canny edge features, Local Binary Pattern (LBP) features, and Discrete Cosine Transform (DCT) coefficients.
[0014] Further, the heterogeneity score mentioned in step S1 is calculated using the following formula:
[0015] $$S(t) = \left(1 - \frac{F_t \cdot F_{t+1}}{\lVert F_t \rVert_2 \, \lVert F_{t+1} \rVert_2}\right) + \beta \, \lVert F_t - F_{t+1} \rVert_1$$;
[0016] where β is the equilibrium hyperparameter, with a value of 0.1, and $F_t$ and $F_{t+1}$ are the comprehensive feature vectors of two adjacent frames; the image frame with the largest heterogeneity score S(t) is selected as the final mutation keyframe $I_{in}$.
[0017] Furthermore, step S2 is implemented as follows:
[0018] Frequency domain high-frequency separation: a Haar discrete wavelet transform (Haar DWT) is applied to the keyframe $I_{in}$ for multi-band decomposition, and an inverse discrete wavelet transform (IDWT) is performed using only the high-frequency components in the horizontal (LH), vertical (HL), and diagonal (HH) directions to reconstruct a high-frequency residual map $I_{high}$ that includes only fine-grained details;
[0019] Generate a spatial domain ROI mask: use a face keypoint detector to extract facial contours, generate an initial binary mask, and perform morphological dilation on the initial mask to generate a region of interest (ROI) mask $M$ covering the core face and transition boundaries;
[0020] Generate the stripped sample by linear weighting: an adaptive intensity factor α is introduced, and linear weighted subtraction is performed on the original keyframe $I_{in}$, the ROI mask $M$, and the high-frequency residual map $I_{high}$ to generate the high-frequency stripped sample $I_{out}$:
[0021] $I_{out} = I_{in} - \alpha \cdot M \odot I_{high}$.
[0022] Furthermore, the adaptive intensity factor α is set to 0.5.
[0023] Furthermore, step S3 is implemented as follows:
[0024] Dual-stream feature extraction and fusion: the original keyframe $I_{in}$ and the high-frequency stripped sample $I_{out}$ are input separately into a two-stream network with shared weights; the two-stream network includes a DINOv2 visual model with frozen pre-trained parameters and a lightweight CNN; an adaptive feature fusion (AFF) module driven by a channel-level attention mechanism aligns and fuses the semantic features output by DINOv2 with the texture features output by the CNN to obtain the original feature map $M_o$ and the stripped feature map $M_p$;
[0025] Difference map generation and graph structure flattening: the original feature map $M_o$ and the stripped feature map $M_p$ are subtracted and the absolute value is taken to obtain an explicit absolute difference feature map; its spatial dimensions are flattened into a sequence of N nodes, each representing a local facial patch;
[0026] Dynamic adjacency matrix construction: a self-attention mechanism is introduced to dynamically calculate the semantic association strength between nodes and construct an adaptive adjacency matrix A:
[0027] $A = \mathrm{softmax}\left(\frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d}}\right)$;
[0028] where $X$ denotes the node feature sequence, $W_Q$ and $W_K$ are the learnable projection matrices for the Query and Key, respectively, and $\sqrt{d}$ is the scaling factor;
[0029] Graph convolutional reasoning and classification: based on the adjacency matrix A, graph convolution aggregates the non-local difference information between nodes: $H = A X W$, where $W$ is the weight matrix;
[0030] The attention pooling layer assigns higher weights to nodes with unusually high values, aggregates them into a compact graph representation vector, and feeds it into a multilayer perceptron classifier (MLP) to output the final true or false probability.
[0031] Beneficial effects: Compared with the prior art, the beneficial effects of the present invention are as follows:
[0032] 1. In terms of robustness and accuracy, this invention does not rely on fragile single-point pixels or static frequency statistics, but instead constructs a global graph topology structure by local response differences, which can tolerate the loss of local artifacts.
[0033] 2. Cross-domain generalization ability: This invention breaks the mindset of passively mining artifacts and creates a new paradigm of active perturbation-response detection, which completely solves the problem that the model is prone to overfitting to a specific generator.
[0034] 3. The high-frequency texture stripping, graph reasoning, and frozen semantic prior of this invention complement each other and are each indispensable, together forming a moat for detection capabilities.
Description of the Drawings
[0035] Figure 1 is a schematic diagram of the deepfake detection network architecture proposed in this invention;
[0036] Figure 2 is a schematic diagram of the high-frequency texture stripping module structure proposed in this invention;
[0037] Figure 3 is a schematic diagram of the difference graph reasoning module structure proposed in this invention.
Detailed Implementation
[0038] The present invention will now be described in further detail with reference to the accompanying drawings.
[0039] As shown in Figure 1, this invention proposes a deepfake detection method based on active high-frequency texture stripping and difference graph inference. A deepfake detection network is constructed and trained to detect video frames. The deepfake detection network includes a heterogeneous feature capture module, a high-frequency texture stripping module, and a difference graph inference module. After the video frame sequence is input into the deepfake detection network, the following steps are performed:
[0040] Step 1: The heterogeneous feature capture module extracts multidimensional heterogeneous features of the current image frame, and then concatenates and fuses them to obtain a comprehensive feature vector. Then, the dynamic difference between the comprehensive feature vectors of two adjacent frames is calculated using a heterogeneity score function, and the most discriminative mutation keyframe $I_{in}$ is selected based on the heterogeneity scores of the image frames.
[0041] Since deepfake videos often have tiny structural or spectral abrupt changes between frames, the heterogeneous feature capture module acts as a temporal filter. It aims to accurately locate key frames where the forgery features change without relying on frame-by-frame intensive computation, thereby significantly reducing computational redundancy and providing the most discriminative samples for subsequent modules.
[0042] For each frame of the input video stream or image group, its multidimensional heterogeneous features are extracted. These features include Canny edge features for capturing structural discontinuities, Local Binary Pattern (LBP) features for encoding local texture distortions, and Discrete Cosine Transform (DCT) coefficients for detecting spectral anomalies invisible to the naked eye. These three features are concatenated and fused to construct the comprehensive feature vector $F_t$ of the current frame.
[0043] A heterogeneity score function $S(t)$ is constructed to quantify the dynamic difference between the comprehensive feature vectors $F_t$ and $F_{t+1}$ of two adjacent frames, calculated using the following formula:
[0044] $$S(t) = \left(1 - \frac{F_t \cdot F_{t+1}}{\lVert F_t \rVert_2 \, \lVert F_{t+1} \rVert_2}\right) + \beta \, \lVert F_t - F_{t+1} \rVert_1$$;
[0045] In this formula, the first term uses cosine distance to measure the directional deviation (structural offset) in the feature space, the second term uses the L1 norm to measure the amplitude change (sudden appearance or disappearance of artifacts), and β is the equilibrium hyperparameter (preferably 0.1).
[0046] The image frame at which the heterogeneity score S(t) reaches its peak (maximum value) is selected as the final keyframe $I_{in}$.
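The scoring and selection in Step 1 can be illustrated with a minimal NumPy sketch. It assumes the per-frame comprehensive feature vectors have already been stacked into a (T, D) array; the function names `heterogeneity_scores` and `select_keyframe` are hypothetical, and returning index t (rather than t+1) for the peak pair is an assumption about how the keyframe is picked.

```python
import numpy as np

def heterogeneity_scores(features, beta=0.1, eps=1e-8):
    """Score each adjacent frame pair: (1 - cosine similarity) + beta * L1 distance.

    features: array of shape (T, D), one comprehensive feature vector per frame.
    Returns an array of shape (T - 1,); entry t compares frames t and t + 1.
    """
    a, b = features[:-1], features[1:]
    dot = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    cosine_distance = 1.0 - dot / (norms + eps)   # directional deviation
    l1_distance = np.sum(np.abs(a - b), axis=1)   # amplitude change
    return cosine_distance + beta * l1_distance

def select_keyframe(features, beta=0.1):
    """Return the index t with the largest score (assumed convention) and all scores."""
    scores = heterogeneity_scores(features, beta)
    return int(np.argmax(scores)), scores

# Toy sequence: the feature direction flips between frames 1 and 2.
toy = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
index, scores = select_keyframe(toy)
```

Because only one pass over adjacent pairs is needed, the cost of this step grows linearly with the number of frames.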
[0047] Step 2: As shown in Figure 2, for the keyframe $I_{in}$, the high-frequency texture stripping module first uses a wavelet transform to reconstruct the high-frequency residual map $I_{high}$. Then, spatial domain ROI masking is performed to generate the region of interest (ROI) mask M. Finally, the original keyframe $I_{in}$, the high-frequency residual map $I_{high}$, and the ROI mask M are linearly weighted, so that high-frequency textures in the keyframe are actively stripped to generate a perturbation sample, i.e., the high-frequency stripped sample $I_{out}$.
[0048] Since real faces have rich high-frequency details (such as pores and hair), while fake faces often lack high-frequency information due to the upsampling smoothing constraint of the generation model, the high-frequency texture stripping module actively strips the high-frequency texture of key frames to simulate generation attacks, thereby triggering different response amplitudes in the feature space between real samples and fake samples.
[0049] First, high-frequency separation is performed in the frequency domain. A Haar discrete wavelet transform (Haar DWT) is applied to the keyframe $I_{in}$ for multi-band decomposition. The low-frequency approximation component LL, which contains the main image structure, is discarded, and only the high-frequency components in the horizontal (LH), vertical (HL), and diagonal (HH) directions are used in the inverse discrete wavelet transform (IDWT) to reconstruct a high-frequency residual map $I_{high}$ containing only fine-grained details.
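As a rough illustration of this decomposition, the sketch below implements a single-level orthonormal 2D Haar transform by hand in NumPy, zeroes the LL band, and inverts the transform. It is not the embodiment's code: the function names are hypothetical, a single-channel image with even height and width is assumed, and the labeling of the LH and HL bands follows one common convention that may differ from a given wavelet library.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level orthonormal 2D Haar transform of an even-sized 2D array."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]   # top-left, top-right of each 2x2 block
    c, d = x[1::2, 0::2], x[1::2, 1::2]   # bottom-left, bottom-right
    ll = (a + b + c + d) / 2.0            # low-frequency approximation
    lh = (a + b - c - d) / 2.0            # difference between rows
    hl = (a - b + c - d) / 2.0            # difference between columns
    hh = (a - b - c + d) / 2.0            # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def high_frequency_residual(image):
    """Discard LL and reconstruct from LH, HL, and HH only."""
    ll, lh, hl, hh = haar_dwt2(np.asarray(image, dtype=np.float64))
    return haar_idwt2(np.zeros_like(ll), lh, hl, hh)
```

A perfectly flat image has no detail coefficients, so its residual is all zeros; any local texture shows up as nonzero values in the residual.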
[0050] In the spatial domain branch, a spatial domain ROI mask is generated. Facial contours are extracted using a face landmark detector, generating an initial binary mask. To cover boundary artifacts near the chin and hairline, a morphological dilation operation is performed on the initial mask to generate a region of interest (ROI) mask $M$ covering the core face and transition boundaries.
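The dilation step can be sketched without an image-processing library as a maximum over shifted copies of the mask. This is a stand-in for a library dilation routine, not the embodiment's implementation; the function name is hypothetical, the 5×5 square kernel is taken from the embodiment described later, and the initial binary mask is assumed to be given (landmark detection is out of scope here).

```python
import numpy as np

def dilate_binary_mask(mask, kernel_size=5):
    """Binary dilation with a square kernel, computed as an OR over shifted copies."""
    pad = kernel_size // 2
    padded = np.pad(mask.astype(bool), pad, mode="constant", constant_values=False)
    h, w = mask.shape
    out = np.zeros((h, w), dtype=bool)
    for dy in range(kernel_size):
        for dx in range(kernel_size):
            # Each shifted view contributes the pixels reachable by one kernel offset.
            out |= padded[dy:dy + h, dx:dx + w]
    return out.astype(np.float32)

# A single foreground pixel grows into a 5x5 block.
seed = np.zeros((7, 7), dtype=np.float32)
seed[3, 3] = 1.0
grown = dilate_binary_mask(seed)
```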
[0051] Finally, the stripped sample is generated by linear weighting: an adaptive intensity factor α is introduced, and linear weighted subtraction is performed on the original keyframe $I_{in}$, the region of interest (ROI) mask $M$, and the high-frequency residual map $I_{high}$ to generate the high-frequency stripped sample $I_{out}$. The calculation formula is as follows:
[0052] $I_{out} = I_{in} - \alpha \cdot M \odot I_{high}$.
[0053] The parameter α can take values in the range of [0.2, 0.7], with the optimal value being 0.5, which achieves the best balance between destroying artifacts and preserving physiological structures.
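A minimal sketch of the masked, weighted subtraction follows. The function name and the channel-last (H, W, C) layout are assumptions made for illustration; α defaults to the 0.5 stated above, and no clipping is applied because the text does not mention any.

```python
import numpy as np

def strip_high_frequency(keyframe, residual, mask, alpha=0.5):
    """Subtract alpha * mask * residual from the keyframe; pixels outside the mask are untouched."""
    if mask.ndim == keyframe.ndim - 1:
        mask = mask[..., None]  # broadcast an (H, W) mask across the channel axis
    return keyframe - alpha * mask * residual

keyframe = np.ones((2, 2, 3))
residual = np.full((2, 2, 3), 0.4)
mask = np.array([[1.0, 0.0], [0.0, 1.0]])
stripped = strip_high_frequency(keyframe, residual, mask)
```

Inside the mask each pixel loses half of its high-frequency residual (1.0 - 0.5 × 0.4 = 0.8 in this toy case), while pixels outside the mask keep their original values.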
[0054] Step 3: As shown in Figure 3, the difference graph inference module uses a two-stream network branch to extract features from the original keyframe and the high-frequency stripped sample respectively. The features are subtracted and the difference features are transformed into a graph structure. Global physiological consistency topological reasoning is performed through a graph convolutional network (GCN) to finally output the real/fake classification result.
[0055] The difference graph reasoning module transforms signal-level perturbation differences into feature-level classification criteria. It extracts features from the original frame and the stripped frame through a two-stream network and calculates the difference. Then, it uses a graph convolutional network (GCN) to perform global topological reasoning to identify advanced forgery samples that are locally realistic but have inconsistent global structures.
[0056] Dual-stream feature extraction and fusion: a dual-stream network with shared weights is constructed, and the original keyframe $I_{in}$ and the high-frequency stripped sample $I_{out}$ are input separately. The two-stream network branch consists of a DINOv2 vision model with frozen pre-trained parameters (for extracting global semantic priors with strong generalization) and a lightweight CNN (for extracting local texture details).
[0057] Subsequently, the adaptive feature fusion module AFF (which locks when the swing hyperparameter reaches its optimal value) driven by the channel-level attention mechanism aligns and fuses the features of the two modules, yielding the original feature map $M_o$ and the stripped feature map $M_p$.
[0058] Difference map generation and graph structure flattening: the original feature map $M_o$ and the stripped feature map $M_p$ are subtracted and the absolute value is taken to obtain an explicit absolute difference feature map; its spatial dimensions are flattened into a sequence of N nodes, each representing a local facial patch.
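The difference-and-flatten step amounts to an absolute difference followed by a reshape and transpose. The sketch below assumes feature maps in (B, C, H, W) layout, as in the embodiment described later; the function name is hypothetical.

```python
import numpy as np

def difference_nodes(m_o, m_p):
    """Turn two (B, C, H, W) feature maps into (B, H*W, C) node features of |M_o - M_p|."""
    diff = np.abs(m_o - m_p)
    b, c, h, w = diff.shape
    # Each spatial location becomes one graph node carrying a C-dimensional feature.
    return diff.reshape(b, c, h * w).transpose(0, 2, 1)

rng = np.random.default_rng(0)
m_o = rng.random((2, 8, 16, 16))
m_p = rng.random((2, 8, 16, 16))
nodes = difference_nodes(m_o, m_p)   # 16 x 16 locations -> 256 nodes per sample
```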
[0059] A self-attention mechanism is introduced to dynamically calculate the semantic association strength between nodes, and an adaptive adjacency matrix A is constructed:
[0060] $A = \mathrm{softmax}\left(\frac{(X W_Q)(X W_K)^{\top}}{\sqrt{d}}\right)$;
[0061] where $X$ denotes the node feature sequence, $W_Q$ and $W_K$ are the learnable projection matrices for the Query and Key, respectively, and $\sqrt{d}$ is the scaling factor.
[0062] Graph convolutional reasoning and classification: based on the adjacency matrix A, graph convolution aggregates the non-local difference information between nodes: $H = A X W$, where $W$ is the weight matrix.
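The adjacency construction and aggregation can be sketched in NumPy as scaled dot-product attention followed by a chain matrix product. The function names are hypothetical, the weights are random placeholders rather than trained parameters, and the absence of a nonlinearity after the aggregation is an assumption.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_adjacency(x, w_q, w_k):
    """Dense adjacency from Query/Key projections of node features x with shape (B, N, C)."""
    q = x @ w_q                                   # (B, N, d)
    k = x @ w_k                                   # (B, N, d)
    d = q.shape[-1]
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    return softmax(scores, axis=-1)               # each row sums to 1

def graph_conv(x, adj, w):
    """Aggregate neighbor features with the adjacency, then project: A @ X @ W."""
    return adj @ x @ w

rng = np.random.default_rng(0)
x = rng.random((2, 4, 6))                         # 2 samples, 4 nodes, 6 channels
adj = attention_adjacency(x, rng.random((6, 3)), rng.random((6, 3)))
out = graph_conv(x, adj, rng.random((6, 5)))
```

Because the adjacency is recomputed from the node features of each input, the graph connectivity adapts per sample rather than being fixed in advance.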
[0063] The attention pooling layer assigns higher weights to nodes with unusually high values, aggregates them into a compact graph representation vector, and feeds it into a multilayer perceptron classifier (MLP) to output the final true or false probability.
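A hedged sketch of the pooling and classification head follows. The per-node importance is predicted by a single linear projection and, as an assumption, normalized with a softmax over the nodes; the classifier follows the Linear, ReLU, Linear layout given in the embodiment. All names and the random weights are illustrative only.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_pool(h, w_score):
    """Pool node features h (B, N, C) into one vector per sample via learned node weights."""
    logits = h @ w_score                      # (B, N) importance per node
    weights = softmax(logits, axis=1)         # normalization over nodes (assumed)
    pooled = np.einsum("bn,bnc->bc", weights, h)
    return pooled, weights

def mlp_classifier(g, w1, b1, w2, b2):
    """Linear -> ReLU -> Linear head producing two logits (real, fake)."""
    hidden = np.maximum(g @ w1 + b1, 0.0)
    return hidden @ w2 + b2

rng = np.random.default_rng(0)
h = rng.random((2, 5, 4))
pooled, weights = attention_pool(h, rng.random(4))
logits = mlp_classifier(pooled, rng.random((4, 3)), np.zeros(3), rng.random((3, 2)), np.zeros(2))
```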
[0064] To verify the feasibility and technical effectiveness of the present invention, this embodiment constructs and trains a deepfake detection network based on the PyTorch deep learning framework. It is assumed that the input video stream is extracted as an image sequence tensor, where T is the number of frames and H and W are both resized to 224.
[0065] Inter-frame abrupt changes are quickly calculated using basic operators. For frames t and t+1 in the sequence, the feature extraction operator first converts the RGB images to grayscale. The Canny edge operator (with low and high thresholds set to 100 and 200 respectively) is used to extract the binary edge map and flatten it into a one-dimensional vector. The LBP operator (with a radius of 1 and 8 sampling points) is used to extract local texture histogram features. The 2D-DCT operator is used to extract the coefficient matrix and flatten it from low to high frequencies.
[0066] Then, the three vectors are concatenated along the feature dimension to obtain the one-dimensional feature vectors $F_t$ and $F_{t+1}$. The mutation score is obtained by calculating the cosine similarity between the two frames, subtracting it from 1, and then adding the L1 distance (weighted by 0.1). After traversing the T frames, the frame with the highest score is selected as the input keyframe $I_{in}$ using the argmax() function.
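Two of the three operators are simple enough to sketch directly in NumPy: a radius-1, 8-neighbor LBP histogram and an orthonormal 2D DCT-II. The Canny edge map is deliberately omitted to keep the sketch dependency-free, so `frame_features` below concatenates only two of the three vectors. The function names, the neighbor ordering, the `>=` comparison rule, and the row-major flattening of DCT coefficients (rather than a strict low-to-high ordering) are assumptions.

```python
import numpy as np

def lbp_histogram(gray):
    """Radius-1, 8-neighbor LBP codes on interior pixels, as a normalized 256-bin histogram."""
    g = gray.astype(np.float64)
    center = g[1:-1, 1:-1]
    h, w = center.shape
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1), (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((h, w), dtype=np.int64)
    for bit, (dy, dx) in enumerate(offsets):
        neighbour = g[1 + dy:1 + dy + h, 1 + dx:1 + dx + w]
        codes |= (neighbour >= center).astype(np.int64) << bit
    hist = np.bincount(codes.ravel(), minlength=256).astype(np.float64)
    return hist / hist.sum()

def dct2(gray):
    """Orthonormal 2D DCT-II computed with explicit cosine basis matrices."""
    def basis(n):
        k = np.arange(n)[:, None]
        i = np.arange(n)[None, :]
        m = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
        m[0, :] = np.sqrt(1.0 / n)
        return m
    h, w = gray.shape
    return basis(h) @ gray.astype(np.float64) @ basis(w).T

def frame_features(gray):
    """Concatenate the LBP histogram and flattened DCT coefficients (Canny omitted here)."""
    return np.concatenate([lbp_histogram(gray), dct2(gray).ravel()])
```

On a perfectly flat image every LBP code is 255 and only the DC coefficient of the DCT is nonzero, which makes a convenient sanity check before running on real frames.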
[0067] Under the constraint of the spatial mask, the keyframe is filtered and reconstructed in the frequency domain. A wavelet transform library (such as PyWavelets) is used to perform a two-dimensional Haar discrete wavelet transform on the keyframe $I_{in}$, yielding the low-frequency component LL and the high-frequency components (LH, HL, HH). The key operation is to set the LL matrix to all zeros and then call the inverse wavelet transform to reconstruct the image residual tensor $I_{high}$, which contains only the details. The MediaPipe face mesh detection interface is called to obtain facial keypoints, and a solid binary mask is drawn using OpenCV's fillPoly function. Then, the dilate function is called to perform a dilation operation using a 5×5 kernel, generating a mask tensor $M$ with smooth edges. Using PyTorch's tensor broadcasting mechanism, the following pixel-level operation is performed:
[0068] $I_{out} = I_{in} - \alpha \cdot M \odot I_{high}$,
[0069] This operation outputs the perturbation sample with disrupted high frequencies, which is used as the high-frequency stripped sample $I_{out}$.
[0070] Load the pre-trained DINOv2 (ViT-Small) model using torch.hub and freeze it by setting requires_grad to False for all of its parameters. The keyframe $I_{in}$ and the high-frequency stripped sample $I_{out}$ are input in turn, the features of the last layer are extracted, and their spatial resolution is resampled to 16×16 using bilinear interpolation. Then, a shallow network consisting of 4 layers of Conv2d+BatchNorm2d+ReLU is constructed to extract the texture feature maps of both inputs, with the output size also being 16×16. The two feature maps are concatenated along the channel dimension and fed into an SE-Block consisting of global average pooling (GAP) and fully connected layers to calculate channel attention weights, which are then used to reweight the features before output.
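The channel-attention fusion can be sketched as a squeeze-and-excitation style gate over the concatenated channels. This is an illustration in NumPy, not the embodiment's PyTorch module: the function name, the two-layer bottleneck with ReLU and sigmoid, and the random weights are assumptions based on the usual SE design.

```python
import numpy as np

def se_fuse(semantic, texture, w1, w2):
    """Concatenate two (B, C, H, W) maps on channels and reweight them with SE-style gates."""
    x = np.concatenate([semantic, texture], axis=1)   # (B, C1 + C2, H, W)
    squeezed = x.mean(axis=(2, 3))                    # global average pooling -> (B, C1 + C2)
    hidden = np.maximum(squeezed @ w1, 0.0)           # bottleneck layer + ReLU (assumed)
    gates = 1.0 / (1.0 + np.exp(-(hidden @ w2)))      # sigmoid gates in (0, 1) (assumed)
    return x * gates[:, :, None, None]                # per-channel reweighting

rng = np.random.default_rng(0)
semantic = rng.random((2, 3, 4, 4))
texture = rng.random((2, 5, 4, 4))
fused = se_fuse(semantic, texture, rng.random((8, 2)), rng.random((2, 8)))
```

The same fusion would be applied to both streams, yielding the original and stripped feature maps that are later subtracted.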
[0071] Then, tensor subtraction and absolute value operations are performed:
[0072] $M_{diff} = \lvert M_o - M_p \rvert$,
[0073] The output tensor size is (B, C, 16, 16), where B is the batch size and C is the number of feature channels.
[0074] Subsequently, the difference feature map between the original features and the features after high-frequency removal is reshaped and transposed into node features of size (B, 256, C), where 256 is the number of nodes N in the graph. Query and Key tensors are generated through two independent linear layers; after matrix multiplication, a softmax layer produces a dense adjacency matrix of size (B, 256, 256). A chain matrix multiplication, $H = A X W$, then aggregates the non-local node information, where $X$ denotes the node features and $W$ the weight matrix. An importance weight for each node is predicted through a linear layer, and the weighted sum of the 256 nodes yields a single one-dimensional feature vector. Finally, this vector is fed into a multilayer perceptron consisting of Linear->ReLU->Linear layers, which outputs 2-dimensional logits representing the classification confidence for Real and Fake.
[0075] This invention achieves face forgery detection by constructing an end-to-end deep learning network. It does not rely on fragile single-point pixels or static frequency statistics, but instead constructs a global graph topology based on local response differences, thus tolerating the loss of local artifacts. As shown in Table 1, in intra-dataset evaluation, this invention achieves an AUC of 99.58% and an accuracy (ACC) of 99.28% on the base dataset FF++, achieving near-perfect detection. More importantly, even on the DFDC dataset, which contains significant real-world perturbations and heavy video compression and on which most existing methods experience performance degradation, this invention still achieves an extremely high accuracy of 97.35% and an AUC of 99.58%, comprehensively surpassing all compared existing technologies. This demonstrates that this invention is highly robust to complex image post-processing.
[0076] Table 1. Comparison of detection performance on homologous datasets (intra-database evaluation)
[0077]
[0078] This invention breaks away from the conventional mindset of passively detecting artifacts and pioneers a new detection paradigm of "active perturbation-response." The model learns the universal physical law that "real faces are rich in detail, and feature changes drastically after high-frequency stripping; fake faces are excessively smooth, with weak feature changes," thus completely solving the problem of models easily overfitting to specific generators.
[0079] Table 2. Cross-dataset generalization performance comparison (Source training set: FF++)
[0080]
[0081] As shown in Table 2, in the highly challenging cross-dataset generalization experiments, all models were trained solely on the FF++ dataset. Faced with unseen target datasets, the performance of existing techniques experienced a precipitous drop (for example, on DFDC, the AUC of most comparative methods plummeted to the 60%-70% range). However, this invention maintained an AUC of 90.29% on the DFDC dataset (significantly exceeding the current best baseline model by more than 9%), and also achieved an extremely high accuracy of 85.82% on the high-quality forged dataset Celeb-DF. This fully demonstrates the excellent universality of this invention when facing unknown generation techniques.
[0082] The three core technical modules of the deepfake detection network of this invention—high-frequency texture stripping, graph reasoning, and frozen semantic priors—complement each other and are indispensable, as shown in Tables 3 to 6:
[0083] Table 3 Ablation Experiment of High-Frequency Stripping Module
[0084]
[0085] Table 4 Ablation Experiment of Graph Reasoning Module
[0086]
[0087] Table 5 Semantic Prior Module Ablation Experiment
[0088]
[0089] Table 6. Inference Mechanism Replacement Ablation Experiment
[0090]
[0091] As shown in Tables 3 to 6, if only RGB images or simple Gaussian blur are used instead of the Haar wavelet stripping of this invention, the AUC of the model on cross-domain DFDC drops sharply to 62.14% and 76.66%, respectively, proving that actively extracting high-frequency wavelet residuals is the most effective means of revealing forged features, thus proving the necessity of high-frequency stripping. If standard global pooling (GAP) or standard graph network (StandardGCN) is used instead of the differential graph inference of this invention, the cross-domain generalization ability decreases significantly, proving that the mechanism of dynamically constructing facial semantic topology graphs in this invention can more effectively capture global physiological inconsistencies, thus proving the necessity of graph inference. If the frozen large model is replaced with a trainable network (ResNet / Xception), the model performs extremely well in tests within FF++, but immediately collapses in cross-domain tests (AUC drops to 71%-73%), proving that the strategy of freezing the parameters of the large model in this invention is the key to preventing overfitting and improving scalability, thus proving the necessity of freezing semantic priors.
[0092] The above content merely illustrates the technical concept of the present invention and should not be construed as limiting the scope of protection of the present invention. Any modifications made based on this technical solution fall within the scope of protection of the technical concept proposed in this invention and are included in the claims of this invention.
Claims
1. A deepfake detection method based on active high-frequency texture stripping and difference graph inference, characterized in that a deepfake detection network is constructed and trained to detect video frames. The deepfake detection network includes a heterogeneous feature capture module, a high-frequency texture stripping module, and a difference graph inference module. After the video frame sequence is input into the deepfake detection network, the following steps are performed: S1, the heterogeneous feature capture module extracts multidimensional heterogeneous features of the current image frame, and then concatenates and fuses these features to obtain a comprehensive feature vector. Then, the dynamic difference between the comprehensive feature vectors of two adjacent frames is calculated using a heterogeneity score function, and the mutation keyframe $I_{in}$ is selected based on the heterogeneity scores of the image frames; S2, for the mutation keyframe $I_{in}$, the high-frequency texture stripping module first uses a wavelet transform to reconstruct the high-frequency residual map $I_{high}$. Then, spatial domain ROI masking is performed to generate the region of interest (ROI) mask M. Finally, the original keyframe $I_{in}$, the high-frequency residual map $I_{high}$, and the ROI mask M are linearly weighted, so that the high-frequency texture of the keyframe is actively stripped to generate a perturbation sample, i.e., the high-frequency stripped sample $I_{out}$; S3, the difference graph reasoning module uses a dual-stream network branch to extract features from the original keyframe and the high-frequency stripped sample respectively. The features are subtracted and the difference features are transformed into a graph structure. Global physiological consistency topological reasoning is performed through a graph convolutional network (GCN) to finally output the real/fake classification result.
2. The deepfake detection method based on active high-frequency texture stripping and difference graph inference according to claim 1, characterized in that... The multidimensional heterogeneous features include Canny edge features, Local Binary Pattern (LBP) features, and Discrete Cosine Transform (DCT) coefficients.
3. The deepfake detection method based on active high-frequency texture stripping and difference graph reasoning according to claim 1, characterized in that the heterogeneity score in step S1 is calculated using the following formula: [formula not reproduced in the source text]; where β is the equilibrium hyperparameter and the remaining two terms are the comprehensive feature vectors of two adjacent frames; the image frame with the largest heterogeneity score S(t) is selected as the final mutation keyframe.
4. The deepfake detection method based on active high-frequency texture stripping and difference graph reasoning according to claim 1, characterized in that the equilibrium hyperparameter β is set to 0.1.
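The formula of claim 3 is not reproduced in this text, so the sketch below uses a stand-in score that is assumed purely for illustration: the Euclidean distance between adjacent comprehensive feature vectors plus β times their cosine distance, with β = 0.1 as in claim 4. Only the selection rule (take the frame with the largest S(t)) comes from the claims; the score definition and function names are hypothetical.

```python
import numpy as np

def heterogeneity_scores(features, beta=0.1):
    """features: (T, D) array, one comprehensive feature vector per frame.

    Returns an array S of length T; S[0] is 0 because the first frame
    has no predecessor. The score form is an assumption, not the claimed formula.
    """
    f = np.asarray(features, dtype=np.float64)
    prev, curr = f[:-1], f[1:]
    l2 = np.linalg.norm(curr - prev, axis=1)
    denom = np.linalg.norm(curr, axis=1) * np.linalg.norm(prev, axis=1) + 1e-12
    cos_dist = 1.0 - np.sum(curr * prev, axis=1) / denom
    return np.concatenate([[0.0], l2 + beta * cos_dist])

def select_keyframe(features, beta=0.1):
    """Index of the frame with the largest heterogeneity score S(t)."""
    return int(np.argmax(heterogeneity_scores(features, beta)))
```

For example, for a four-frame sequence whose feature vector changes only between the second and third frames, `select_keyframe` returns index 2, the first frame after the change.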
5. The deepfake detection method based on active high-frequency texture stripping and difference graph reasoning according to claim 1, characterized in that the implementation process of step S2 is as follows:
Frequency-domain high-frequency separation: the Haar discrete wavelet transform (Haar DWT) is applied to the keyframe for multi-band decomposition, and an inverse discrete wavelet transform (IDWT) is performed using only the high-frequency components in the horizontal (LH), vertical (HL), and diagonal (HH) directions to reconstruct a high-frequency residual map that contains only fine-grained details;
Spatial-domain ROI mask generation: a face keypoint detector is used to extract the facial contour and generate an initial binary mask, and morphological dilation is performed on the initial mask to generate a region-of-interest (ROI) mask M covering the core face and its transition boundaries;
Linear weighted stripping: an adaptive intensity factor α is introduced, and the original keyframe, the ROI mask M, and the high-frequency residual map are combined by linear weighted subtraction to generate the high-frequency stripped sample: [formula not reproduced in the source text].
6. The deepfake detection method based on active high-frequency texture stripping and difference graph reasoning according to claim 5, characterized in that the adaptive intensity factor α is set to 0.5.
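The stripping step of claims 5 and 6 can be sketched as follows, under stated assumptions. The single-level Haar transform is written out by hand in NumPy, the ROI mask is taken as an input (face keypoint detection and morphological dilation are not shown), and the stripping rule `keyframe - alpha * mask * residual` is one reading of "linear weighted subtraction", since the claimed formula is not reproduced in this text. All function names are hypothetical, and subband naming (LH versus HL) varies between libraries.

```python
import numpy as np

def haar_dwt2(x):
    """Single-level 2-D Haar DWT of an array with even height and width."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    ll = (a + b + c + d) / 2.0  # approximation
    lh = (a + b - c - d) / 2.0  # horizontal detail
    hl = (a - b + c - d) / 2.0  # vertical detail
    hh = (a - b - c + d) / 2.0  # diagonal detail
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2."""
    h, w = ll.shape
    x = np.empty((2 * h, 2 * w), dtype=np.float64)
    x[0::2, 0::2] = (ll + lh + hl + hh) / 2.0
    x[0::2, 1::2] = (ll + lh - hl - hh) / 2.0
    x[1::2, 0::2] = (ll - lh + hl - hh) / 2.0
    x[1::2, 1::2] = (ll - lh - hl + hh) / 2.0
    return x

def high_freq_residual(gray):
    """Reconstruct from LH, HL, HH only (LL set to zero)."""
    ll, lh, hl, hh = haar_dwt2(gray.astype(np.float64))
    return haar_idwt2(np.zeros_like(ll), lh, hl, hh)

def strip_high_freq(gray, mask, alpha=0.5):
    """Subtract the masked, alpha-weighted high-frequency residual (assumed rule)."""
    return gray - alpha * mask * high_freq_residual(gray)
```

With a single-level Haar transform the residual equals the image minus its 2×2 block means, so `alpha=1.0` inside the mask leaves only the block-averaged content, while `alpha=0.5` removes half of the fine detail there. Pixels outside the mask are unchanged.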
7. The deepfake detection method based on active high-frequency texture stripping and difference graph reasoning according to claim 1, characterized in that the implementation process of step S3 is as follows:
Dual-stream feature extraction and fusion: the original keyframe and the high-frequency stripped sample are separately input into a constructed dual-stream network with shared weights; the dual-stream network includes a DINOv2 visual model with frozen pre-trained parameters and a lightweight CNN; an adaptive feature fusion (AFF) module driven by a channel-level attention mechanism aligns and fuses the semantic features output by DINOv2 with the texture features output by the CNN to obtain the original feature map M_o and the stripped feature map M_p;
Difference map generation and graph structure flattening: the original feature map M_o and the stripped feature map M_p are subtracted and the absolute value is taken to obtain an explicit absolute-difference feature map; its spatial dimensions are flattened into a sequence of N nodes, each node representing a local facial patch;
Dynamic adjacency matrix construction: a self-attention mechanism is introduced to dynamically calculate the strength of semantic association between nodes and construct an adaptive adjacency matrix A: [formula not reproduced in the source text]; where the two projection terms are the learnable projection matrices for the Query and the Key, respectively, and the remaining term is the scaling factor;
Graph convolutional reasoning and classification: based on the adjacency matrix A, graph convolution is performed to aggregate non-local difference information between nodes: [formula not reproduced in the source text], where the remaining term is the weight matrix; an attention pooling layer assigns higher weights to nodes with abnormally high values, aggregates the nodes into a compact graph representation vector, and feeds it into a multilayer perceptron (MLP) classifier to output the final real/fake probability.
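The graph reasoning of claim 7 can be illustrated with a small NumPy forward pass. This is a sketch under assumptions, not the claimed network: the feature maps M_o and M_p are taken as given (the DINOv2, CNN, and AFF stages are not shown), the adjacency matrix is assumed to follow the standard scaled dot-product form softmax((XW_Q)(XW_K)^T / sqrt(d)), the graph convolution is assumed to be ReLU(AXW), and all parameter names and sizes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def init_params(channels, dim, hidden, seed=0):
    """Random weights for illustration only; a real model would be trained."""
    rng = np.random.default_rng(seed)
    return {
        "w_q": rng.normal(0, 0.1, (channels, dim)),   # Query projection
        "w_k": rng.normal(0, 0.1, (channels, dim)),   # Key projection
        "w_g": rng.normal(0, 0.1, (channels, dim)),   # graph convolution weight
        "w_pool": rng.normal(0, 0.1, (dim,)),         # attention pooling vector
        "w1": rng.normal(0, 0.1, (dim, hidden)), "b1": np.zeros(hidden),
        "w2": rng.normal(0, 0.1, (hidden,)), "b2": 0.0,
    }

def difference_graph_forward(m_o, m_p, params):
    """m_o, m_p: (C, H, W) feature maps. Returns (fake probability, adjacency A)."""
    diff = np.abs(m_o - m_p)                       # absolute-difference feature map
    x = diff.reshape(diff.shape[0], -1).T          # (N, C): one node per spatial cell
    q, k = x @ params["w_q"], x @ params["w_k"]
    adj = softmax(q @ k.T / np.sqrt(q.shape[1]), axis=1)    # (N, N), rows sum to 1
    h = np.maximum(adj @ x @ params["w_g"], 0.0)   # one graph convolution layer
    pool_w = softmax(h @ params["w_pool"], axis=0) # attention weights over nodes
    graph_vec = pool_w @ h                         # compact graph representation
    hidden = np.maximum(graph_vec @ params["w1"] + params["b1"], 0.0)
    logit = hidden @ params["w2"] + params["b2"]
    return float(1.0 / (1.0 + np.exp(-logit))), adj
```

When the two feature maps are identical, the difference map is zero, the adjacency matrix is uniform, and this sketch (with zero biases) outputs a probability of exactly 0.5, which makes explicit that the classifier sees only how the features respond to stripping.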