Video deepfake detection method based on region perception and graph convolution network

By using a region-aware and graph convolutional network-based approach, global, local, and frequency domain features of face images are extracted. Combined with graph structure modeling, this solves the problem of insufficient cross-frame correlation information capture in existing technologies, achieving high robustness and generalization ability in detecting high-quality forged videos, and accurately locating the tampered areas.

CN122244767A · Pending · Publication Date: 2026-06-19 · XIHUA UNIV

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Applications(China)
Current Assignee / Owner
XIHUA UNIV
Filing Date
2026-04-17
Publication Date
2026-06-19

AI Technical Summary

Technical Problem

Existing video deepfake detection methods lack robustness when dealing with high-quality forgeries and complex compression distortion scenarios, making it difficult to capture cross-frame correlation information, resulting in low detection accuracy and poor generalization ability.

Method used

By constructing a method based on region perception and graph convolutional networks, global, local and frequency domain features of face images are extracted. Combined with graph structure modeling, the relationship between different key facial regions and frames is characterized. Hierarchical graph convolutional networks are used for feature propagation and relationship modeling. A temporal dynamic weight mechanism is introduced to enhance detection capabilities.

Benefits of technology

It significantly improves the robustness and generalization ability of detection under complex forgery methods and video compression interference, and can more accurately identify forged videos and locate the tampered areas.

✦ Generated by Eureka AI based on patent content.

Smart Images

  • Figure CN122244767A_ABST
Patent Text Reader

Abstract

This invention relates to the field of video tampering detection technology, and discloses a video deepfake detection method based on region perception and graph convolutional networks, comprising the following steps: acquiring N frames of face images from a face video to be tested, and extracting the global feature vector of each frame of face image; performing face region segmentation on each frame of face image, and extracting local feature vectors of each key facial region; extracting weighted regional frequency domain features based on the local feature vectors; constructing a weighted composite feature matrix of the face image of that frame; constructing a region-level graph structure and a frame-level graph structure; inputting the region-level graph structure and the frame-level graph structure into a hierarchical graph convolutional network for feature propagation and relation modeling to obtain the final layer feature vector corresponding to each frame of face image; and classifying the face image as genuine or fake based on the final layer feature vector. This invention is not limited to single-frame analysis or local region features, but also effectively captures the high-order semantic associations and temporal consistency constraints across frames in face videos through graph structure modeling.

Description

Technical Field

[0001] This invention relates to the field of video tampering detection technology, and in particular to a video deepfake detection method based on region perception and graph convolutional networks.

Background Technology

[0002] The widespread use of graphical interactive software and the rapid development of Generative Adversarial Networks (GANs) have made it easier to manipulate facial images in videos. Such manipulated videos spread easily through online platforms and pose a significant security threat. Deepfake is a common forgery technique that uses deep learning algorithms to locally replace, splice, or synthesize parts of the original video, creating highly realistic forgeries. Because this manipulation often relies on local information from the same video source and achieves seamless integration through deep learning models, the result looks extremely natural and is difficult to detect. Currently, widely used deepfake detection methods mainly include spatial feature-based methods, local region analysis-based methods, and frequency domain inconsistency-based methods.

[0003] Spatial feature-based methods primarily use global visual artifacts in images, such as texture, color, and lighting, for forgery detection. For example, convolutional neural networks (CNNs) such as MesoNet and Xception perform end-to-end classification of face images directly, learning subtle spatial differences between real and forged videos to identify fake content. These methods typically take face images as input, extract features ranging from low-order edges and textures to high-order semantics through multi-level convolutional operations, and finally output the real/fake classification probability via fully connected layers. Their advantage is that they require no complex preprocessing or manually designed features. Their disadvantage is that they are limited to spatial features: they analyze only a single frame and fail to comprehensively capture the structural and relational aspects of forgery traces. Their effectiveness is particularly limited on low-quality videos, and they have difficulty locating specific forged regions.

[0004] Methods based on local region analysis focus on anomalies that are prone to occur in key facial regions (such as eyes, lips, and teeth) during forgery, such as asymmetrical structures, inconsistent lighting, or texture discontinuities. These methods typically first use facial landmark detection or region proposal networks to locate facial sub-regions, then extract deep features from these sub-regions for fine-grained analysis, and determine whether forgery has occurred by comparing the consistency of local features with the overall face or with other regions. However, this approach does not adequately consider higher-order structural relationships: it focuses only on local features or short-term motion patterns, ignores global structural anomalies introduced by forgery operations, and fails to capture the complex intrinsic relationships between facial regions.

[0005] Methods based on frequency domain inconsistency start from video characteristics and analyze, in the frequency domain, the physiological or motion anomalies introduced by forgery operations. These methods typically combine convolutional neural networks and recurrent neural networks to extract local anomaly patterns and temporal dynamic features of face images in the frequency domain, and then use a classifier to determine the authenticity of the video. For example, F3Net uses frequency-aware cues combined with the discrete cosine transform and cross-attention mechanisms to improve detection accuracy. However, these methods lack the ability to capture cross-frame correlation cues and fail to fully exploit the manipulation traces and consistency issues across multiple frames in deepfake videos, resulting in insufficient robustness to complex forgery types and compression distortions. They also generalize poorly across datasets, because they have difficulty modeling the similarity and structural consistency of forgery traces between different samples.

Summary of the Invention

[0006] This invention provides a video forgery detection method that is not limited to single-frame analysis or local region features. It effectively captures high-order semantic associations and temporal consistency constraints across frames in face videos through graph structure modeling. This improves detection robustness and generalization ability when facing high-quality forgery and complex compression distortion scenarios. To this end, this invention provides a video deepfake detection method based on region perception and graph convolutional networks.

[0007] To achieve the above-mentioned objectives, the embodiments of the present invention provide the following technical solutions:

[0008] A video deepfake detection method based on region perception and graph convolutional networks includes the following steps:

[0009] Step 1: Acquire N frames of face images from the face video to be tested, and extract the global feature vector of each frame of face image; perform face region segmentation on each frame of face image to obtain K key facial regions, and extract the local feature vector of each key facial region; extract the weighted region frequency domain features based on the local feature vectors; fuse the global feature vector and the weighted region frequency domain features of the same frame of face image to construct the weighted composite feature matrix of the frame of face image;

[0010] Step 2: Construct a region-level graph structure and a frame-level graph structure based on the weighted composite feature matrix corresponding to N frames of face images to represent the relationships between different key facial regions and different frames of face images; introduce a temporal dynamic weight mechanism into the constructed frame-level graph structure to characterize the temporal change features between video frames; input the region-level graph structure and the frame-level graph structure into a hierarchical graph convolutional network for feature propagation and relationship modeling to obtain the final layer feature vector corresponding to each frame of face image; classify the face images based on the final layer feature vector to determine whether each frame of face image is a fake face image or a real face image.

[0011] In the above scheme, an end-to-end localization mechanism based on region-aware graph structure modeling is introduced, running from nodes to binary masks to pixel coordinates. Multi-frame face images are constructed into region-level and frame-level graph structures to characterize the correlations between different key facial regions and between face images in different frames. When constructing the frame-level graph structure, a temporal dynamic weight mechanism is introduced to adaptively adjust the connections between nodes, thereby effectively capturing the temporal change features between video frames. This solves the problem that existing technologies rely solely on single-frame analysis and have difficulty modeling cross-frame correlation information, and it significantly improves the robustness of detection against complex forgery methods and video compression interference. By simultaneously fusing the global feature vector of the face image, the local feature vector based on the mask matrix, and the corresponding weighted regional frequency domain features, a weighted composite feature matrix is constructed. The global feature vector retains the overall illumination distribution, structural morphology, and contextual semantic information of the face; the local feature vector characterizes fine-grained abnormal changes in key facial regions such as the eyes and mouth; and the weighted regional frequency domain features capture forgery traces left by the generation model in texture details and high-frequency components. Furthermore, by constructing a hierarchical graph convolutional network to perform feature propagation and relationship modeling on the region-level and frame-level graph structures, the model can learn the structural relationships between nodes at both the region level and the frame level, thereby further improving the ability to identify complex deepfake patterns.

[0012] Furthermore, step 1, the step of extracting the global feature vector of each frame of face image, includes:

[0013] $$F_i^{g} = f_g(I_i)$$

[0014] where $I_i$ is the face image of the i-th frame, $i = 1, 2, \ldots, N$; $F_i^{g}$ is the global feature vector of the i-th frame of the face image; and $f_g(\cdot)$ denotes the global feature extraction network.

[0015] Furthermore, in step 1, the step of segmenting the face region of each frame of the face image to obtain K key facial regions, and extracting the local feature vector of each key facial region, includes:

[0016] The Mask R-CNN model is used to perform pixel-level semantic segmentation on the i-th frame face image $I_i$, obtaining a set of binary masks for the K key facial regions:

[0017] $$\{M_{i,t}\}_{t=1}^{K}, \quad M_{i,t} \in \{0,1\}^{H \times W}$$

[0018] where H and W are the height and width of the face image $I_i$, and $M_{i,t}$ denotes the mask matrix of the t-th key facial region of the face image $I_i$, $t = 1, 2, \ldots, K$;

[0019] The mask matrix $M_{i,t}$ is applied to the face image $I_i$ to obtain the local feature vector:

[0020] $$R_{i,t} = M_{i,t} \odot I_i$$

[0021] where $R_{i,t}$ denotes the local feature vector of the t-th key facial region in the i-th frame of the face image, and $\odot$ denotes element-wise multiplication.
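The masking step above can be illustrated with a minimal NumPy sketch. It assumes the K binary masks are already available as an array (the source obtains them from Mask R-CNN, which is not reproduced here); the function name `apply_region_masks` and the toy "eyes" and "mouth" masks are hypothetical.

```python
import numpy as np

def apply_region_masks(image, masks):
    """Return one masked copy of `image` per binary mask.

    image: (H, W, C) array; masks: (K, H, W) array of 0/1 values.
    """
    # Broadcast each (H, W) mask over the channel axis: element-wise product.
    return masks[:, :, :, None] * image[None, :, :, :]

# Toy example: a 4x4 RGB "face" and K=2 hypothetical region masks.
image = np.arange(4 * 4 * 3, dtype=np.float64).reshape(4, 4, 3)
masks = np.zeros((2, 4, 4))
masks[0, :2, :] = 1.0   # hypothetical "eyes" region: top half
masks[1, 2:, :] = 1.0   # hypothetical "mouth" region: bottom half
regions = apply_region_masks(image, masks)   # shape (K, H, W, C)
```

Pixels outside a region become zero, so each masked copy carries only the content of its own key facial region.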

[0022] Furthermore, in step 1, the step of extracting weighted region frequency domain features based on local feature vectors includes:

[0023] Extracting spatial domain features based on local feature vectors:

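[0024] $$F_{i,t}^{s} = f_s(R_{i,t})$$

[0025] where $R_{i,t}$ is the local feature vector of the t-th key facial region in the i-th frame; $F_{i,t}^{s}$ denotes the spatial domain features of the t-th key facial region; and $f_s(\cdot)$ denotes the spatial feature extraction network.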

[0026] In the above scheme, spatial domain features can describe the texture structure, edge information and color distribution features in the regional image, which plays an important role in capturing structural anomalies in local areas of the face.

[0027] However, relying solely on spatial domain features is insufficient to fully characterize the subtle frequency artifacts produced by deepfakes; therefore, frequency domain features need to be further introduced. To enhance the ability to perceive the generation of forgery traces, frequency domain features are extracted from key facial regions based on local feature vectors to obtain corresponding frequency domain feature representations:

[0028] $$R_{i,t}^{\mathrm{dct}} = \mathrm{DCT}(R_{i,t})$$

[0029] where $R_{i,t}^{\mathrm{dct}}$ is the frequency domain representation of the local feature vector $R_{i,t}$, and DCT is the Discrete Cosine Transform, which maps an image from the spatial domain to the frequency domain and separates its high-frequency and low-frequency information. Since deepfake images often introduce unnatural frequency distributions during generation, such as abnormal high-frequency noise or inconsistent spectral structures, frequency domain analysis can effectively capture these forgery traces.

[0030] The frequency domain representation $R_{i,t}^{\mathrm{dct}}$ is then input to the frequency domain feature extraction network to extract frequency domain features:

[0031] $$F_{i,t}^{f} = f_f(R_{i,t}^{\mathrm{dct}})$$

[0032] where $F_{i,t}^{f}$ denotes the frequency domain features of the t-th key facial region, and $f_f(\cdot)$ denotes the frequency domain feature extraction network.

[0033] In order to comprehensively utilize spatial domain information and frequency domain information, spatial domain features and frequency domain features are fused:

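[0034] $$F_{i,t} = \mathrm{Concat}(F_{i,t}^{s}, F_{i,t}^{f})$$

[0035] where $F_{i,t}$ denotes the fused feature of the t-th key facial region, $F_{i,t}^{s}$ and $F_{i,t}^{f}$ are its spatial domain and frequency domain features, and Concat is the concatenation operation.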
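As a rough illustration of the spatial/frequency fusion described above, the following sketch computes a 2-D DCT of a masked region and concatenates spatial and frequency features. The functions `spatial_net` and `frequency_net` are hypothetical stand-ins built from simple statistics; the source uses learned feature extraction networks whose architecture is not given here.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis matrix of size n x n."""
    k = np.arange(n)[:, None]          # frequency index
    m = np.arange(n)[None, :]          # sample index
    C = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * m + 1) * k / (2 * n))
    C[0, :] /= np.sqrt(2.0)            # rescale the DC row for orthonormality
    return C

def dct2(patch):
    """2-D DCT-II of a single-channel (H, W) patch."""
    h, w = patch.shape
    return dct_matrix(h) @ patch @ dct_matrix(w).T

# Hypothetical stand-ins for the spatial and frequency feature networks:
# simple summary statistics, NOT the networks used in the source.
def spatial_net(patch):
    return np.array([patch.mean(), patch.std()])

def frequency_net(spectrum):
    h, w = spectrum.shape
    low = np.abs(spectrum[: h // 2, : w // 2]).mean()    # low-frequency energy
    high = np.abs(spectrum[h // 2 :, w // 2 :]).mean()   # high-frequency energy
    return np.array([low, high])

def fused_region_feature(patch):
    spatial = spatial_net(patch)
    frequency = frequency_net(dct2(patch))
    return np.concatenate([spatial, frequency])   # Concat(spatial, frequency)

patch = np.ones((8, 8))                 # toy masked region, constant intensity
feature = fused_region_feature(patch)
```

For the constant toy patch all energy sits in the DC coefficient, so the high-frequency component of the fused feature is zero; a region with generation artifacts would be expected to show more high-frequency energy.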

[0036] In the above scheme, by jointly representing spatial domain features and frequency domain features, the weighted composite feature matrix can simultaneously possess macroscopic semantic understanding ability and microscopic forgery trace perception ability, thereby avoiding the limitations brought about by single feature analysis, solving the problems of global features ignoring local details and local features lacking overall semantic information, and realizing a more comprehensive and accurate expression of face forgery features.

[0037] Different key facial regions exhibit varying degrees of importance in deepfake detection; for example, the eye and mouth areas are more prone to artifact generation. To highlight the feature information of key facial regions, a region attention mechanism is introduced to weight the fused features of each key facial region:

[0038]

[0039] where $a_{i,t}$ denotes the attention score of the t-th key facial region, $w$ denotes the attention weight, and $\top$ denotes matrix transpose;

[0040] Weighted region frequency domain features are obtained based on attention scores:

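[0041] $$\hat{F}_{i,t} = a_{i,t} F_{i,t}$$

[0042] where $\hat{F}_{i,t}$ denotes the weighted region frequency domain feature of the t-th key facial region, $a_{i,t}$ is its attention score, and $F_{i,t}$ is its fused feature.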

[0043] In the above scheme, the region attention mechanism can adaptively highlight the regional features that are more important for forgery detection, while suppressing interference from irrelevant regions.
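A minimal sketch of region attention follows. The exact score formula is not recoverable from this text, so the sketch assumes a softmax over the K regions of the inner product between the attention weight and each fused feature; the function name `region_attention` is hypothetical.

```python
import numpy as np

def region_attention(F, w):
    """Weight K fused region features by softmax attention scores.

    F: (K, d) fused features, one row per key facial region.
    w: (d,) attention weight vector.
    """
    scores = F @ w                       # w^T F_t for every region t
    scores = scores - scores.max()       # stabilise the exponentials
    a = np.exp(scores) / np.exp(scores).sum()   # assumed softmax over regions
    return a, a[:, None] * F             # scores and re-weighted features

F = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])   # toy features, K=3, d=2
w = np.array([1.0, 1.0])
a, F_weighted = region_attention(F, w)
```

In this toy case the third region has the largest inner product with `w`, so it receives the highest attention score and the least attenuation.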

[0044] Furthermore, in step 1, the step of fusing the global feature vector and weighted regional frequency domain features of the same frame of face images to construct a weighted composite feature matrix of that frame of face images includes:

[0045]

[0046] where the left-hand side denotes the weighted composite feature matrix of the frame of face image.

[0047] Furthermore, step 2, the step of constructing a region-level graph structure based on the weighted composite feature matrix corresponding to N frames of face images, includes:

[0048] The K weighted regional frequency domain features contained in the weighted composite feature matrix of the i-th frame face image are integrated into a region node feature matrix $V_i$.

[0049] In the constructed region-level graph structure, each region node $v_{i,t}$ corresponds to the t-th key facial region in the i-th frame of the face image, and its node feature is given by the corresponding weighted region feature.

[0050] To characterize the relationships between different key facial regions in the same frame of a face image, the feature similarity between region nodes is calculated:

[0051]

[0052] where the left-hand side denotes the similarity between the t-th region node $v_{i,t}$ and the s-th region node $v_{i,s}$ in the i-th frame of the face image, computed from the node features of $v_{i,t}$ and $v_{i,s}$.

[0053] Based on this, construct a region-level adjacency matrix:

[0054]

[0055] where each matrix entry denotes the connection weight between region nodes $v_{i,t}$ and $v_{i,s}$ in the i-th frame of the face image.

[0056] To avoid an overly dense regional graph structure while retaining the most representative regional relationships, a Top-k filtering strategy is used for each regional node, establishing connections only between the k neighboring nodes with the highest similarity:

[0057]

[0058] where Topk denotes a filtering operation that selects the k largest values in row t. With this filtering strategy, only the k most similar neighboring nodes of each region node are retained, and the connection weights of the remaining nodes are reset to 0. Integrating the retained weights yields the region-level adjacency matrix $A_i$, whose dimensions are K×K.
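The region-level graph construction can be sketched as follows. The similarity formula is not recoverable from this text, so cosine similarity is assumed here; the function names are hypothetical, and self-connections are excluded from the Top-k ranking as a design choice of this sketch (it requires k ≤ K−1).

```python
import numpy as np

def cosine_similarity(X):
    """Pairwise cosine similarity between the rows of X."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.maximum(norms, 1e-12)    # guard against zero-norm rows
    return Xn @ Xn.T

def topk_adjacency(S, k):
    """Keep the k largest off-diagonal entries in each row of S; zero the rest."""
    n = S.shape[0]
    ranked = S.copy()
    np.fill_diagonal(ranked, -np.inf)            # a node is not its own neighbour
    idx = np.argsort(-ranked, axis=1)[:, :k]     # indices of the k best neighbours
    A = np.zeros_like(S)
    rows = np.arange(n)[:, None]
    A[rows, idx] = S[rows, idx]
    return A

# Toy node features for K=4 regions (rows), feature dimension 3.
X = np.array([[1.0, 0.0, 0.0],
              [0.9, 0.1, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 0.9, 0.1]])
A = topk_adjacency(cosine_similarity(X), k=1)
```

With k=1, each region keeps only its single most similar neighbour, so regions 0 and 1 link to each other, as do regions 2 and 3.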

[0059] Finally, the region-level graph structure of the i-th frame face image is constructed:

[0060] $$G_i = (V_i, E_i, A_i)$$

[0061] where $G_i$ denotes the region-level graph structure of the i-th frame of the face image; $V_i$ is the feature matrix of the region nodes; $E_i$ is the set of edges between region nodes; and $A_i$ is the region-level adjacency matrix.

[0062] Furthermore, step 2, the step of constructing a frame-level graph structure based on the weighted composite feature matrix corresponding to N frames of face images, includes:

[0063] The weighted composite feature matrices corresponding to the N frames of face images are integrated into a node feature matrix $V$.

[0064] In the constructed dynamically weighted graph structure, each node $v_i$ corresponds to one frame of face image in the face video to be tested, and its node feature is given by the corresponding weighted composite feature matrix.

[0065] To characterize the dynamic relationships between video frames of the face under test, a temporal dynamic weighting mechanism is introduced to describe the degree of change between features in different video frames:

[0066]

[0067] where the left-hand side denotes the dynamic weight from node $v_i$ to node $v_j$; the adjustment parameter controls the degree of influence of the dynamic weights on the frame-level graph structure; the formula takes the node features of $v_i$ and $v_j$ as inputs; and $\|\cdot\|_2$ denotes the L2 norm of a vector.

[0068] In the above scheme, by introducing a temporal dynamic weighting mechanism, the connection strength between nodes can be adaptively adjusted according to the feature differences between different video frames, so that the frame-level graph structure can better characterize the changing relationships between video frames.

[0069] Based on this, the feature similarity between nodes is calculated:

[0070]

[0071] where $Q_{i,j}$ denotes the feature similarity between node $v_i$ and node $v_j$.

[0072] Construct a frame-level adjacency matrix using dynamic weights:

[0073]

[0074] where $A_{i,j}$ denotes the connection weight from node $v_i$ to node $v_j$.

[0075] To avoid an overly dense frame-level graph structure while retaining the most representative neighbor nodes, a Top-k filtering strategy is used to filter node connectivity relationships:

[0076]

[0077] where $\mathrm{Topk}(A_{i,:})$ denotes a filtering operation that selects the k largest values in the i-th row. With this filtering strategy, only the k most similar neighboring nodes of each node are retained, and the connection weights of the rest are reset to 0. The retained weights are then integrated to obtain the frame-level adjacency matrix A, which has dimensions N×N.

[0078] In the above scheme, the Top-k selection strategy constructs sparse and representative node connections, thereby reducing the interference of redundant connections on model training. For nodes with established connections, their connection weights $A_{i,j}$ are integrated into the frame-level adjacency matrix A.
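The frame-level adjacency construction can be sketched under explicit assumptions, since the formulas for the dynamic weight and the similarity are not recoverable from this text. The sketch assumes an exponential decay of the L2 feature distance for the dynamic weight (with a hypothetical parameter `gamma`), cosine similarity for Q, an element-wise product to combine them, and per-frame features flattened to vectors.

```python
import numpy as np

def frame_adjacency(X, gamma=1.0, k=2):
    """Sparse frame-level adjacency from per-frame feature vectors X of shape (N, d)."""
    # Assumed dynamic weight: decays with the L2 distance between frame features.
    diff = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    omega = np.exp(-gamma * diff)
    # Assumed feature similarity Q: cosine similarity between frame features.
    Xn = X / np.maximum(np.linalg.norm(X, axis=1, keepdims=True), 1e-12)
    Q = Xn @ Xn.T
    dense = omega * Q                         # dynamic weights modulate similarity
    np.fill_diagonal(dense, -np.inf)          # exclude self-connections from ranking
    idx = np.argsort(-dense, axis=1)[:, :k]   # Top-k neighbours per row
    A = np.zeros_like(dense)
    rows = np.arange(X.shape[0])[:, None]
    A[rows, idx] = dense[rows, idx]
    return A

rng = np.random.default_rng(0)
X = rng.random((5, 8))        # toy features: N=5 frames, d=8, all positive
A = frame_adjacency(X, gamma=0.5, k=2)
```

Each row of `A` keeps exactly k non-zero weights, which gives the sparse N×N structure described above. Note that row-wise Top-k does not guarantee a symmetric matrix.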

[0079] Finally, the frame-level graph structure is constructed:

[0080] $$G = (V, E, A)$$

[0081] where $G$ denotes the frame-level graph structure; $V$ is the node feature matrix; $E$ is the set of edges between nodes; and $A$ is the frame-level adjacency matrix.

[0082] In the above scheme, by constructing a frame-level graph structure, the feature similarity relationship and dynamic change relationship between video frames can be modeled simultaneously. This allows the frame-level graph structure to not only reflect the feature similarity between different frames, but also to characterize the temporal change pattern in the video sequence, thereby providing more effective structural information for feature propagation in subsequent hierarchical graph convolutional networks.

[0083] Furthermore, the constructed region-level graph structure and frame-level graph structure are input into a hierarchical graph convolutional network for feature propagation and relation modeling. The hierarchical graph convolutional network performs convolution operations on the graph structures at both the region level and the frame level, thereby enabling simultaneous modeling of the spatial relationships between key facial regions and the temporal relationships between video frames.

[0084] First, the region-level graph structure is input into the hierarchical graph convolutional network, and graph convolution operations are performed on the region-level graph structure to characterize the structural relationships between different key facial regions in the same frame of a face image. The calculation for the (l+1)-th layer of the region-level graph structure is as follows:

[0085] $$H_{r,i}^{(l+1)} = \sigma\left(\hat{A}_i H_{r,i}^{(l)} W_r^{(l)}\right)$$

[0086] where $H_{r,i}^{(l+1)}$ denotes the region-level node feature matrix of the (l+1)-th layer; $H_{r,i}^{(l)}$ denotes the region-level node feature matrix of the l-th layer; $\hat{A}_i$ denotes the normalized region-level adjacency matrix of the region-level graph structure, obtained from the region-level adjacency matrix $A_i$; $W_r^{(l)}$ is the trainable weight matrix of the l-th layer of the region-level graph structure; and $\sigma(\cdot)$ is a non-linear activation function.

[0087] To enhance the features of region nodes and improve the stability of graph convolution operations, a self-connection term is introduced into the region-level adjacency matrix $A_i$, giving:

[0088] $$\tilde{A}_i = A_i + I$$

[0089] where $\tilde{A}_i$ is the region-level adjacency matrix with self-loops, and $I$ is the identity matrix.

[0090] Then the degree matrix $D_i$ is calculated, whose diagonal elements are $D_i(t,t) = \sum_{s} \tilde{A}_i(t,s)$. Finally, symmetric normalization is performed to obtain the normalized region-level adjacency matrix of the region-level graph structure:

[0091] $$\hat{A}_i = D_i^{-1/2} \tilde{A}_i D_i^{-1/2}$$

[0092] In the above scheme, graph convolution operation is performed on the regional graph structure. The model can model the feature relationship between different key facial regions (such as eyes, nose, mouth, etc.), thereby capturing possible forgery inconsistencies between local regions.
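A single graph convolution step with self-loops and symmetric normalization, as described above, can be sketched in NumPy. ReLU is assumed as the non-linear activation, and the function names and toy matrices are hypothetical.

```python
import numpy as np

def normalize_adjacency(A):
    """Add self-loops and apply symmetric normalization D^{-1/2} (A + I) D^{-1/2}."""
    A_tilde = A + np.eye(A.shape[0])         # adjacency with self-loops
    deg = A_tilde.sum(axis=1)                # diagonal of the degree matrix
    d_inv_sqrt = 1.0 / np.sqrt(deg)          # deg > 0 for non-negative A with self-loops
    return d_inv_sqrt[:, None] * A_tilde * d_inv_sqrt[None, :]

def gcn_layer(A_hat, H, W):
    """One graph convolution layer; ReLU is an assumed choice of activation."""
    return np.maximum(A_hat @ H @ W, 0.0)

A = np.array([[0.0, 1.0], [1.0, 0.0]])   # toy graph: two connected region nodes
A_hat = normalize_adjacency(A)
H0 = np.eye(2)                           # toy node features
W0 = np.eye(2)                           # toy weight matrix
H1 = gcn_layer(A_hat, H0, W0)
```

In this two-node example every entry of `A_hat` is 0.5, so after one layer each node's features are the average of its own and its neighbour's features.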

[0093] Subsequently, the frame-level graph structure is input into the hierarchical graph convolutional network, and graph convolution operations are performed on the frame-level graph structure to characterize the dynamic relationships between different video frames. The calculation for the (l+1)-th layer of the frame-level graph structure is as follows:

[0094] $$H_{f}^{(l+1)} = \sigma\left(\hat{A} H_{f}^{(l)} W_f^{(l)}\right)$$

[0095] where $H_{f}^{(l+1)}$ denotes the frame-level node feature matrix of the (l+1)-th layer; $H_{f}^{(l)}$ denotes the frame-level node feature matrix of the l-th layer; $\hat{A}$ denotes the normalized frame-level adjacency matrix of the frame-level graph structure, obtained from the frame-level adjacency matrix $A$; and $W_f^{(l)}$ is the trainable weight matrix of the l-th layer of the frame-level graph structure.

[0096] To enhance node features and improve the stability of graph convolution operations, a self-connection term is introduced into the frame-level adjacency matrix A, resulting in:

[0097] $$\tilde{A} = A + I$$

[0098] where $\tilde{A}$ is the frame-level adjacency matrix with self-loops, and $I$ is the identity matrix.

[0099] Then the degree matrix $D$ is calculated, whose diagonal elements are $D(i,i) = \sum_{j} \tilde{A}(i,j)$. Finally, symmetric normalization is performed to obtain the normalized frame-level adjacency matrix of the frame-level graph structure:

[0100] $$\hat{A} = D^{-1/2} \tilde{A} D^{-1/2}$$

[0101] In the above scheme, graph convolution operations are performed on the frame-level graph structure, and the model can propagate and aggregate feature information between different frames in the video sequence, thereby learning the temporal variation rules between video frames.

[0102] After performing region-level and frame-level graph convolution operations, the features from the two levels are fused to obtain a more comprehensive node representation:

[0103]

[0104] where $H_{r,i}$ denotes the region-level features corresponding to the i-th frame of the face image, output by the L-th layer of the region-level graph structure; $H_{f,i}$ denotes the frame-level features corresponding to the i-th frame of the face image, output by the L-th layer of the frame-level graph structure (both graph structures have L layers); and $H_i$ denotes the final layer feature vector of the i-th frame face image.

[0105] In the above scheme, by using hierarchical feature fusion, the model can simultaneously utilize local regional structural information and video frame temporal information, thereby obtaining richer and more stable feature representations.

[0106] To further enhance the stability of the video forgery detection model in continuous frame scenarios and reduce the model's dependence on single-frame features, a temporal consistency loss function is introduced during model training to constrain the variation in feature representations between adjacent video frames. The model's total loss function L is defined as:

[0107] $$L = L_{cls} + \lambda L_{temp}$$

[0108] where $L_{cls}$ denotes the classification loss function; $L_{temp}$ denotes the temporal consistency loss function; and $\lambda$ is the loss weighting coefficient.

[0109] The temporal consistency loss function is defined as follows:

[0110]

[0111] where $H_i$ denotes the final layer feature vector of the i-th frame face image; $H_{i-1}$ denotes the final layer feature vector of the face image in frame i-1; $W_i$ is the adaptive weight corresponding to the i-th frame of the face image; and N is the number of face image frames acquired from the video.

[0112] To further improve the model's ability to model the dynamic features of video sequences, this scheme introduces an adaptive weight mechanism $W_i$ on top of the temporal consistency constraint, which allows the constraint strength between adjacent frames to be dynamically adjusted based on feature changes. Specifically, by measuring the feature differences between adjacent frames and calculating the corresponding weight coefficients, the model can adaptively adjust the temporal constraint strength according to the degree of change between frames during training. When the feature differences between adjacent frames are small, the video content is changing relatively smoothly, and the temporal consistency constraint is strengthened to maintain feature continuity. When there are large feature changes between adjacent frames, the constraint strength is appropriately weakened to avoid over-restricting real dynamic changes (such as changes in facial pose, expression, or lighting).

[0113] By employing this adaptive temporal consistency loss function, necessary dynamic change information can be preserved while maintaining the overall smoothness of video sequence features. This allows the model to learn the temporal structural features in the video more accurately. In real videos, facial structure and texture changes are usually continuous, resulting in relatively small feature differences between adjacent frames. In deepfake videos, however, the feature changes between adjacent frames are often more pronounced because the generation model may leave forgery traces such as texture inconsistencies, boundary jitter, or local detail anomalies. By introducing an adaptive temporal consistency constraint mechanism, this scheme can more effectively capture abnormal change patterns between video frames, thereby further improving the model's ability to identify forged videos and enhancing its generalization performance under different forgery methods and cross-data source conditions.
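The adaptive temporal consistency loss can be sketched under stated assumptions, because its exact formula is not recoverable from this text. The sketch assumes the adaptive weight decays exponentially with the L2 difference between adjacent frames (hypothetical parameter `beta`), so that small differences are constrained strongly and large ones weakly, and that the loss averages the weighted squared differences. The names `lam` and `total_loss` are likewise hypothetical.

```python
import numpy as np

def temporal_consistency_loss(H, beta=1.0):
    """Adaptive temporal consistency loss over final-layer frame features H (N, d)."""
    diffs = np.linalg.norm(H[1:] - H[:-1], axis=1)   # ||H_i - H_{i-1}||_2, i = 2..N
    weights = np.exp(-beta * diffs)                  # assumed form of adaptive weight W_i
    return np.mean(weights * diffs ** 2), weights

def total_loss(cls_loss, H, lam=0.1, beta=1.0):
    """Classification loss plus weighted temporal consistency loss."""
    temp_loss, _ = temporal_consistency_loss(H, beta)
    return cls_loss + lam * temp_loss

H_smooth = np.zeros((4, 3))                    # identical frames: no temporal change
H_jump = np.array([[0.0, 0.0], [3.0, 4.0]])    # one large jump of length 5
loss_smooth, w_smooth = temporal_consistency_loss(H_smooth)
loss_jump, w_jump = temporal_consistency_loss(H_jump, beta=0.0)
```

Setting `beta=0.0` disables the adaptation (all weights equal 1), which makes the jump example easy to check by hand; a positive `beta` shrinks the weight on that large jump.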

[0114] In the above scheme, a dynamically weighted graph structure is constructed and input into a hierarchical graph convolutional network for feature propagation, so that node features pass information along the edges of the graph structure and aggregate information from neighboring nodes. Within this graph structure, a temporal dynamic weight factor adaptively adjusts the connection strength between nodes, enabling effective modeling of structural relationships between different video frames. This not only characterizes the correlation features between different key facial regions within the same frame but also captures facial change patterns across frames. During propagation in the hierarchical graph convolutional network, nodes with similar forgery feature patterns form stronger connections in the dynamically weighted graph structure. The model aggregates and propagates this node information through multi-layer graph convolution operations, thereby learning and reinforcing common feature patterns generated by the same or similar forgery techniques, rather than relying solely on the surface features of training samples for discrimination. At the same time, by using the hierarchical graph convolutional network to jointly model region-level and frame-level features, the model can learn the dynamic change patterns between video frames during training, further improving its generalization ability and robustness when facing novel forgery methods or datasets from different sources.

[0115] Furthermore, in step 2, the step of classifying the authenticity of face images based on the final layer feature vector, and determining whether each frame of face image is a fake face image or a real face image, includes:

[0116] Frame-level true / false classification based on the final layer feature vector:

[0117]

[0118] Here, $H_i$ is the final layer feature vector of the $i$-th frame of the face image; $W_c$ is the trainable weight matrix of the classifier; $b_c$ is the bias term; Softmax(·) is the classifier; and $c=1$ denotes the fake face image category. The output is the predicted category of the face image in the $i$-th frame: a predicted category of 1 indicates that the face image in frame $i$ is predicted to be a fake face image, and otherwise it is predicted to be a real face image.

[0119] Furthermore, the method also includes: step 3, calculating the tampering confidence of each key facial region in the forged face image, and calculating the heatmap of the forged face image; generating an overlay image based on the heatmap, and extracting the boundary points of the tampered region through a dual-condition filtering mechanism, thereby locating the tampered region on the overlay image.

[0120] Furthermore, step 3, which involves calculating the tampering confidence of each key facial region in the forged face image and calculating the heatmap of the forged face image, includes:

[0121] If step 2 outputs a forged face image $I_i$, then for each key facial region of this forged face image $I_i$, the tampering confidence $S_{i,t}$ of the $t$-th key facial region is calculated from the region-level node feature vector $P_{i,t}$ output by the graph convolutional network:

[0122] $$S_{i,t} = \mathrm{Sigmoid}(W_s P_{i,t} + b_s)$$

[0123] Here, Sigmoid(·) is the Sigmoid function; $W_s$ is a trainable weight matrix; $P_{i,t}$ is the feature vector of the region-level node corresponding to the $t$-th key facial region in the $i$-th frame face image, taken as the feature of the $t$-th region node in the final layer feature vector $H_i$ output by the last layer of the region-level graph structure; and $b_s$ is the bias matrix.

[0124] The tampering confidence $S_{i,t}$ of each key facial region of the forged face image $I_i$ is multiplied element-wise with the mask matrix $M_{i,t}$ of that key facial region, yielding a tampering contribution map for each key facial region. The tampering contribution maps of the $K$ key facial regions are superimposed and fused at the pixel level to form the heatmap of the forged face image $I_i$:

[0125] $$\mathrm{Heat} = \sum_{t=1}^{K} S_{i,t} \cdot M_{i,t}$$

[0126] Here, Heat is the heatmap of the forged face image $I_i$.

[0127] Furthermore, step 3, the step of generating an overlay image based on the heatmap, includes:

[0128] A color mapping function converts the heatmap Heat into a color image $H_{color}$, and the color image $H_{color}$ is combined with the forged face image $I_i$ by weighted blending to generate the overlay image:

[0129]

[0130] Here, $I_{overlay}$ is the overlay image and α is the transparency parameter.

[0131] Furthermore, step 3, which involves extracting the boundary points of the tampered region using a dual-condition filtering mechanism to locate the tampered region on the overlaid image, includes:

[0132] For each pixel (x, y) in a key facial region, if the pixel simultaneously satisfies the confidence condition and the gradient condition, then pixel (x, y) is extracted as a boundary point of the key facial region. Here, Heat(x, y) is the heatmap value at pixel (x, y); τ is the confidence threshold; the gradient condition is evaluated on the L2 norm of the heatmap gradient vector at (x, y); and the remaining parameter is the gradient magnitude sensitivity parameter;

[0133] The extracted boundary point set $B$ is drawn on the overlay image $I_{overlay}$, which gives the final localization of the tampered region.

[0134] In the above scheme, an end-to-end localization pipeline is designed: node feature matrix → region confidence → pixel heatmap → visualized boundary. The method not only determines the authenticity of face images but also maps the abstract confidence output by the graph convolutional network back to specific key facial regions of the face image (such as the left eye and the corner of the mouth), generating a highlighted heatmap. Furthermore, with a suitable choice of the transparency parameter used for blending and the confidence threshold used for boundary extraction, a clear, continuous, and anatomically reasonable boundary of the tampered region is output. This greatly enhances the interpretability and practicality of the detection results, addresses the lack of spatial localization of forged facial regions and the resulting poor interpretability, and achieves pixel-level interpretable localization output.

[0135] Compared with existing technologies, the beneficial effects of this invention are as follows: By fusing global features of the entire face image with local tamper-sensitivity features based on mask emphasis, the global features maintain the overall rationality and contextual information of the facial structure, while the local features amplify the subtle artifacts generated when key facial areas are tampered with, overcoming the limitations of single feature analysis and achieving more accurate and robust forgery detection. Furthermore, an end-to-end interpretable localization channel is constructed, from abstract features of a graph convolutional network to pixel-level heatmaps. The final layer node features output by the graph convolutional network are converted into region-level confidence scores through learnable parameters. Then, using pre-extracted anatomically accurate facial region masks, the confidence scores are reliably propagated to the corresponding pixel coordinates, generating an intuitive heatmap, thus achieving accurate localization and visualization of the tampered area.

Attached Figure Description

[0136] To more clearly illustrate the technical solutions of the embodiments of the present invention, the accompanying drawings used in the embodiments will be briefly introduced below. It should be understood that the following drawings only show some embodiments of the present invention and should not be regarded as a limitation on the scope. For those skilled in the art, other related drawings can be obtained based on these drawings without creative effort.

[0137] Figure 1 is a schematic diagram of the process of the present invention;

[0138] Figure 2 is a flowchart of the method of the present invention;

[0139] Figure 3 is a schematic diagram illustrating the impact of the k value on network accuracy in an embodiment of the present invention;

[0140] Figure 4(a) is a heatmap of the Intersection over Union (IoU) under different combinations of transparency parameter α and confidence threshold τ in the embodiments of the present invention;

[0141] Figure 4(b) is a heatmap of visual evaluation scores under different combinations of transparency parameter α and confidence threshold τ in the embodiments of the present invention;

[0142] Figure 4(c) is a three-dimensional IoU analysis diagram under different combinations of transparency parameter α and confidence threshold τ in the embodiments of the present invention;

[0143] Figure 4(d) is a comparison of multidimensional parameters under different combinations of transparency parameter α and confidence threshold τ in the embodiments of the present invention.

Detailed Implementation

[0144] The technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings. Obviously, the described embodiments are only some embodiments of the present invention, and not all embodiments. The components of the embodiments of the present invention described and shown in the accompanying drawings can generally be arranged and designed in various different configurations. Therefore, the following detailed description of the embodiments of the present invention provided in the accompanying drawings is not intended to limit the scope of the claimed invention, but merely to illustrate selected embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort are within the scope of protection of the present invention.

[0145] It should be noted that similar reference numerals and letters in the following figures denote similar items; therefore, once an item is defined in one figure, it does not need to be further defined and explained in subsequent figures. Furthermore, in the description of this invention, the terms "first," "second," etc., are used only for distinguishing descriptions and should not be construed as indicating or implying relative importance, or suggesting any such actual relationship or order between these entities or operations. Additionally, the terms "connected," "linked," etc., can refer to a direct connection between elements, components, modules, etc., or an indirect connection via other elements, components, modules, etc.

[0146] Example 1:

[0147] This invention is implemented through the following technical solution. As shown in Figure 1, a video deepfake detection method based on region perception and graph convolutional networks includes the following steps:

[0148] Step 1: Acquire N frames of face images from the face video to be tested, and extract the global feature vector of each frame of face image; perform face region segmentation on each frame of face image to obtain K key facial regions, and extract the local feature vector of each key facial region; extract the weighted region frequency domain features based on the local feature vectors; fuse the global feature vector and the weighted region frequency domain features of the same frame of face image to construct the weighted composite feature matrix of the frame of face image.

[0149] As shown in Figure 2, the global feature vector of each frame of the face image is first extracted:

[0150]

[0151] Here, $I_i$ is the face image of the $i$-th frame, $i=1,2,...,N$; the result is the global feature vector of the $i$-th frame of the face image; and $f_g(\cdot)$ represents the global feature extraction network.

[0152] Next, the Mask R-CNN model is used to perform pixel-level semantic segmentation on the face image $I_i$ of the $i$-th frame to obtain a set of binary masks for the $K$ key facial regions:

[0153]

[0154] Here, $H$ and $W$ are the height and width of the face image $I_i$, and $M_{i,t}$ is the mask matrix of the $t$-th key facial region of the face image $I_i$, where $t=1,2,...,K$;

[0155] The mask matrix $M_{i,t}$ is applied to the face image $I_i$ to obtain the local feature vectors:

[0156]

[0157] Here, the result is the local feature vector of the $t$-th key facial region in the $i$-th frame of the face image, and the product in the formula is element-wise multiplication.

[0158] Then, spatial domain features are extracted based on local feature vectors:

[0159]

[0160] Here, the result is the spatial domain feature of the $t$-th key facial region, and $f_s(\cdot)$ represents the spatial feature extraction network.

[0161] Simultaneously, frequency domain features are extracted from each key facial region based on local feature vectors to obtain the corresponding frequency domain feature representations:

[0162]

[0163] Here, the result is the frequency domain representation of the local feature vector, and DCT is the Discrete Cosine Transform;

[0164] The frequency domain representation is input to the frequency domain feature extraction network to extract frequency domain features:

[0165]

[0166] Here, the result is the frequency domain feature of the $t$-th key facial region, and $f_f(\cdot)$ represents the frequency domain feature extraction network.
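
The masking and DCT steps described above can be illustrated with a minimal sketch. It assumes a single-channel image stored as nested lists and an orthonormal two-dimensional DCT-II; the helper names `apply_mask`, `dct_1d`, and `dct_2d` are hypothetical, and the feature extraction networks themselves are not modeled.

```python
import math

def apply_mask(image, mask):
    """Element-wise product of an image and a binary region mask."""
    return [[p * m for p, m in zip(img_row, mask_row)]
            for img_row, mask_row in zip(image, mask)]

def dct_1d(x):
    """Orthonormal DCT-II of a 1-D sequence (assumed DCT variant)."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[j] * math.cos(math.pi * (j + 0.5) * k / n) for j in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out

def dct_2d(block):
    """Separable 2-D DCT: transform every row, then every column."""
    rows = [dct_1d(r) for r in block]
    cols = [dct_1d(list(c)) for c in zip(*rows)]
    return [list(r) for r in zip(*cols)]

# Toy 4x4 face patch with a mask that keeps the whole patch.
image = [[1.0] * 4 for _ in range(4)]
mask = [[1] * 4 for _ in range(4)]
coeffs = dct_2d(apply_mask(image, mask))
```

For a constant patch all energy falls into the DC coefficient, which is why compression or blending artifacts tend to show up in the remaining coefficients.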

[0167] The spatial domain features and frequency domain features are fused:

[0168]

[0169] Here, $F_{i,t}$ is the fused feature of the $t$-th key facial region, and Concat is the concatenation operation.

[0170] Subsequently, a regional attention mechanism is introduced to weight the fused features of each key facial region:

[0171]

[0172] Here, $a_{i,t}$ is the attention score of the $t$-th key facial region; $w$ is the attention weight; and the superscript in the formula denotes the matrix transpose;

[0173] Weighted region frequency domain features are obtained based on attention scores:

[0174]

[0175] Here, the result is the weighted region frequency domain feature of the $t$-th key facial region.
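
The attention formulas are not reproduced in this text, so the following is only a sketch under stated assumptions: the score of each region is taken as a softmax over the inner products $w^{T}F_{i,t}$, and the weighted feature is the score times the fused feature. The function names and the softmax normalization are assumptions, not the patent's confirmed formulas.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scalars."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def region_attention(fused, w):
    """Assumed form: a_t = softmax_t(w^T F_t); weighted feature = a_t * F_t."""
    scores = [sum(wi * fi for wi, fi in zip(w, f)) for f in fused]
    attn = softmax(scores)
    weighted = [[a * fi for fi in f] for a, f in zip(attn, fused)]
    return attn, weighted

# Two regions with 2-D fused features; w favors the first dimension.
attn, weighted = region_attention([[2.0, 0.0], [0.0, 5.0]], w=[1.0, 0.0])
```

Under this assumption the scores of the K regions sum to one, so regions compete for weight within each frame.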

[0176] Finally, construct the weighted composite feature matrix of the face image in this frame:

[0177]

[0178] Here, the result is the weighted composite feature matrix.

[0179] Step 2: Construct a region-level graph structure and a frame-level graph structure based on the weighted composite feature matrix corresponding to N frames of face images to represent the relationships between different key facial regions and different frames of face images; introduce a temporal dynamic weight mechanism into the constructed frame-level graph structure to characterize the temporal change features between video frames; input the region-level graph structure and the frame-level graph structure into a hierarchical graph convolutional network for feature propagation and relationship modeling to obtain the final layer feature vector corresponding to each frame of face image; classify the face images based on the final layer feature vector to determine whether each frame of face image is a fake face image or a real face image.

[0180] The order in which the region-level graph structure and the frame-level graph structure are constructed in step 2 is not restricted, and they can be constructed simultaneously.

[0181] When constructing the region-level graph structure, the $K$ weighted region frequency domain features contained in the weighted composite feature matrix of the $i$-th frame face image are integrated into a region node feature matrix.

[0182] In the constructed region-level graph structure, each region node $v_{i,t}$ corresponds to the $t$-th key facial region in the $i$-th frame of the face image, and its node feature is represented by the corresponding weighted region feature.

[0183] Calculate the feature similarity between region nodes:

[0184]

[0185] Here, the result is the similarity between the $t$-th region node $v_{i,t}$ and the $s$-th region node $v_{i,s}$ in the $i$-th frame of the face image; the two operands are the node features of region nodes $v_{i,t}$ and $v_{i,s}$; and the superscript in the formula denotes the matrix transpose.

[0186] Construct a region-level adjacency matrix:

[0187]

[0188] Here, the result is the connection weight between region nodes $v_{i,t}$ and $v_{i,s}$ in the $i$-th frame of the face image.

[0189] For each region node, a Top-k filtering strategy is used to retain only the k neighbor nodes with the highest similarity to establish connection edges:

[0190]

[0191] Here, the Top-k operation selects the $k$ largest values in the $t$-th row, and the filtered results are integrated into the region-level adjacency matrix $A_i$.
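
A minimal sketch of the Top-k filtering step follows. It assumes cosine similarity as the similarity measure (the exact formula is not reproduced here) and zeroes every entry of a row that is not among its k largest; `similarity_matrix` and `top_k_rows` are hypothetical names.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def similarity_matrix(features):
    """Pairwise similarity between all node feature vectors."""
    return [[cosine_similarity(fi, fj) for fj in features] for fi in features]

def top_k_rows(sim, k):
    """Keep the k largest entries of each row and set the rest to zero."""
    filtered = []
    for row in sim:
        order = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        keep = set(order[:k])
        filtered.append([v if j in keep else 0.0 for j, v in enumerate(row)])
    return filtered

sim = [[0.9, 0.1, 0.5],
       [0.2, 0.8, 0.3],
       [0.4, 0.6, 0.7]]
adjacency = top_k_rows(sim, k=2)
```

Row-wise filtering can produce an asymmetric adjacency matrix; whether the result is symmetrized afterwards is not stated in the text.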

[0192] Finally, the region-level graph structure of the i-th frame face image is constructed:

[0193]

[0194] Here, the result is the region-level graph structure of the $i$-th frame of the face image; $V_i$ is the feature matrix of the region nodes; $E_i$ is the set of edges between region nodes; and $A_i$ is the region-level adjacency matrix.

[0195] When constructing the frame-level graph structure, the weighted composite feature matrices corresponding to the $N$ frames of face images are integrated into a node feature matrix.

[0196] In the constructed dynamically weighted graph structure, each node $v_i$ corresponds to one frame of the face image in the face video to be tested, and its node feature is represented by the corresponding weighted composite feature matrix.

[0197] A temporal dynamic weighting mechanism is introduced to describe the degree of change between the features of different video frames:

[0198]

[0199] Here, the result is the dynamic weight from node $v_i$ to node $v_j$; the adjustment parameter controls the degree of influence of the dynamic weights on the frame-level graph structure; the two operands are the node features of node $v_i$ and node $v_j$; and the norm in the formula is the L2 norm of a vector.

[0200] Calculate the feature similarity between nodes:

[0201]

[0202] Here, $Q_{i,j}$ is the feature similarity between node $v_i$ and node $v_j$.

[0203] Construct a frame-level adjacency matrix using dynamic weights:

[0204]

[0205] Here, $A_{i,j}$ is the connection weight between node $v_i$ and node $v_j$.
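
Because the dynamic weight and adjacency formulas are missing from this text, the sketch below rests entirely on assumptions: the dynamic weight is taken as exp(-γ·‖h_i − h_j‖₂) with γ as the adjustment parameter, the similarity $Q_{i,j}$ as cosine similarity, and the connection weight $A_{i,j}$ as their product. All function names and the exponential form are hypothetical.

```python
import math

def l2_distance(a, b):
    """L2 norm of the difference between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dynamic_weight(h_i, h_j, gamma=1.0):
    """Assumed form: larger feature change between frames gives a smaller weight."""
    return math.exp(-gamma * l2_distance(h_i, h_j))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na > 0 and nb > 0 else 0.0

def frame_adjacency(features, gamma=1.0):
    """Assumed combination: A[i][j] = dynamic weight * feature similarity."""
    n = len(features)
    return [[dynamic_weight(features[i], features[j], gamma)
             * cosine_similarity(features[i], features[j])
             for j in range(n)] for i in range(n)]

# Frames 0 and 1 are identical; frame 2 differs from both.
frames = [[1.0, 0.0], [1.0, 0.0], [1.0, 1.0]]
A = frame_adjacency(frames, gamma=1.0)
```

With this choice, identical frames are connected with weight 1 and the weight decays as the feature change between frames grows; γ sets how quickly it decays.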

[0206] The Top-k filtering strategy is used to filter node connection relationships:

[0207]

[0208] Here, the Top-k operation selects the $k$ largest values in the $i$-th row, and the filtered results are integrated into the frame-level adjacency matrix $A$.

[0209] In one feasible implementation, this embodiment further verifies the optimal choice of the value of k. Referring to Figure 3, the choice was verified through experiments on the FaceForensics++ dataset (FF++). Figure 3(a) shows the detection accuracy at different k values; the detection accuracy reaches 99.35% when k=10. Figure 3(b) shows the computation time for different k values; when k=10, the computation time is about 58.7 ms, which is relatively short. Figure 3(c) shows the memory usage for different k values; when k=10, the memory usage is approximately 486 MB, which is relatively small. Therefore, in this embodiment, k=10 is preferred, as it yields the best model performance.

[0210] Finally, the frame-level graph structure is constructed:

[0211]

[0212] Where V is the node feature matrix; E is the set of edges between nodes; and A represents the frame-level adjacency matrix.

[0213] The region-level graph structure is input into the hierarchical graph convolutional network. At each layer, the region-level graph structure is computed as follows:

[0214]

[0215] Here, the two node feature matrices in the formula are the region-level node feature matrices of the next layer and of the current layer, respectively; the normalized region-level adjacency matrix of the region-level graph structure is obtained from the region-level adjacency matrix $A_i$; each layer has its own trainable weight matrix for the region-level graph structure; and the activation is a non-linear activation function.

[0216] A self-connection term is introduced into the region-level adjacency matrix $A_i$ to obtain:

[0217]

[0218] Here, the result is the region-level adjacency matrix with self-loops, and $I$ is the identity matrix.

[0219] The degree matrix $D_i$ is calculated, whose diagonal elements are the row sums of the region-level adjacency matrix with self-loops. Finally, symmetric normalization is performed to obtain the normalized region-level adjacency matrix of the region-level graph structure:

[0220]

[0221] The frame-level graph structure is input into the hierarchical graph convolutional network. At each layer, the frame-level graph structure is computed as follows:

[0222]

[0223] Here, the two node feature matrices in the formula are the frame-level node feature matrices of the next layer and of the current layer, respectively; the normalized frame-level adjacency matrix of the frame-level graph structure is obtained from the frame-level adjacency matrix $A$; and each layer has its own trainable weight matrix for the frame-level graph structure.

[0224] A self-connection term is introduced into the frame-level adjacency matrix $A$ to obtain:

[0225]

[0226] Here, the result is the frame-level adjacency matrix with self-loops, and $I$ is the identity matrix.

[0227] The degree matrix $D$ is calculated, whose diagonal elements are the row sums of the frame-level adjacency matrix with self-loops. Finally, symmetric normalization is performed to obtain the normalized frame-level adjacency matrix of the frame-level graph structure:

[0228]
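
The self-loop, degree, and symmetric normalization steps match the commonly used graph convolution propagation rule, so a minimal sketch can be given under that assumption: the adjacency with self-loops is A + I, the degree is its row sum, the normalized adjacency is D^(-1/2)(A + I)D^(-1/2), and one layer computes the activation of (normalized adjacency × features × weights). ReLU is an assumed activation, and the function names are hypothetical.

```python
import math

def matmul(a, b):
    """Plain matrix product of two nested-list matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def normalize_adjacency(adj):
    """Add self-loops, then apply symmetric degree normalization."""
    n = len(adj)
    a_tilde = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    inv_sqrt_deg = [1.0 / math.sqrt(sum(row)) for row in a_tilde]
    return [[inv_sqrt_deg[i] * a_tilde[i][j] * inv_sqrt_deg[j] for j in range(n)]
            for i in range(n)]

def gcn_layer(a_hat, h, w):
    """One propagation step: ReLU(a_hat @ h @ w); ReLU is an assumption."""
    z = matmul(matmul(a_hat, h), w)
    return [[max(0.0, v) for v in row] for row in z]

# Two connected nodes with one-hot features and an identity weight matrix.
adj = [[0.0, 1.0], [1.0, 0.0]]
a_hat = normalize_adjacency(adj)
h_next = gcn_layer(a_hat, h=[[1.0, 0.0], [0.0, 1.0]], w=[[1.0, 0.0], [0.0, 1.0]])
```

The same two functions would apply to both the region-level and the frame-level graph, with separate adjacency and weight matrices for each.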

[0229] By fusing features from the two levels, a more comprehensive node representation can be obtained:

[0230]

[0231] Here, $H_{r,i}$ is the region-level feature corresponding to the $i$-th frame of the face image, output by the $L$-th layer of the region-level graph structure; $H_{f,i}$ is the frame-level feature corresponding to the $i$-th frame of the face image, output by the $L$-th layer of the frame-level graph structure; both the region-level graph structure and the frame-level graph structure have $L$ layers; and $H_i$ is the final layer feature vector of the $i$-th frame face image.

[0232] Frame-level true / false classification based on the final layer feature vector:

[0233]

[0234] Here, $H_i$ is the final layer feature vector of the $i$-th frame of the face image; $W_c$ is the trainable weight matrix of the classifier; $b_c$ is the bias term; Softmax(·) is the classifier; and $c=1$ denotes the fake face image category. The output is the predicted category of the face image in the $i$-th frame: a predicted category of 1 indicates that the face image in frame $i$ is predicted to be a fake face image, and otherwise it is predicted to be a real face image.
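
The frame-level classification can be sketched as a linear layer followed by Softmax, with the predicted category taken as the most probable class. The arg-max decision rule, the two-class layout (index 0 real, index 1 fake), and the name `classify_frame` are assumptions made for illustration.

```python
import math

def classify_frame(h, w_c, b_c):
    """Return (predicted class, class probabilities) for one frame feature h."""
    logits = [sum(wk * hk for wk, hk in zip(row, h)) + b
              for row, b in zip(w_c, b_c)]
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    pred = max(range(len(probs)), key=lambda c: probs[c])
    return pred, probs

# Toy 2-D feature; row 0 of w_c scores "real", row 1 scores "fake".
pred, probs = classify_frame(h=[1.0, 2.0],
                             w_c=[[1.0, 0.0], [0.0, 1.0]],
                             b_c=[0.0, 0.0])
label = "fake" if pred == 1 else "real"
```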

[0235] Step 3: Calculate the tampering confidence of each key facial region in the forged face image and calculate the heat map of the forged face image; generate an overlay image based on the heat map, and extract the boundary points of the tampered region through a dual-condition filtering mechanism, thereby locating the tampered region on the overlay image.

[0236] If step 2 outputs a forged face image $I_i$, then for each key facial region of this forged face image $I_i$, the tampering confidence $S_{i,t}$ of the $t$-th key facial region is calculated from the region-level node feature vector $P_{i,t}$ output by the graph convolutional network:

[0237] $$S_{i,t} = \mathrm{Sigmoid}(W_s P_{i,t} + b_s)$$

[0238] Here, Sigmoid(·) is the Sigmoid function; $W_s$ is a trainable weight matrix; $P_{i,t}$ is the feature vector of the region-level node corresponding to the $t$-th key facial region in the $i$-th frame face image, taken as the feature of the $t$-th region node in the final layer feature vector $H_i$ output by the last layer of the region-level graph structure; and $b_s$ is the bias matrix.

[0239] The tampering confidence $S_{i,t}$ of each key facial region of the forged face image $I_i$ is multiplied element-wise with the mask matrix $M_{i,t}$ of that key facial region, yielding a tampering contribution map for each key facial region. The tampering contribution maps of the $K$ key facial regions are superimposed and fused at the pixel level to form the heatmap of the forged face image $I_i$:

[0240] $$\mathrm{Heat} = \sum_{t=1}^{K} S_{i,t} \cdot M_{i,t}$$

[0241] Here, Heat is the heatmap of the forged face image $I_i$.
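
A minimal sketch of the confidence and heatmap computation follows. It assumes that each region confidence is a scalar obtained from a weight vector and scalar bias, and that "superimposed" means a pixel-wise sum of the per-region contribution maps; `region_confidence` and `build_heatmap` are hypothetical names.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def region_confidence(p, w_s, b_s):
    """Tampering confidence of one region from its node feature vector."""
    return sigmoid(sum(w * x for w, x in zip(w_s, p)) + b_s)

def build_heatmap(confidences, masks):
    """Sum of confidence-scaled binary masks (assumed fusion rule)."""
    height, width = len(masks[0]), len(masks[0][0])
    heat = [[0.0] * width for _ in range(height)]
    for s, mask in zip(confidences, masks):
        for y in range(height):
            for x in range(width):
                heat[y][x] += s * mask[y][x]
    return heat

# Two disjoint 2x2 region masks with confidences 0.9 and 0.2.
masks = [[[1, 0], [0, 0]],
         [[0, 1], [0, 0]]]
heat = build_heatmap([0.9, 0.2], masks)
```

Pixels outside every region mask stay at zero, so the heatmap is nonzero only inside the segmented key facial regions.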

[0242] A color mapping function converts the heatmap Heat into a color image $H_{color}$, and the color image $H_{color}$ is combined with the forged face image $I_i$ by weighted blending to generate the overlay image:

[0243]

[0244] Here, $I_{overlay}$ is the overlay image and α is the transparency parameter.
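
The blending formula is not reproduced in this text. The sketch below assumes the common form in which α weights the color image and (1 − α) weights the face image, and uses a simple blue-to-red map as a stand-in for the unspecified color mapping function; both choices and the function names are assumptions.

```python
def colormap(value):
    """Hypothetical color map: 0.0 maps to blue, 1.0 maps to red (RGB)."""
    v = min(1.0, max(0.0, value))
    return (int(round(255 * v)), 0, int(round(255 * (1.0 - v))))

def blend_overlay(heat, image, alpha=0.6):
    """Assumed blend: overlay = alpha * color + (1 - alpha) * image, per channel."""
    overlay = []
    for heat_row, img_row in zip(heat, image):
        row = []
        for value, pixel in zip(heat_row, img_row):
            color = colormap(value)
            row.append(tuple(int(round(alpha * c + (1.0 - alpha) * p))
                             for c, p in zip(color, pixel)))
        overlay.append(row)
    return overlay

# One row with a hot pixel and a cold pixel over a uniform gray face image.
heat = [[1.0, 0.0]]
image = [[(100, 100, 100), (100, 100, 100)]]
overlay = blend_overlay(heat, image, alpha=0.6)
```

Under this assumed form, a larger α makes the heatmap colors dominate the face image and a smaller α lets the face image show through.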

[0245] For each pixel (x, y) in a key facial region, if the pixel simultaneously satisfies the confidence condition and the gradient condition, then pixel (x, y) is extracted as a boundary point of the key facial region. Here, Heat(x, y) is the heatmap value at pixel (x, y); τ is the confidence threshold; the gradient condition is evaluated on the L2 norm of the heatmap gradient vector at (x, y); and the remaining parameter is the gradient magnitude sensitivity parameter.

[0246] The extracted boundary point set $B$ is drawn on the overlay image $I_{overlay}$, which gives the final localization of the tampered region.
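
The two filtering conditions are missing from this text, so the sketch below assumes one plausible reading: a pixel is a boundary point when its heatmap value is at least τ and the L2 norm of the heatmap gradient is at least a sensitivity parameter, here called `lam`. The central-difference gradient, the direction of both inequalities, and the name `lam` are assumptions.

```python
import math

def gradient_magnitude(heat, x, y):
    """L2 norm of a central-difference gradient, clamped at the image border."""
    height, width = len(heat), len(heat[0])
    gx = (heat[y][min(x + 1, width - 1)] - heat[y][max(x - 1, 0)]) / 2.0
    gy = (heat[min(y + 1, height - 1)][x] - heat[max(y - 1, 0)][x]) / 2.0
    return math.hypot(gx, gy)

def boundary_points(heat, tau=0.7, lam=0.1):
    """Keep pixels that pass both the confidence and the gradient condition."""
    points = []
    for y in range(len(heat)):
        for x in range(len(heat[0])):
            if heat[y][x] >= tau and gradient_magnitude(heat, x, y) >= lam:
                points.append((x, y))
    return points

# A single row with a step edge between low and high confidence.
heat = [[0.0, 0.0, 0.9, 0.9, 0.9]]
B = boundary_points(heat, tau=0.7, lam=0.1)
```

The confidence condition discards low-confidence pixels and the gradient condition discards the flat interior of a high-confidence region, so only pixels on the edge of the region remain.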

[0247] Figure 4(a) shows the Intersection over Union (IoU) heatmap under different combinations of transparency parameter α and confidence threshold τ. Darker colors indicate a higher IoU, meaning higher localization accuracy. When α=0.6 and τ=0.7, the IoU reaches 78.9%. Figure 4(b) shows the Visual Assessment Score (VAS) heatmap under the same parameter combinations; when α=0.6 and τ=0.7, the VAS reaches 4.7 / 5.0. Figure 4(c) shows the three-dimensional IoU analysis under the same parameter combinations; the IoU is highest when α=0.6 and τ=0.7. Figure 4(d) shows the multi-dimensional parameter comparison under the same parameter combinations; when α=0.6 and τ=0.7, both the IoU and the Visual Quality Score reach their highest values, each marked with a red five-pointed star.

[0248] In summary, the parameter combination of α=0.6 and τ=0.7 strikes the best balance between accurately depicting the boundaries of forged regions and producing a sharp overlay map. Specifically, lower α values (e.g., α=0.5) make the boundaries of forged regions less prominent, while higher α values (e.g., α=0.7) excessively blur the face image. Similarly, lower τ values (e.g., τ=0.6) introduce noisy boundaries, while higher τ values (e.g., τ=0.8) result in discontinuous contours.

[0249] With suitable values of the transparency parameter α and confidence threshold τ, the transparency blending and edge enhancement operations further highlight the boundary and spatial distribution of the tampered area, improving the interpretability of the visualization. The dual-condition filtering mechanism effectively overcomes the edge fragmentation and noise sensitivity of traditional threshold segmentation, and the gradient constraint ensures the continuity and anatomical plausibility of the boundary.

[0250] Example 2:

[0251] To verify the accuracy and universality of this technical solution, this embodiment uses various types of datasets to verify the video forgery detection (RAGE-GCN) model trained by this method.

[0252] This embodiment uses the publicly available FaceForensics++ dataset (FF++) to test the effectiveness of the RAGE-GCN model in deepfake detection. FF++ is a large-scale benchmark containing over 1000 original video sequences and provides three compression levels: original, high-quality (HQ, C23), and low-quality (LQ, C40). Four automatic face manipulation methods were used to forge the faces in the dataset: Deepfakes (DF), Face2Face (F2F), FaceSwap (FS), and NeuralTextures (NT). To test generalization, the RAGE-GCN model was also evaluated on Celeb-DeepFake (Celeb-DF) and the Deepfake Detection Challenge (DFDC) dataset. Celeb-DF aims to achieve higher visual realism than earlier datasets. It contains 590 real videos and 5639 fake videos generated by an improved face-swapping algorithm that addresses common visual artifacts such as color mismatch and temporal flicker. DFDC is a large-scale dataset with a wide range of manipulation methods and highly diverse subjects and environmental conditions. Its training set includes over 100,000 videos, while the test set contains 10,000 videos.

[0253] Accuracy (ACC) is used as the basic metric to measure the overall classification ability of the RAGE-GCN model between real and fake videos. The formula for calculating accuracy is:

[0254] $$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

[0255] In this system, TP indicates that a fake sample was correctly identified as fake, TN indicates that a real sample was correctly identified as real, FP indicates that a real sample was misidentified as fake, and FN indicates that a fake sample was misidentified as real. ACC reflects the overall predictive accuracy of the model, but its reliability may be affected in imbalanced scenarios, therefore it needs to be evaluated in conjunction with other metrics.

[0256] To further evaluate the model's stability and generalization ability under different discrimination thresholds, this embodiment introduces the area under the ROC curve (AUC) as a key evaluation metric. The AUC value ranges from 0 to 1, with 0.5 representing random guessing. A value closer to 1 indicates that the model can stably distinguish between real and fake videos under different thresholds, demonstrating stronger anti-interference capabilities. The ROC curve is plotted with the false positive rate (FPR) on the horizontal axis and the true positive rate (TPR) on the vertical axis, based on different thresholds.

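[0257] $$\mathrm{TPR} = \frac{TP}{TP + FN}, \qquad \mathrm{FPR} = \frac{FP}{FP + TN}$$

[0258] AUC measures the overall discriminative power of a model by calculating the area enclosed by the ROC curve and the coordinate axes. Its mathematical formula can be expressed as:

[0259] $$\mathrm{AUC} = \int_{0}^{1} \mathrm{TPR} \, d(\mathrm{FPR})$$

[0260] In the discrete case, AUC is usually approximated using the trapezoidal rule:

[0261] $$\mathrm{AUC} \approx \sum_{i} \frac{(\mathrm{FPR}_{i+1} - \mathrm{FPR}_{i})(\mathrm{TPR}_{i} + \mathrm{TPR}_{i+1})}{2}$$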

[0262] Where i represents the i-th point on the ROC curve in the discrete case.
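
The two metrics can be illustrated with a short sketch. It assumes label 1 for fake and 0 for real, builds ROC points by sweeping every distinct score as a threshold, and integrates with the trapezoidal rule; the function names and the toy scores are illustrative only and are not results from the experiments.

```python
def accuracy(tp, tn, fp, fn):
    """ACC = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

def roc_points(scores, labels):
    """(FPR, TPR) pairs from sweeping each distinct score as a threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    points = [(0.0, 0.0)]
    for threshold in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= threshold and l == 0)
        points.append((fp / negatives, tp / positives))
    return points

def auc_trapezoid(points):
    """Trapezoidal-rule area under a list of (FPR, TPR) points."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area

# Toy scores: one fake sample (0.4) is ranked below one real sample (0.6).
scores = [0.9, 0.4, 0.6, 0.1]
labels = [1, 1, 0, 0]
auc = auc_trapezoid(roc_points(scores, labels))
```

In this toy example three of the four fake/real pairs are ranked correctly, which matches the resulting area of 0.75.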

[0263] The higher the AUC value, the more capable the model is of consistently distinguishing between real and fake videos under various thresholds, and its evaluation results are not affected by a single threshold. Therefore, it is more suitable for comparing the performance of models across datasets.

[0264] Experiments were conducted on the low-quality (LQ) FF++ dataset, whose videos were forged with the four manipulation methods. The proposed scheme (RAGE-GCN) was compared with 10 other traditional schemes (MesoNet, MesoInception, EfficientNet-B, Xception, F³-Net, RFM, GFF, SPSL, PRRNET, and UIA-ViT*) on the detection of these forged videos, and the evaluation results are shown in Table 1. Table 1 demonstrates that the RAGE-GCN model of this invention is highly robust to heavily compressed forged face images. Traditional methods show significant performance degradation under low-quality conditions, whereas the method of this invention maintains high accuracy and AUC across all forgery types (DF, F2F, FS, NT): 99.35% ACC and 99.53% AUC on DF, 99.45% ACC and 96.27% AUC on F2F, 96.37% ACC and 96.48% AUC on FS, and 94.76% ACC and 92.57% AUC on NT.

[0265] Table 1. Evaluation results of various schemes on forged videos from the low-quality FF++ dataset.

[0266]

[0267] Experiments were conducted on the high-quality (HQ) and low-quality (LQ) FF++ datasets, using the scheme of this invention and 11 other traditional schemes (MesoNet, Steg.Feature, EfficientNet-B4, Xception, Xception-ELA, MADD, Face-X-ray, DSP-FWA, F³-Net, Two-stream, and SPSL) to detect forged videos at high quality (C23) and low quality (C40); the evaluation results are shown in Table 2. As shown in Table 2, the method of this invention achieved an ACC of 98.43% and an AUC of 99.82% in the HQ (C23) setting, and an ACC of 94.12% and an AUC of 96.45% in the LQ (C40) setting, demonstrating significant resilience to quality degradation. The key advantage of the RAGE-GCN of this invention lies in its ability to maintain detection performance across quality variations, whereas many existing methods exhibit significant limitations. Face-X-ray degrades sharply in the LQ setting (AUC decreases from 86.43% to 61.60%), and the AUC of DSP-FWA remains low in both settings (57.50% and 62.30%), indicating reliance on quality-dependent features that are susceptible to compression artifacts. Similarly, methods such as Steg.Feature and Xception-ELA exhibit inconsistent performance patterns across quality levels, suggesting that their feature representations are not robust to the information loss introduced by compression.

[0268] Table 2. Evaluation results on the high-quality (C23) and low-quality (C40) FF++ datasets.

[0269]

[0270] This embodiment further verifies the generalization ability of the proposed RAGE-GCN model by evaluating it on the existing Celeb-DF-V1 dataset. The AUC of the proposed scheme and 14 other traditional schemes (Two-stream, Meso4, MesoInception4, FWA, Xception-raw, Xception-c23, Xception-c40, Multi-task, Capsule, DSP-FWA, F³-Net, EfficientNet-B4, SPSL, and RECCE*) was evaluated on the FF++ and Celeb-DF-V1 datasets, respectively, and the results are shown in Table 3. As shown in Table 3, the RAGE-GCN model of this invention achieved near-perfect performance (99.88% AUC) on the FF++ dataset. More notably, it generalized better on the challenging Celeb-DF-V1 dataset, achieving an AUC of 73.39% and exceeding the second-best approach by approximately 5 percentage points. This highlights the importance of combining fine-grained region analysis with structural relationship modeling for constructing a deepfake detector that can effectively generalize to real-world scenarios with different manipulation techniques.

[0271] Table 3. AUC evaluation results (%) for the FF++ and Celeb-DF-V1 datasets.

[0272]

[0273] To further evaluate the generalization ability of the proposed RAGE-GCN model, cross-dataset experiments were conducted on the challenging Deepfake Detection Challenge (DFDC) dataset using the proposed method and four other existing methods (Meso4, Recurrent-network, FWA, and MLDG). The evaluation results are shown in Table 4. As shown in Table 4, the proposed method achieves an accuracy of 72.19% and an AUC of 71.34%, significantly outperforming the compared methods.

[0274] Table 4. Cross-dataset evaluation results on the DFDC dataset.

[0275]

[0276] The above results demonstrate that the video forgery detection model (RAGE-GCN) trained in this invention outperforms the compared methods under different compression quality levels and in cross-dataset evaluation.

[0277] The above description is merely a specific embodiment of the present invention, but the scope of protection of the present invention is not limited thereto. Any variations or substitutions that can be easily conceived by those skilled in the art within the technical scope disclosed in the present invention should be included within the scope of protection of the present invention. Therefore, the scope of protection of the present invention should be determined by the scope of the claims.

Claims

1. A video deepfake detection method based on region perception and graph convolutional networks, characterized in that it comprises the following steps: Step 1: acquiring N frames of face images from the face video to be tested, and extracting the global feature vector of each frame of face image; performing face region segmentation on each frame of face image to obtain K key facial regions, and extracting the local feature vector of each key facial region; extracting weighted region frequency domain features based on the local feature vectors; and fusing the global feature vector and the weighted region frequency domain features of the same frame of face image to construct the weighted composite feature matrix of that frame of face image; Step 2: constructing a region-level graph structure and a frame-level graph structure based on the weighted composite feature matrices corresponding to the N frames of face images, to represent the relationships between different key facial regions and between face images in different frames; introducing a temporal dynamic weight mechanism into the constructed frame-level graph structure to characterize the temporal variation between video frames; inputting the region-level graph structure and the frame-level graph structure into a hierarchical graph convolutional network for feature propagation and relation modeling to obtain the final layer feature vector corresponding to each frame of face image; and classifying the authenticity of the face images based on the final layer feature vector, so that each frame of face image is determined to be a fake face image or a real face image.

2. The video deepfake detection method based on region perception and graph convolutional networks according to claim 1, characterized in that, in Step 1, the step of extracting the global feature vector of each frame of face image comprises: $f_i^{g} = g(I_i)$, where $I_i$ is the face image of the i-th frame, $i = 1, 2, \dots, N$; $f_i^{g}$ is the global feature vector of the i-th frame of face image; and $g(\cdot)$ represents the global feature extraction network.

3. The video deepfake detection method based on region perception and graph convolutional networks according to claim 1, characterized in that, in Step 1, the step of performing face region segmentation on each frame of face image to obtain K key facial regions and extracting the local feature vector of each key facial region comprises: using a Mask R-CNN model to perform pixel-level semantic segmentation on the i-th frame face image $I_i$ to obtain a set of binary masks for the K key facial regions: $\{M_{i,t}\}_{t=1}^{K}$, $M_{i,t} \in \{0,1\}^{H \times W}$, where $H$ and $W$ are the height and width of the face image $I_i$, respectively, and $M_{i,t}$ represents the mask matrix of the t-th key facial region in the face image $I_i$, $t = 1, 2, \dots, K$; and applying the mask matrix $M_{i,t}$ to the face image $I_i$ to obtain the local feature vector: $x_{i,t} = M_{i,t} \odot I_i$, where $x_{i,t}$ represents the local feature vector of the t-th key facial region in the i-th frame of face image, and $\odot$ indicates element-wise multiplication.

4. The video deepfake detection method based on region perception and graph convolutional networks according to claim 1, characterized in that, in Step 1, the step of extracting weighted region frequency domain features based on the local feature vectors comprises: extracting spatial domain features based on the local feature vectors: $f_{i,t}^{s} = s(x_{i,t})$, where $f_{i,t}^{s}$ represents the spatial domain features of the t-th key facial region; $s(\cdot)$ represents the spatial feature extraction network; and $x_{i,t}$ represents the local feature vector of the t-th key facial region in the i-th frame of face image; transforming each key facial region into the frequency domain based on the local feature vectors to obtain the corresponding frequency domain representation: $X_{i,t} = \mathrm{DCT}(x_{i,t})$, where $X_{i,t}$ is the frequency domain representation of the local feature vector $x_{i,t}$, and DCT is the discrete cosine transform; inputting the frequency domain representation $X_{i,t}$ into the frequency domain feature extraction network to extract frequency domain features: $f_{i,t}^{f} = f(X_{i,t})$, where $f_{i,t}^{f}$ represents the frequency domain features of the t-th key facial region, and $f(\cdot)$ represents the frequency domain feature extraction network; fusing the spatial domain features and the frequency domain features: $F_{i,t} = \mathrm{Concat}(f_{i,t}^{s}, f_{i,t}^{f})$, where $F_{i,t}$ is the fusion feature of the t-th key facial region, and Concat is the concatenation operation; introducing a region attention mechanism to weight the fusion features of each key facial region, in which the attention score $a_{i,t}$ of the t-th key facial region is computed from $w^{\top} F_{i,t}$, where $w$ represents the attention weight and $\top$ indicates matrix transpose; and obtaining the weighted region frequency domain features based on the attention scores: $\tilde{F}_{i,t} = a_{i,t} F_{i,t}$, where $\tilde{F}_{i,t}$ represents the weighted region frequency domain feature of the t-th key facial region.

5. The video deepfake detection method based on region perception and graph convolutional networks according to claim 4, characterized in that, in Step 1, the step of fusing the global feature vector and the weighted region frequency domain features of the same frame of face image to construct the weighted composite feature matrix of that frame of face image comprises: combining the global feature vector of the i-th frame face image with the K weighted region frequency domain features of that frame to form the weighted composite feature matrix of the i-th frame face image.

6. The video deepfake detection method based on region perception and graph convolutional networks according to claim 5, characterized in that, in Step 2, the step of constructing the region-level graph structure comprises: integrating the K weighted region frequency domain features contained in the weighted composite feature matrix of the i-th frame face image into a region node feature matrix; in the constructed region-level graph structure, each region node $v_{i,t}$ corresponds to the t-th key facial region in the i-th frame of face image, and its node feature is the corresponding weighted region feature; calculating the feature similarity between region nodes, namely the similarity between the t-th region node $v_{i,t}$ and the s-th region node $v_{i,s}$ in the i-th frame of face image, from the node features of $v_{i,t}$ and $v_{i,s}$; constructing a region-level adjacency matrix whose entries are the connection weights between the region nodes $v_{i,t}$ and $v_{i,s}$ in the i-th frame of face image; for each region node, using a Top-k filtering strategy to retain only the k neighbor nodes with the highest similarity to establish connection edges, where the filtering operation selects the k largest values in the t-th row; integrating the retained connection weights into the region-level adjacency matrix $A_i$; and finally constructing the region-level graph structure of the i-th frame face image: $G_i = (V_i, E_i, A_i)$, where $G_i$ represents the region-level graph structure of the i-th frame of face image; $V_i$ is the region node feature matrix; $E_i$ is the set of edges between region nodes; and $A_i$ is the region-level adjacency matrix.

7. The video deepfake detection method based on region perception and graph convolutional networks according to claim 5, characterized in that, in Step 2, the step of constructing the frame-level graph structure comprises: integrating the weighted composite feature matrices corresponding to the N frames of face images into a node feature matrix; in the constructed dynamic weighted graph structure, each node $v_i$ corresponds to one frame of face image in the face video to be tested, and its node feature is the corresponding weighted composite feature matrix; introducing a temporal dynamic weight mechanism to describe the degree of change between the features of different video frames, in which the dynamic weight from node $v_i$ to node $v_j$ is computed from the node features of $v_i$ and $v_j$ using the vector 2-norm, together with an adjustable parameter that controls the degree of influence of the dynamic weights on the frame-level graph structure; calculating the feature similarity between nodes, where $Q_{i,j}$ represents the feature similarity from node $v_i$ to node $v_j$; constructing a frame-level adjacency matrix using the dynamic weights, where $A_{i,j}$ represents the connection weight from node $v_i$ to node $v_j$; filtering the node connection relationships using a Top-k filtering strategy, where $\mathrm{Topk}(A_{i,:})$ indicates a filtering operation that selects the k largest values in the i-th row; integrating the retained connection weights into the frame-level adjacency matrix $A$; and finally constructing the frame-level graph structure: $G = (V, E, A)$, where $G$ represents the frame-level graph structure; $V$ is the node feature matrix; $E$ is the set of edges between nodes; and $A$ represents the frame-level adjacency matrix.

8. The video deepfake detection method based on region perception and graph convolutional networks according to claim 1, characterized in that, in Step 2, the step of classifying the authenticity of the face images based on the final layer feature vector to determine whether each frame of face image is a fake face image or a real face image comprises: performing frame-level real / fake classification based on the final layer feature vector: $\hat{y}_i = \mathrm{Softmax}(W_c H_i + b_c)$, where $H_i$ represents the final layer feature vector of the i-th frame of face image; $W_c$ is the trainable weight matrix of the classifier; $b_c$ is the bias term; $\mathrm{Softmax}(\cdot)$ is the classifier; and $\hat{y}_i$ is the predicted category of the i-th frame of face image, which indicates either that the face image in frame i is predicted to be a fake face image or that it is predicted to be a real face image.

9. The video deepfake detection method based on region perception and graph convolutional networks according to claim 1, characterized in that the method further comprises Step 3: calculating the tampering confidence of each key facial region in the forged face image and computing the heat map of the forged face image; generating an overlay image based on the heat map; and extracting the boundary points of the tampered region through a dual-condition filtering mechanism, thereby locating the tampered region on the overlay image.
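As an illustration of the graph construction steps recited in claims 6 and 7, the following is a minimal sketch and not the claimed implementation. It assumes cosine similarity as the feature similarity, an exponential decay exp(-lam * ||h_i - h_j||_2) as the temporal dynamic weight, and their product as the connection weight; the claims as reproduced here do not fix these formulas, and all function and parameter names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def dynamic_weight(u, v, lam=1.0):
    """Assumed temporal weight: decays with the 2-norm of the feature difference."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
    return math.exp(-lam * dist)


def topk_adjacency(features, k, lam=None):
    """Build an adjacency matrix that keeps the k largest weights in each row.

    features: list of node feature vectors (regions of one frame, or frames).
    lam: if given, similarities are scaled by the assumed dynamic weight
         (frame-level graph); if None, plain similarity is used (region-level graph).
    """
    n = len(features)
    dense = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # no self-loops in this sketch
            w = cosine(features[i], features[j])
            if lam is not None:
                w *= dynamic_weight(features[i], features[j], lam)
            dense[i][j] = w
    adj = [[0.0] * n for _ in range(n)]
    for i in range(n):
        # Row-wise Top-k: keep only the k neighbors with the largest weight.
        neighbors = sorted((j for j in range(n) if j != i),
                           key=lambda j: dense[i][j], reverse=True)[:k]
        for j in neighbors:
            adj[i][j] = dense[i][j]
    return adj


# Hypothetical 2-D node features for four nodes.
nodes = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
region_adj = topk_adjacency(nodes, k=1)           # region-level style graph
frame_adj = topk_adjacency(nodes, k=2, lam=0.5)   # frame-level style graph
```

Because Top-k is applied row by row, the resulting adjacency matrix is not necessarily symmetric; whether it is symmetrized before graph convolution is a design choice that this sketch leaves open.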