A deep video forgery detection method fusing ViT and spatial features

By integrating ViT with spatial features, a deep video forgery detection method is constructed, which includes a spatial feature extraction network and a channel attention mechanism. Combined with ViT's multi-head self-attention mechanism, this method solves the problems of insufficient detection accuracy and high computational cost in existing technologies, and achieves efficient detection of multiple forgery methods.

CN120126053B · Active · Publication Date: 2026-06-26 · BEIJING UNIV OF TECH

Patent Information

Authority / Receiving Office
CN · China
Patent Type
Patents(China)
Current Assignee / Owner
BEIJING UNIV OF TECH
Filing Date
2025-02-23
Publication Date
2026-06-26

AI Technical Summary

Technical Problem

Existing deep video forgery detection technologies suffer from insufficient detection accuracy and high computational costs when faced with complex scenarios and various forgery methods, making it difficult to effectively distinguish high-quality forged videos.

Method used

A deep video forgery detection method that integrates ViT and spatial features is adopted. By constructing a spatial feature extraction network and a channel attention mechanism, combined with the multi-head self-attention mechanism of ViT, the computational cost is reduced and the detection accuracy is improved. This includes the fusion of convolutional layers and Transformer Encoder.

Benefits of technology

Without increasing computational costs, it significantly improves the accuracy of deep video forgery detection, can cope with a variety of forgery techniques, and enhances the robustness and detection accuracy of the model.

✦ Generated by Eureka AI based on patent content.

Abstract

The application discloses a deep fake video detection method fusing ViT and spatial features, belongs to the technical field of video recognition, and is used for detecting deep fake videos. The application comprises the following steps: constructing a neural network fusing ViT and spatial features; training a deep fake video detection network; acquiring face data information in a deep fake video; inputting processed video information into the trained deep fake video detection network and outputting whether the video belongs to a fake video. The application fully utilizes the advantages of a convolutional neural network in extracting image fake details in deep video fake detection, combines orthogonal convolution, attention mechanism, residual connection and other ideas, maintains the complexity of the model, improves the accuracy of deep fake video detection, and has relatively stable detection performance in videos generated by various fake technologies.

Description

Technical Field

[0001] This invention relates to the fields of machine vision and forgery detection technology, and in particular to a deep video forgery detection method that integrates ViT (Vision Transformer) and spatial features.

Background Technology

[0002] In recent years, artificial intelligence technology has developed rapidly and its applications have become increasingly widespread. In video generation, generative adversarial networks (GANs) can be used to create realistic images and videos. However, deepfake technology is a double-edged sword. While it brings convenience to film and television video generation, industrial content production, restoration of old photos, and video production, it also poses serious challenges to the supervision and management of the rationality and authenticity of video content, as well as the identification of forgeries. With the development of the global internet, the rapid spread and influence of fake videos are becoming increasingly prominent. Especially in highly disseminated environments such as social media and news platforms, deepfake videos can easily have a significant negative impact on public perception and social opinion.

[0003] Among various deep video forgery techniques, deep video face forgery is the most widely used, influential, and advanced, encompassing three branches: face replacement, attribute editing, and face generation. With increasing technological integration and modularization, the barrier to entry for face replacement technology has gradually decreased, acquisition methods have become simpler, and its impact on various fields has broadened. Currently, the research and application of deep video forgery detection has become a crucial issue urgently needing to be addressed by all sectors of society and academia.

[0004] While AI-based face forgery detection technologies have made some progress, the continuous evolution of deep video forgery generation techniques presents numerous challenges for detection. On one hand, the quality of generated face forgery videos is increasingly sophisticated, making it difficult to distinguish forgery details using traditional visual detection methods. On the other hand, videos generated by different forgery methods exhibit complex and diverse spatiotemporal features, limiting the generalization ability and robustness of detection models. Therefore, accurately detecting face forgery videos in complex scenes and environments has become a pressing technical challenge. To address these issues, a deep video forgery detection method integrating ViT and spatial features is proposed, which significantly improves the accuracy of deep forgery detection and can handle various forgery techniques.

Summary of the Invention

[0005] The purpose of this invention is to provide a deep video forgery detection modeling method that integrates ViT and spatial features, which improves the accuracy of deep video forgery detection without increasing computational costs, and can also cope with various forgery techniques.

[0006] The technical solution adopted in this invention is a deep video forgery detection method that integrates ViT and spatial features, specifically including the following steps:

[0007] Step 1: Construct a spatial feature extraction network

[0008] The spatial feature extraction network consists of a spatial inconsistency module and a channel attention mechanism module. It mainly comprises a convolutional layer 1, a network module 1, a convolutional layer 2, a global average pooling layer, and a convolutional layer 3. Each global extraction convolutional layer includes a normalization layer and a non-linear activation layer.

[0009] Furthermore, the modular network plays a role in feature extraction and parameter reduction. In the spatial inconsistency module, feature extraction is performed using one 1×3 and one 3×1 convolutional kernel, while detailed feature extraction is performed using two 1×1 and two 3×3 convolutional kernels. These are connected via skip connections using the residual concept. Average pooling and bilinear interpolation operations are added before and after the 1×3 and 3×1 convolutions, respectively, for dimensionality reduction and dimensionality enhancement, reducing computational cost. To address the fragmentation caused by facial edges in the image, convolution operations are performed in both the horizontal and vertical directions, while downsampling is used to enhance relevant information in the receptive field. More detailed feature extraction is achieved using 3×3 convolutional kernels. Finally, residual connections are used to preserve the original features, effectively preventing performance degradation of convolutional layers.
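
The parameter saving from replacing a square kernel with a pair of orthogonal 1×3 and 3×1 kernels can be checked with simple arithmetic. The following is only an illustrative sketch: the helper name `conv_params` is hypothetical, bias terms are ignored, and the 256-channel width is an assumed value chosen for the example rather than taken from this paragraph.

```python
def conv_params(kernel_h, kernel_w, in_channels, out_channels):
    """Weight count of a standard 2D convolution (bias ignored)."""
    return kernel_h * kernel_w * in_channels * out_channels


channels = 256  # assumed channel width, for illustration only

# One 3x3 convolution versus a 1x3 convolution followed by a 3x1 convolution.
square = conv_params(3, 3, channels, channels)
orthogonal = (conv_params(1, 3, channels, channels)
              + conv_params(3, 1, channels, channels))

print(square, orthogonal, orthogonal / square)
```

Under these assumptions the orthogonal pair needs two thirds of the weights of a single 3×3 kernel with the same channel widths.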

[0010] Furthermore, the feature maps are introduced into the 3×3 convolutional layer to further extract and fuse the three feature information, ensuring the network's fitting performance.

[0011] Furthermore, a lightweight channel attention mechanism is introduced to extract and select features along the channel dimension, and the importance of each channel is additionally modeled. Global pooling is used to transform the two dimensions h and w into one-dimensional scalars, thereby reducing the computational redundancy brought about by channel-dimensional convolution. This allows the network model to enhance the utilization of channel information while reducing the amount of computation, thus improving the fitting ability of the network model.

[0012] Step 2: Construct a network that integrates ViT and spatial feature detection

[0013] The design integrates ViT and a spatial feature extraction network. After extracting video frames, ViT performs a slicing operation, splitting the 224×224 pixel frame into 16 14×14 pixel patches. After flattening, the channel dimension is 3. Using 1D positional encoding information, the 16 images are encoded sequentially.

[0014] Furthermore, the embedded position encoding process maps it to 16 tokens with a channel dimension of 128 through a linear mapping operation, adding the 0th token and incorporating it into the position encoding.
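
The slicing numbers above can be sanity-checked with a few lines of arithmetic. This is a sketch, not the patent's implementation, and `patch_grid` is a hypothetical helper. Note that cutting a 224×224 frame into non-overlapping 14×14 patches yields 16 patches along each side, so the figure "16" may refer to the per-side count; the sketch reports both the per-side and the total count without deciding which reading the text intends.

```python
def patch_grid(image_size, patch_size, channels):
    """Return (patches_per_side, total_patches, flattened_patch_length)."""
    if image_size % patch_size != 0:
        raise ValueError("image size must be divisible by patch size")
    per_side = image_size // patch_size
    total = per_side * per_side
    flattened = patch_size * patch_size * channels
    return per_side, total, flattened


per_side, total, flat_len = patch_grid(image_size=224, patch_size=14, channels=3)
print(per_side, total, flat_len)
```

Each flattened patch would then be passed through the linear mapping to obtain a token, with one extra 0th token prepended to the sequence.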

[0015] Furthermore, in the Transformer Encoder, a spatial feature extraction network structure is added to the Multi-Head Attention mechanism. While the multi-head attention mechanism is processing, spatial feature information is extracted in parallel. The extracted feature weights and channel weights are weighted into the query, key, and value matrix for more detailed feature extraction.

[0016] Step 3: Train and build a network model that integrates ViT and spatial feature detection.

[0017] The specific steps for training the ViT and spatial feature detection network model are as follows: use the Dlib face extractor to extract faces from video frames, save the face sampling point information, and crop the face images of the extracted frames; pre-train the network model on the fake video dataset; divide the dataset into training, validation, and test sets and perform rotation and scaling processing; use the validation set to adjust hyperparameters, and finally use the test set to verify the model performance.

[0018] Furthermore, the steps for dividing the fake video dataset into training, validation, and test sets and then standardizing it are as follows: the dataset is divided into training, validation, and test sets in a 3:1:1 ratio.
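
A 3:1:1 division could be sketched as follows. The function name `split_3_1_1`, the fixed shuffle seed, and the use of integer identifiers as stand-ins for videos are assumptions made for illustration; the patent does not state whether the split is performed per video or per frame.

```python
import random


def split_3_1_1(items, seed=0):
    """Shuffle items and split them into train/val/test parts in a 3:1:1 ratio."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 3 // 5
    n_val = n // 5
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]  # remainder goes to the test set
    return train, val, test


train, val, test = split_3_1_1(range(1000))
print(len(train), len(val), len(test))
```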

[0019] Furthermore, to expand the dataset and prevent overfitting, image augmentation techniques such as random rotation and cropping were employed to augment the training set images. Additionally, during network model training, a loss criterion was used to calculate the loss between the model outputs and the true labels. Adam was selected as the optimizer, supervised by binary cross-entropy loss. After multiple training and validation cycles, it was found that overfitting was likely to occur after 30 epochs. Therefore, the number of training epochs was set to 30, with an initial learning rate of 0.0002, which was decayed by a factor of 10 every 10 training epochs.
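
The learning-rate schedule described here amounts to a step decay. A minimal sketch follows, assuming 0-based epoch indices and that the first decay takes effect at epoch 10; the function name `learning_rate` is hypothetical.

```python
def learning_rate(epoch, initial_lr=0.0002, step=10, factor=10.0):
    """Learning rate for a 0-based epoch under step decay."""
    return initial_lr / (factor ** (epoch // step))


# One value per epoch for the 30 training epochs.
schedule = [learning_rate(epoch) for epoch in range(30)]
print(schedule[0], schedule[10], schedule[20])
```

Under these assumptions the rate is 0.0002 for epochs 0 to 9, then one tenth of that for epochs 10 to 19, and one hundredth for epochs 20 to 29.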

[0020] Step 4: Input the processed video data into the trained ViT and spatial feature detection network. The output of the network model is a probability, which shows the accuracy of the final result in judging whether the video is real or fake.

[0021] Compared with the prior art, the present invention has the following advantages:

[0022] By abandoning the multi-layer convolutional operations of CNN and introducing the multi-head self-attention mechanism of ViT, the problem of high hardware pressure and excessive computation caused by the deep feature extraction layer is effectively solved.

[0023] When using spatial feature extraction networks, splitting convolutional kernels along orthogonal directions effectively reduces the number of model parameters while maintaining the model's extraction capabilities. Integrating channel attention mechanisms and residual connections also ensures the accuracy and stability of the network model.

Attached Figure Description

[0024] Figure 1 is a flowchart of the deep video forgery detection method that integrates ViT and spatial features provided by the present invention.

[0025] Figure 2 is a schematic diagram of the spatial feature extraction network structure of the present invention.

[0026] Figure 3 is a schematic diagram of the deep video forgery detection network structure that integrates ViT and spatial features according to the present invention.

Detailed Implementation

[0027] This invention primarily implements a deep video forgery detection method that integrates ViT and spatial features. The specific method employed in this invention will be described in detail below with reference to the accompanying drawings.

[0028] Specifically, the workflow of the deep video forgery detection method that integrates ViT and spatial features is shown in Figure 1. The process includes the following steps: S1: Construct a spatial feature extraction network. S2: Construct a network fusing ViT and spatial feature detection. S3: Train the model that combines ViT and spatial feature detection. S4: Input the processed video data into the trained network combining ViT and spatial feature detection to determine whether the video is fake.

[0029] For S1: Construct a spatial feature extraction network.

[0030] In this invention, the network structure design of the spatial feature extraction network is shown in Table 1. The spatial feature extraction network consists of a spatial inconsistency module and a channel attention mechanism module, and mainly includes a convolutional layer 1, a network module 1, a convolutional layer 2, a global average pooling layer, and a convolutional layer 3. Each global extraction convolutional layer contains a normalization layer and a nonlinear activation layer.

[0031] Convolutional Layer 1: The input layer of the spatial feature extraction network uses a 5×5 convolutional kernel with a stride of 1 and edge padding. The output has 256 channels. Its purpose is to perform preliminary detail extraction on the input data while preserving image detail information in a low dimension.

[0032] Network module: The structure of the network module is shown in Figure 2. The module consists of three interconnected paths: the top path is a skip connection; the middle path consists of an average pooling layer, a 1×3 convolution, a 3×1 convolution, and bilinear interpolation; and the bottom path consists of a 1×1 convolutional layer, two 3×3 convolutional layers, and a final 1×1 convolutional layer.

In the middle path, the average pooling layer uses a kernel with a stride of 2 and primarily performs downsampling, transforming the low-dimensional feature map into a high-dimensional feature map, which provides a larger receptive field, preserves the main information, and ignores detail information. Then, 1×3 and 3×1 convolutions are performed with a stride of 1 to extract texture features from two orthogonal directions (horizontal and vertical) and detect inconsistencies at image edges. Afterward, bilinear interpolation is used for upsampling, restoring the feature map dimension while preserving weight information.

In the bottom path, the first 1×1 convolution serves as a dimensionality-increasing operation, expanding the input channels (by a factor of 2 in this invention) to allow more in-depth feature extraction in subsequent processing. The two subsequent 3×3 convolutions extract detailed features; this is possible because the h and w values of the feature maps remain unchanged. At the same time, the number of channels is reduced, discarding some feature channels to prevent overfitting and maintain model stability. Finally, a 1×1 convolution performs further dimensionality reduction so that the output dimension matches the input dimension, saving network parameters and facilitating subsequent operations.

The top path uses a residual structure with skip connections to prevent network degradation, allowing for deeper network design and stronger fitting capabilities. The feature maps from the top and middle paths are summed element-wise and then passed through a sigmoid function for normalization, yielding the confidence score of the feature map, which serves as the attention weight. This weight is then multiplied element-wise with the bottom-path feature map, retaining the main feature information.
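
The final gating step of the network module can be illustrated on a single small feature map. The sketch below assumes that the three paths have already produced maps of equal size; the helper names and the toy 2×2 values are invented for illustration and do not come from the patent.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gate(skip_map, middle_map, bottom_map):
    """Element-wise sigmoid(skip + middle) * bottom for equally sized 2D maps."""
    return [
        [sigmoid(s + m) * b for s, m, b in zip(s_row, m_row, b_row)]
        for s_row, m_row, b_row in zip(skip_map, middle_map, bottom_map)
    ]


# Toy single-channel maps (hypothetical values).
skip_map = [[0.0, 1.0], [-1.0, 2.0]]
middle_map = [[0.0, -1.0], [1.0, 2.0]]
bottom_map = [[4.0, 4.0], [4.0, 4.0]]

out = gate(skip_map, middle_map, bottom_map)
print(out)
```

Positions where the summed skip and middle responses are large keep most of the bottom-path value, while positions where the sum is near zero keep about half of it.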

[0033] Convolutional layer 2: The above feature map is convolved with a 3×3 convolutional kernel with a stride of 1. This is mainly used to further extract the details of the extracted feature map and fuse the three-way information.

[0034] Global average pooling layer: The kernel size of the global average pooling layer is 14×14. It averages the 14×14 resolution matrix output by the front-end network to reduce the dimension to a 1×1 scalar. Then it concatenates these matrices into a new two-dimensional matrix, which has two dimensions: the scalar of each layer and the number of channels.

[0035] Convolutional layer 3: Features are extracted along the channel dimension through convolution with a kernel of size 1×3. Since the preceding operations operate independently on the feature maps of each channel and do not operate on the correlation between channel layers, channel features are extracted, then normalized by the sigmoid function, and then element-wise matrix multiplication is performed with the channel matrix to obtain the weights of the channel attention mechanism, which are used as the output of the final feature matrix.
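
The channel attention path of paragraphs [0034] and [0035] (global average pooling, a size-3 convolution along the channel axis, sigmoid normalization, and channel-wise rescaling) might be sketched as below. This is a toy illustration under stated assumptions: three channels of 2×2 maps stand in for 256 channels of 14×14 maps, zero padding is assumed at the ends of the channel axis, and the fixed kernel `[0.0, 1.0, 0.0]` replaces learned weights so that the output is easy to follow. All function names are hypothetical.

```python
import math


def global_average_pool(feature_maps):
    """Reduce each channel's h x w map to a single scalar."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


def conv1d_same(values, kernel):
    """1D cross-correlation along the channel axis with zero padding."""
    pad = len(kernel) // 2
    padded = [0.0] * pad + list(values) + [0.0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(values))
    ]


def channel_attention(feature_maps, kernel):
    """Return per-channel weights and the rescaled feature maps."""
    pooled = global_average_pool(feature_maps)
    weights = [1.0 / (1.0 + math.exp(-v)) for v in conv1d_same(pooled, kernel)]
    scaled = [
        [[w * x for x in row] for row in fmap]
        for w, fmap in zip(weights, feature_maps)
    ]
    return weights, scaled


maps = [
    [[1.0, 1.0], [1.0, 1.0]],
    [[2.0, 2.0], [2.0, 2.0]],
    [[0.0, 0.0], [0.0, 0.0]],
]
weights, scaled = channel_attention(maps, kernel=[0.0, 1.0, 0.0])
print(weights)
```

Because pooling collapses each map to one scalar, the convolution runs over a vector whose length equals the number of channels, which is where the computational saving mentioned in the text comes from.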

[0036] Table 1 Spatial Feature Extraction Network Structure

[0037]
Network layer                              Kernel size, channels   Input channels   Output channels   Stride
Convolutional layer 1                      5×5, 256                256              256               1
Network Module - Average Pooling Layer     2×2, 512                256              512               2
Network Module - Convolution Kernel 1      1×3, 512                512              512               1
Network Module - Convolution Kernel 2      3×1, 256                512              256               1
Network Module - Bilinear Interpolation    -                       256              256               -
Network Module - Convolution Kernel 3      1×1, 512                256              512               1
Network Module - Convolution Kernel 4      3×3, 512                512              512               1
Network Module - Convolution Kernel 5      3×3, 512                512              512               1
Network Module - Convolution Kernel 3      1×1, 256                512              256               1
Convolutional layer 2                      3×3, 256                256              256               1
Global average pooling layer               14×14, 256              256              256               1
Convolutional layer 3                      1×3, 256                256              256               1

[0038] For S2: Construct a network model that integrates ViT and spatial feature detection.

[0039] The spatial feature extraction network is fused with the ViT network, and the resulting network structure is shown in Figure 3. After extracting video frames, ViT performs a slicing operation. The facial images cropped in this invention are all 224×224 pixels. The facial frame is split into 16 14×14 pixel patches with a channel dimension of 3. Using 1D positional encoding information, the 16 images are encoded sequentially. Image block embedding is then performed, mapping the images to 16 tokens with a channel dimension of 256 through a linear mapping operation. The 0th token is added, incorporating the positional encoding.

[0040] In the Transformer Encoder, a spatial feature extraction network structure is added to the Multi-Head Attention mechanism to extract spatial feature information in parallel while the multi-head attention mechanism is processing. Since the output of the spatial feature extraction network undergoes feature attention extraction in both the channel and spatial dimensions, the extracted feature weights and channel weights are weighted into the query matrix for more refined feature extraction. Then, the fully connected layers in the Transformer Encoder are modified: because the task is binary classification, the classification head is adjusted to be 2-dimensional.
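
The text does not specify exactly how the spatial and channel weights enter the query matrix. One possible reading, shown purely as an assumption, is an element-wise scaling of each query feature before ordinary scaled dot-product attention. The single-head sketch below uses tiny hand-written tokens, and every name in it is hypothetical.

```python
import math


def softmax(row):
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def weighted_attention(query, key, value, query_weights):
    """Single-head scaled dot-product attention with per-feature query weights.

    Assumption: the weights scale the query element-wise; the source text
    leaves the exact fusion rule open.
    """
    dim = len(query[0])
    weighted_q = [[q * w for q, w in zip(row, query_weights)] for row in query]
    scores = [
        [sum(a * b for a, b in zip(q_row, k_row)) / math.sqrt(dim) for k_row in key]
        for q_row in weighted_q
    ]
    probs = [softmax(row) for row in scores]
    out = [
        [sum(p * v_row[j] for p, v_row in zip(p_row, value))
         for j in range(len(value[0]))]
        for p_row in probs
    ]
    return probs, out


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
probs, out = weighted_attention(tokens, tokens, tokens, query_weights=[1.0, 0.5])
uniform_probs, _ = weighted_attention(tokens, tokens, tokens, query_weights=[0.0, 0.0])
print(probs[0])
```

With all query weights set to zero the attention becomes uniform over the tokens, which shows that in this reading the weights control how strongly each query feature influences token selection.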

[0041] For S3: train and build a network model that integrates ViT and spatial feature detection.

[0042] The Dlib face extractor is used to extract faces from video frames, save face sampling point information, and crop the face images of the extracted frames. Since the main goal of deepfake video detection is to detect fake faces, the deepfake video dataset needs to be preprocessed to extract the required face data information.
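
Once landmark points are available for a frame, the crop region can be derived from them. The sketch below is a hedged illustration only: it assumes the landmarks are given as (x, y) integer pixel pairs (in practice they would come from a detector such as Dlib's 68-point predictor), and the helper `crop_box` and its `margin_percent` parameter are hypothetical, not part of the patent or of Dlib.

```python
def crop_box(landmarks, image_width, image_height, margin_percent=20):
    """Bounding box (left, top, right, bottom) around landmarks, with a margin,
    clamped to the image borders."""
    xs = [x for x, _ in landmarks]
    ys = [y for _, y in landmarks]
    pad_x = (max(xs) - min(xs)) * margin_percent // 100
    pad_y = (max(ys) - min(ys)) * margin_percent // 100
    left = max(0, min(xs) - pad_x)
    top = max(0, min(ys) - pad_y)
    right = min(image_width, max(xs) + pad_x)
    bottom = min(image_height, max(ys) + pad_y)
    return left, top, right, bottom


# Three made-up points standing in for a full landmark set.
points = [(100, 120), (200, 120), (150, 220)]
box = crop_box(points, image_width=640, image_height=480)
print(box)
```

The cropped region would then be resized to the 224×224 input size used by the network.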

[0043] The specific steps for training a network model that integrates ViT and spatial feature detection are as follows: pre-train the network model on a dataset of fake videos; divide the dataset into training, validation, and test sets and perform rotation and scaling; use the validation set to adjust hyperparameters; and finally, use the test set to verify the model's performance.

[0044] Pre-training a network model on a fake video dataset refers to using a deep fake video dataset to pre-train the network. Since the ViT model has a relatively large number of parameters and a relatively long training time, pre-training can be performed first to obtain better initial values, which facilitates faster convergence of subsequent training.

[0045] The steps of dividing the fake video dataset into training, validation, and test sets and standardizing it are to adjust the network hyperparameters and evaluate the network performance. Since overfitting may occur during training, validation on the validation set and comparison of the test results on the test set are used to obtain better training epochs.

[0046] When training the network model, a loss criterion was used to calculate the loss between the model outputs and the true labels. Adam was selected as the optimizer, supervised by the binary cross-entropy loss. After multiple rounds of training and validation, it was found that overfitting was likely to occur after 30 epochs. Therefore, the number of training epochs was set to 30, the initial learning rate was 0.0002, and it was decayed by a factor of 10 after every 10 training epochs. Finally, the test set was used to evaluate the network performance.
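
For reference, the binary cross-entropy loss named here can be written out directly. This is a plain-Python sketch for illustration; the label encoding (1 for one class, 0 for the other) and the clamping constant `eps` are assumptions, and a real training run would use a deep-learning framework's built-in loss instead.

```python
import math


def binary_cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy between predicted probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)


loss = binary_cross_entropy([0.9, 0.2, 0.5], [1, 0, 1])
print(loss)
```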

[0047] For S4: The processed video data is input into the trained fusion ViT and spatial feature detection network to determine whether the video is fake.

[0048] After the processed fake video data is inferred by the trained ViT fusion and spatial feature deep fake video detection model, its output is a probability. A threshold of 0.5 is set to determine whether the judgment is accurate. When the probability is higher than 0.5, the video judgment is accurate, that is, real videos are judged as real and fake videos are judged as fake; otherwise, the judgment is inaccurate. The final result is displayed as the accuracy of correctly judging real and fake videos.
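
The thresholding and accuracy computation can be sketched as follows. The sketch adopts one reading of the paragraph above, namely that the output is the probability of the "fake" class and that a sample is predicted fake when this probability exceeds 0.5; the label encoding and the function name `accuracy_at_threshold` are assumptions for illustration.

```python
def accuracy_at_threshold(probabilities, labels, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the 0/1 label."""
    predictions = [1 if p > threshold else 0 for p in probabilities]
    correct = sum(1 for pred, y in zip(predictions, labels) if pred == y)
    return correct / len(labels)


probs = [0.91, 0.12, 0.65, 0.40]   # made-up model outputs
labels = [1, 0, 0, 1]              # assumed encoding: 1 = fake, 0 = real
accuracy = accuracy_at_threshold(probs, labels)
print(accuracy)
```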

[0049] The above specific embodiments are only used to illustrate the technical solutions of the present invention, and are not intended to limit it. Those skilled in the art should understand that the above embodiments do not limit the present invention in any way, and all similar technical solutions obtained by equivalent substitution or equivalent transformation are within the protection scope of the present invention.

Claims

1. A method for detecting deepfake videos that integrates ViT and spatial features, characterized in that it includes the following steps:
Step 1: Construct a spatial feature extraction network: The spatial feature extraction network consists of a spatial inconsistency module and a channel attention mechanism module, including convolutional layer 1, network module, convolutional layer 2, global average pooling layer, and convolutional layer 3. Each global extraction convolutional layer contains a normalization layer and a non-linear activation layer. The module network plays the role of extracting features and reducing the number of parameters. Convolutional layer 1 consists of 5×5 convolutional kernels with a channel dimension of 256 and a stride of 1, with 256 input and 256 output channels, used for initial extraction of image detail features. The network module consists of an average pooling layer, multiple convolutional kernels, and bilinear interpolation. The average pooling layer consists of 2×2 convolutional kernels with a channel dimension of 512 and a stride of 2, with 256 input and 512 output channels, used for downsampling. Then, feature extraction is performed by one 1×3 and one 3×1 layer, with input and output channels of 512 respectively. The system employs a 256-channel convolution kernel. Bilinear interpolation is added after the 3×1 convolution for dimensionality upscaling, reducing computation. Two 1×1 and two 3×3 convolution kernels are then used for detailed feature extraction, connected via skip connections using residuals. The average pooling layer consists of 14×14 convolution kernels with 256 channels, with both input and output channels set to 256 to further reduce computation. Channel attention is then introduced, with convolutional layer 3 set to 1×3 with 256 channels and a stride of 1, and both input and output channels set to 256, for feature extraction and selection along the channel dimension.
Step 2: Construct a network that integrates ViT and spatial feature detection: This paper integrates ViT and a designed spatial feature extraction network. After extracting video frames, ViT performs a slicing operation, dividing the 224×224 pixel frame into 16 14×14 pixel patches. After flattening, the channel dimension is 3. Using 1D positional encoding information, the 16 images are encoded sequentially and embedded into the positional encoding process. Through a linear mapping operation, it is mapped into 16 tokens with a channel dimension of 128, and the 0th token is added to incorporate the positional encoding. In the Transformer Encoder, the spatial feature extraction network structure is added to the Multi-Head Attention to extract spatial feature information in parallel. The extracted feature weights and channel weights are weighted into the query, key, and value matrix for more detailed feature extraction and to guide weight updates.
Step 3: Train and build a network model that integrates ViT and spatial feature detection: The specific steps for training the network model that integrates ViT and spatial feature detection are as follows: use the Dlib face extractor to extract faces from video frames, save the face sampling point information, and crop the face images of the extracted frames; pre-train the network model on the dataset of fake videos; divide the dataset into training set, validation set and test set and perform rotation and scaling processing. The model performance is then evaluated using a validation set and finally tested on a test set. The steps for dividing the fake video dataset into training, validation, and test sets and standardizing it are as follows: the dataset is divided into training, validation, and test sets in a 3:1:1 ratio.
Step 4: Input the processed video data into the trained ViT and spatial feature detection network. The output of the network model is a probability, which shows the accuracy of the final result in judging whether the video is real or fake.

2. The method for detecting deepfake videos that integrates ViT and spatial features according to claim 1, characterized in that: Various image enhancement techniques were used to augment the data of extracted frames from the training set video.

3. The method for detecting deepfake videos that integrates ViT and spatial features according to claim 1, characterized in that: The face data in the video frames was extracted using the Dlib face 68-sampling-point detector.

4. The method for detecting deepfake videos that integrates ViT and spatial features according to claim 1, characterized in that: When training the network model, the criterion was selected as the loss function to calculate the loss between the model outputs and the true labels. Adam was selected as the optimizer supervised by the binary cross-entropy loss for optimization, and the model was trained and validated in multiple rounds.

5. The method for detecting deepfake videos that integrates ViT and spatial features according to claim 1, characterized in that: The spatial feature extraction network consists of two parts: a spatial inconsistency module network and a channel attention module network. It is composed of two 1×1 convolutions, two 3×3 convolutions, 1×3 and 3×1 convolutions, upsampling, normalization, nonlinear activation, and attention mechanism modules.

6. The method for detecting deepfake videos by fusing ViT and spatial features according to claim 1, characterized in that: Each module network uses depthwise separable convolution to reduce network parameters and incorporates residual connections to make the network design deeper. Non-linear activation layers use the sigmoid function.

7. The method for detecting deepfake videos by fusing ViT and spatial features according to claim 1, characterized in that: ViT's Encoder uses a weighted adjustment strategy after extracting the confidence level of spatial features.

8. The method for detecting deepfake videos by fusing ViT and spatial features according to claim 1, characterized in that: The specific method for inputting the processed data into the trained forgery detection network and outputting its category is as follows: The processed video data is input into the trained fusion ViT and spatial feature detection network. The output of the network model is a probability. A threshold of 0.5 is set to determine whether the judgment is accurate. When the probability is higher than 0.5, the video judgment is accurate, that is, real videos are judged as real and forged videos are judged as fake; otherwise, the judgment is inaccurate. Finally, the overall accuracy value is output.