A fake video detection method based on Transformer
A video detection method and video technology, applied in the field of deepfake detection, that addresses problems such as the poor generalization of existing detectors, achieving the effect of improving detection accuracy and avoiding poor generalization performance.
Examples
Embodiment 1
[0040] In step a), the VideoReader class (a Python video-reading utility) is used to decode the video into t consecutive video frames. For each extracted frame, the get_frontal_face_detector function from the dlib face-recognition library is used to detect the face and crop the face image; the cropped faces are saved into a video folder, yielding t face images of consecutive frames in that folder.
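Step a) can be sketched as follows. This is a minimal illustration only: it assumes the decord library supplies the VideoReader class mentioned above and that dlib and Pillow are installed; the function name extract_faces and the output layout are hypothetical, not from the patent.

```python
import os

def extract_faces(video_path, out_dir, t):
    """Decode up to t consecutive frames and save one cropped face per frame (sketch)."""
    # Heavy dependencies are imported lazily; decord (for VideoReader) and dlib
    # are assumptions about the tooling behind the patent's description.
    from decord import VideoReader
    from PIL import Image
    import dlib

    os.makedirs(out_dir, exist_ok=True)
    vr = VideoReader(video_path)                 # the patent's "VideoReader class"
    detector = dlib.get_frontal_face_detector()  # the patent's dlib face detector
    for idx in range(min(t, len(vr))):
        frame = vr[idx].asnumpy()                # H x W x 3, RGB
        rects = detector(frame, 1)               # upsample once to catch small faces
        if not rects:
            continue                             # no face detected in this frame
        r = rects[0]
        # Clamp the detection box to the frame bounds before cropping.
        top, bottom = max(r.top(), 0), min(r.bottom(), frame.shape[0])
        left, right = max(r.left(), 0), min(r.right(), frame.shape[1])
        Image.fromarray(frame[top:bottom, left:right]).save(
            os.path.join(out_dir, f"face_{idx:04d}.png"))
```

Calling `extract_faces("video.mp4", "faces/", t=16)` would populate the folder with at most 16 cropped face images, one per consecutive frame.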
Embodiment 2
[0042] The t consecutive-frame face images obtained in step a) are resized to a width and height of 224 and 224, respectively, and normalized channel-wise using the mean [0.4718, 0.3467, 0.3154] and the variance [0.1656, 0.1432, 0.1364]. The normalized t face images of consecutive frames are packed into a tensor x_i ∈ R^(b×t×c×h×w), where R is a vector space and the video labels are [b, 0/1]; x_i is the i-th video batch, i ∈ {1, …, K/b}, b is the number of videos in each batch, c is the number of channels of each face image, h is the height of each face image, and w is the width of each face image; 0 denotes a fake video and 1 a real video.
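The normalization and packing above can be sketched with NumPy. This is a sketch under assumptions: the images are already resized to 224×224 and supplied as uint8 RGB arrays, and the helper name pack_batch is illustrative, not from the patent.

```python
import numpy as np

MEAN = np.array([0.4718, 0.3467, 0.3154])  # per-channel mean from the patent
STD  = np.array([0.1656, 0.1432, 0.1364])  # per-channel "variance" from the patent

def pack_batch(videos):
    """videos: list of b videos, each a list of t HxWx3 uint8 face images.
    Returns a float32 tensor of shape [b, t, c, h, w], normalized per channel."""
    x = np.stack([np.stack(v) for v in videos]).astype(np.float32) / 255.0  # [b,t,h,w,c]
    x = (x - MEAN) / STD                  # channel-wise normalization
    return x.transpose(0, 1, 4, 2, 3)     # reorder to [b, t, c, h, w]

# Tiny smoke run: b=2 videos of t=4 all-black frames.
b, t, h, w = 2, 4, 224, 224
videos = [[np.zeros((h, w, 3), dtype=np.uint8) for _ in range(t)] for _ in range(b)]
x = pack_batch(videos)
print(x.shape)  # (2, 4, 3, 224, 224)
```

For an all-black frame, every channel-0 value becomes (0 − 0.4718) / 0.1656, which is one quick way to sanity-check the normalization.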
Embodiment 3
[0044] Step b) includes the following steps:
[0045] b-1) Establish a feature extraction module composed of five consecutive blocks. The first, second, and third blocks each consist of three consecutive convolutional layers followed by a max-pooling layer; the fourth and fifth blocks each consist of four consecutive convolutional layers followed by a max-pooling layer. Every convolutional layer uses a 3×3 kernel with stride 1 and padding 1; every max-pooling layer uses a 2×2-pixel window with stride 2. The first convolutional layer of the first block has 32 channels, and each convolutional layer of the fifth block has 512 channels.
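The spatial bookkeeping of this design can be checked with a short calculation. This is a sketch: the 32→64→128→256→512 channel doubling between blocks is an assumption beyond the two endpoints the text states (32 in the first block, 512 in the fifth).

```python
def block_out_size(size, n_conv):
    # A 3x3 convolution with stride 1 and padding 1 preserves the spatial size;
    # the trailing 2x2 max-pool with stride 2 halves it.
    for _ in range(n_conv):
        size = (size + 2 * 1 - 3) // 1 + 1   # = size (unchanged)
    return size // 2

size = 224
convs_per_block = [3, 3, 3, 4, 4]     # blocks 1-3: three convs; blocks 4-5: four
channels = [32, 64, 128, 256, 512]    # assumed doubling between stated endpoints
for n in convs_per_block:
    size = block_out_size(size, n)
print(size)  # 7: a 224x224 face image shrinks to a 7x7 feature map after five blocks
```

Because the convolutions are size-preserving, only the five pooling layers shrink the map: 224 / 2^5 = 7.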
[0046] b-2) The tensor x_i ∈ R^(b×t×c×h×w) is reshaped to [b*t, c, h, w] and fed into the feature extraction module; the output ...
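The dimension transform in b-2) is a plain reshape that merges the batch and time axes, so every frame passes through the 2-D feature extractor as an independent sample. A minimal NumPy sketch:

```python
import numpy as np

b, t, c, h, w = 2, 4, 3, 224, 224
x = np.arange(b * t * c * h * w, dtype=np.float32).reshape(b, t, c, h, w)

# [b, t, c, h, w] -> [b*t, c, h, w]: frames become independent samples.
frames = x.reshape(b * t, c, h, w)
print(frames.shape)  # (8, 3, 224, 224)

# The inverse reshape recovers the per-video grouping after feature extraction.
assert np.array_equal(frames.reshape(b, t, c, h, w), x)
```

The same trick is the standard way to run a per-frame CNN over a video batch before a temporal model (such as the Transformer of the title) consumes the per-frame features.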