DOI: 10.1145/3581807.3581847

Video Forgery Detection Using Spatio-Temporal Dual Transformer

Published: 22 May 2023

Abstract

Fake videos produced by deep generative models pose a potential threat to social stability, which makes detecting them critical. Although previous detection methods achieve high accuracy, they generalize poorly to other datasets and to realistic scenes. We identify several novel temporal and spatial clues. In the frequency domain, real and fake videos differ significantly more in their inter-frame differences than in their intra-frame differences. In the shallow texture of the CbCr color channels, the forged regions of fake videos show more distinct blurring than in real videos. In addition, the optical flow of a real video changes gradually, whereas the optical flow of a fake video changes drastically. This paper proposes a spatio-temporal dual Transformer network for video forgery detection that combines these spatio-temporal clues with the temporal consistency of consecutive frames to improve generalization. Specifically, an EfficientNet first extracts spatial artifacts from shallow textures and high-frequency information. We add a new loss function to EfficientNet to extract more robust face features and introduce an attention mechanism to enhance the extracted features. Next, a Swin Transformer captures the subtle temporal artifacts in the inter-frame spectrum difference and the optical flow. A feature interaction module fuses local features with global representations. Finally, a second Swin Transformer classifies the videos from the extracted spatio-temporal features. We evaluate our method on datasets including FaceForensics++, Celeb-DF (v2), and DFDC. Extensive experiments show that the proposed framework achieves high accuracy and generalizes well, outperforming current state-of-the-art methods.
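
The three clues named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names are hypothetical, the 2-D FFT is an assumed choice of frequency transform, the CbCr conversion assumes full-range ITU-R BT.601 (JPEG-style) coefficients, and the optical flow fields are assumed to be precomputed by some external estimator.

```python
import numpy as np

def log_spectrum(gray):
    """Log-magnitude of the centered 2-D FFT of one grayscale frame (H, W)."""
    return np.log1p(np.abs(np.fft.fftshift(np.fft.fft2(gray))))

def inter_frame_spectrum_diff(frames):
    """Absolute spectrum difference between consecutive frames.

    frames: array of shape (T, H, W); returns an array of shape (T-1, H, W).
    The FFT here is an assumption; the abstract only says "frequency domain".
    """
    spectra = np.stack([log_spectrum(f) for f in frames])
    return np.abs(np.diff(spectra, axis=0))

def rgb_to_cbcr(rgb):
    """Cb and Cr planes of an RGB image with values in [0, 255].

    Uses full-range BT.601 coefficients (an assumption, not from the paper).
    rgb: (..., 3); returns (..., 2) with Cb first, Cr second.
    """
    r, g, b = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    cb = 128.0 - 0.168736 * r - 0.331264 * g + 0.5 * b
    cr = 128.0 + 0.5 * r - 0.418688 * g - 0.081312 * b
    return np.stack([cb, cr], axis=-1)

def flow_change(flows):
    """Mean magnitude of the change between consecutive optical flow fields.

    flows: (T, H, W, 2), precomputed elsewhere; returns shape (T-1,).
    Large values indicate abrupt changes in motion between frames.
    """
    delta = np.diff(flows, axis=0)
    return np.linalg.norm(delta, axis=-1).mean(axis=(1, 2))
```

In a pipeline of the kind the abstract describes, maps like these would be inputs to the learned networks rather than detectors in their own right; the sketch only shows what each clue measures.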

Cited By

  • (2024) Hybrid Deep-Learning Model for Deepfake Detection in Video using Transfer Learning Approach. National Academy Science Letters. DOI: 10.1007/s40009-024-01480-7. Online publication date: 15 Oct 2024.

Published In

ICCPR '22: Proceedings of the 2022 11th International Conference on Computing and Pattern Recognition
November 2022
683 pages
ISBN: 9781450397056
DOI: 10.1145/3581807

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. EfficientNet
  2. Swin Transformer
  3. deepfakes
  4. dual Transformer
  5. high frequency
  6. optical flow
  7. spatio-temporal clues
  8. texture
  9. video forgery detection

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICCPR 2022
