ABSTRACT
With the development of convolutional neural networks (CNNs) and generative adversarial networks (GANs) in recent years, classifying fake videos produced through DeepFake has become a very difficult task. Most previous studies on DeepFake detection focused on finding DeepFake artifacts with CNNs. CNN-based DeepFake detection achieves high accuracy but is vulnerable to noisy inputs such as side faces, shadowed faces, and low-quality images. In addition, although its inductive bias lets it learn quickly, a CNN tends to overfit to specific datasets and shows low accuracy on manipulated videos created with a type of DeepFake different from those in the training dataset.
In this study, we propose a robust DeepFake detection method that combines vision transformer (ViT) and CNN models. Our experiments showed that the ViT model was highly effective in processing side faces and low-quality videos. Our method, which combines the ResNeSt269 model with the DeiT model using a weighted majority voting ensemble (WMVE) approach, achieved 97.66% accuracy, outperforming the existing state-of-the-art model of the DeepFake Detection Challenge (DFDC), which achieved 96.78% accuracy. In addition, when benchmarked on a dataset completely different from the training dataset, our method is robust to the new dataset, showing more than 10% higher accuracy than the CNN model owing to the high generalization performance of ViT.
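The abstract does not spell out how the weighted majority voting ensemble is computed, so the following is only a minimal sketch of one common form of weighted voting, not the authors' implementation. It assumes each model emits a per-frame fake probability, that each frame counts as one vote, and that a model's vote weight is the log-odds of its validation accuracy. The model names, accuracies, probabilities, and the 0.5 threshold are all hypothetical.

```python
import math


def accuracy_to_weight(accuracy):
    """Log-odds weight for a voter with the given accuracy (assumed weighting rule)."""
    return math.log(accuracy / (1.0 - accuracy))


def weighted_majority_vote(frame_probs, weights, threshold=0.5):
    """Classify one video as 'fake' or 'real' from per-frame model outputs.

    frame_probs: dict mapping model name -> list of per-frame fake probabilities
    weights:     dict mapping model name -> weight of each vote cast by that model
    """
    fake_score = 0.0
    real_score = 0.0
    for name, probs in frame_probs.items():
        for p in probs:
            # Each frame casts one hard vote, scaled by its model's weight.
            if p >= threshold:
                fake_score += weights[name]
            else:
                real_score += weights[name]
    # Ties fall back to 'real' in this sketch.
    return "fake" if fake_score > real_score else "real"


# Hypothetical validation accuracies and per-frame outputs for one video.
weights = {"cnn": accuracy_to_weight(0.90), "vit": accuracy_to_weight(0.95)}
frame_probs = {"cnn": [0.2, 0.4, 0.7], "vit": [0.8, 0.9, 0.3]}
label = weighted_majority_vote(frame_probs, weights)
print(label)
```

In this toy input the two models cast three fake votes and three real votes in total, so the weights alone decide the outcome: the more accurate voter's fake votes outweigh the rest.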
Index Terms
- Robust DeepFake Detection Method based on Ensemble of ViT and CNN