Abstract
As computer-generated content and deepfakes steadily improve, semantic approaches to multimedia forensics will become increasingly important. In this paper, we introduce a novel classification architecture for identifying semantic inconsistencies between video appearance and text caption in social media news posts. While similar systems exist for text and images, we aim to detect inconsistencies in a more ambiguous setting: videos can be long, contain several distinct scenes, and add audio as an extra modality. We develop a multi-modal fusion framework that identifies mismatches between videos and captions in social media posts by leveraging an ensemble of textual analysis of the caption, automatic audio transcription, semantic video analysis, object detection, named entity consistency, and facial verification. To train and test our approach, we curate a new video-based dataset of 4,000 real-world Facebook news posts. Our multi-modal approach achieves 60.5% classification accuracy on random mismatches between caption and appearance, compared to accuracy below 50% for uni-modal models. Further ablation studies confirm that fusion across modalities is necessary for correctly identifying semantic inconsistencies.
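The abstract describes an ensemble that fuses per-modality consistency signals into a single consistency/mismatch decision. As a minimal sketch only, assuming a late-fusion design with scalar per-modality agreement scores: the names `Post`, `fuse_scores`, `classify`, and the scorer interface below are our own illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical late-fusion sketch for caption/video consistency scoring.
# Each modality-specific scorer is assumed to return a scalar agreement
# score; fixed weights and a sigmoid stand in for a learned fusion layer.

from dataclasses import dataclass
from typing import Callable, Dict

import numpy as np


@dataclass
class Post:
    caption: str           # text caption of the news post
    transcript: str        # automatic audio transcription
    video_emb: np.ndarray  # pooled semantic video embedding


def fuse_scores(scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted late fusion: map per-modality agreement scores to a
    probability that the caption is consistent with the video."""
    z = sum(weights[name] * s for name, s in scores.items())
    return float(1.0 / (1.0 + np.exp(-z)))  # sigmoid


def classify(post: Post,
             scorers: Dict[str, Callable[[Post], float]],
             weights: Dict[str, float],
             threshold: float = 0.5) -> bool:
    """Return True if the post is judged consistent, False if mismatched."""
    scores = {name: fn(post) for name, fn in scorers.items()}
    return fuse_scores(scores, weights) >= threshold


# Toy usage with dummy scorers standing in for the six real branches
# (caption analysis, audio transcription, semantic video analysis,
# object detection, named-entity consistency, facial verification):
if __name__ == "__main__":
    post = Post(caption="Flooding hits coastal town",
                transcript="heavy rain and flooding reported",
                video_emb=np.zeros(512))
    scorers = {"text": lambda p: 0.8, "audio": lambda p: 0.7}
    weights = {"text": 1.0, "audio": 1.0}
    print(classify(post, scorers, weights))  # True
```

In a full system the fixed weights would be replaced by a fusion classifier trained end-to-end over the modality embeddings; the sketch only illustrates how ensemble decisions across modalities could be combined.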
Copyright information
© 2022 Springer Nature Switzerland AG
About this paper
Cite this paper
McCrae, S., Wang, K., Zakhor, A. (2022). Multi-modal Semantic Inconsistency Detection in Social Media News Posts. In: Þór Jónsson, B., et al. (eds.) MultiMedia Modeling. MMM 2022. Lecture Notes in Computer Science, vol. 13142. Springer, Cham. https://doi.org/10.1007/978-3-030-98355-0_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-98354-3
Online ISBN: 978-3-030-98355-0