DOI: 10.1145/3536221.3556625

Does Audio help in deep Audio-Visual Saliency prediction models?

Published: 07 November 2022

Abstract

Although existing Audio-Visual Saliency Prediction (AVSP) models claim to achieve promising results by fusing the audio modality into visual-only models, we show that these models fail to leverage audio information. In this paper, we investigate the relevance of audio cues in conjunction with visual ones and conduct an extensive analysis employing well-established audio modules and fusion techniques from diverse, correlated audio-visual tasks. Our analysis on ten diverse saliency datasets suggests that none of these methods succeeds in incorporating audio. Furthermore, we bring to light why AVSP models show a gain in performance over visual-only models even though the audio branch is effectively ignored at inference. Our work questions the role of audio in current deep AVSP models and, by demonstrating that simpler alternatives work equally well, points the community toward a clear avenue for reconsidering these complex architectures.
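The paper's central claim, that the audio branch is effectively ignored at inference, is the kind of finding one can probe with a simple input-ablation test: run the same model on real and on silenced audio and measure how much the predicted saliency maps change. Below is a minimal sketch of such a probe. It is not the authors' code; the toy model, the naive late-fusion scheme, and the feature shapes are all hypothetical placeholders.

```python
# Minimal sketch (not the authors' code): probe whether an audio-visual
# saliency model actually uses its audio branch by comparing predictions
# on real audio vs. zeroed audio at inference time.
import torch
import torch.nn as nn


class ToyAVSPModel(nn.Module):
    """Stand-in for a deep AVSP model with separate audio/visual branches."""

    def __init__(self):
        super().__init__()
        self.visual_branch = nn.Conv3d(3, 8, kernel_size=3, padding=1)
        self.audio_branch = nn.Linear(128, 8)  # e.g. pooled log-mel features
        self.decoder = nn.Conv3d(8, 1, kernel_size=1)

    def forward(self, frames, audio):
        v = self.visual_branch(frames)            # (B, 8, T, H, W)
        a = self.audio_branch(audio)              # (B, 8)
        fused = v + a[:, :, None, None, None]     # naive late fusion
        return torch.sigmoid(self.decoder(fused)) # per-frame saliency maps


@torch.no_grad()
def audio_ablation_gap(model, frames, audio):
    """Mean absolute difference between predictions with real vs. silenced
    audio. A near-zero gap suggests the audio branch is ignored at inference."""
    model.eval()
    pred_real = model(frames, audio)
    pred_silent = model(frames, torch.zeros_like(audio))
    return (pred_real - pred_silent).abs().mean().item()


if __name__ == "__main__":
    model = ToyAVSPModel()
    frames = torch.randn(2, 3, 8, 32, 32)  # (B, C, T, H, W) video clip
    audio = torch.randn(2, 128)            # per-clip audio features
    gap = audio_ablation_gap(model, frames, audio)
    print(f"mean |delta saliency| when audio is zeroed: {gap:.6f}")
```

If, for a trained AVSP model, this gap stays near zero while an analogous visual ablation collapses the predictions, that pattern is consistent with the paper's observation that the reported audio-visual gains do not come from the audio signal itself.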


Cited By

  • Listen to Look Into the Future: Audio-Visual Egocentric Gaze Anticipation. In Computer Vision – ECCV 2024, pp. 192–210. DOI: 10.1007/978-3-031-72673-6_11. Published: 22 October 2024.


Published In

ICMI '22: Proceedings of the 2022 International Conference on Multimodal Interaction
November 2022
830 pages
ISBN: 9781450393904
DOI: 10.1145/3536221

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Human Visual Attention
  2. Multi-modal Learning
  3. Saliency Prediction

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICMI '22

Acceptance Rates

Overall acceptance rate: 453 of 1,080 submissions, 42%

Article Metrics

  • Downloads (last 12 months): 49
  • Downloads (last 6 weeks): 5

Reflects downloads up to 17 Feb 2025.
