DOI: 10.1145/3379156.3391337

Deep Audio-Visual Saliency: Baseline Model and Data

Published: 02 June 2020

Abstract

This paper introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction, dubbed “DAVE”, together with our efforts towards building an Audio-Visual Eye-tracking corpus named “AVE”. Despite the strong relation between auditory and visual cues in guiding gaze during perception, existing video saliency models consider only visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we propose a baseline deep audio-visual saliency model for multi-modal saliency prediction in the wild; as a baseline, the model is intentionally kept simple. A video-only model is also developed on the same architecture to assess the effectiveness of the audio-visual model on a fair basis. We demonstrate that the audio-visual saliency model outperforms the video-only saliency models. The data and code are available at https://hrtavakoli.github.io/AVE/ and https://github.com/hrtavakoli/DAVE
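
The abstract describes a two-stream design: a video branch and an audio branch built on the same backbone, whose features are combined to predict a saliency map. As a rough illustration only (not the authors' implementation; the layer sizes, the concatenation-based fusion, and the input shapes below are assumptions made for this sketch), a minimal PyTorch model of that idea could look like the following; the real DAVE code is in the linked repository.

# Minimal two-stream audio-visual saliency sketch (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder3D(nn.Module):
    """Tiny 3D-convolutional encoder reused for both modalities."""

    def __init__(self, in_channels, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)  # (B, feat_dim, T', H', W')


class AudioVisualSaliency(nn.Module):
    """Fuses a video clip and an audio representation into one saliency map."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.video_enc = Encoder3D(in_channels=3, feat_dim=feat_dim)  # RGB frames
        self.audio_enc = Encoder3D(in_channels=1, feat_dim=feat_dim)  # spectrogram "clip" (assumed input)
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),  # single-channel saliency logits
        )

    def forward(self, video, audio):
        v = self.video_enc(video).mean(dim=2)          # average over time -> (B, C, H', W')
        a = self.audio_enc(audio).mean(dim=(2, 3, 4))  # global audio embedding -> (B, C)
        a = a[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        fused = torch.cat([v, a], dim=1)               # concatenate the two modalities
        sal = torch.sigmoid(self.decoder(fused))       # saliency map in [0, 1]
        return F.interpolate(sal, scale_factor=4, mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = AudioVisualSaliency()
    clip = torch.randn(2, 3, 16, 64, 64)  # (batch, RGB, frames, height, width)
    spec = torch.randn(2, 1, 16, 64, 64)  # hypothetical log-mel "clip", same layout
    print(model(clip, spec).shape)        # torch.Size([2, 1, 64, 64])

A video-only baseline in the spirit of the paper's comparison would simply drop the audio branch and decode the video features alone, keeping the rest of the architecture unchanged.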

Published In

ETRA '20 Short Papers: ACM Symposium on Eye Tracking Research and Applications
June 2020
305 pages
ISBN: 9781450371346
DOI: 10.1145/3379156

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 June 2020

Author Tags

  1. Audio-Visual Saliency
  2. Deep Learning
  3. Dynamic Visual Attention

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ETRA '20

Acceptance Rates

Overall Acceptance Rate 69 of 137 submissions, 50%

Cited By

  • (2024) Multi-Modal Gaze Following in Conversational Scenarios. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1175-1184. https://doi.org/10.1109/WACV57701.2024.00122. Online publication date: 3-Jan-2024
  • (2024) Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio. IEEE Transactions on Multimedia 26, 764-775. https://doi.org/10.1109/TMM.2023.3271022. Online publication date: 2024
  • (2023) Multi-modal cognitive computing. SCIENTIA SINICA Informationis 53(1), 1. https://doi.org/10.1360/SSI-2022-0226. Online publication date: 11-Jan-2023
  • (2023) A Trained Humanoid Robot can Perform Human-Like Crossmodal Social Attention and Conflict Resolution. International Journal of Social Robotics 15(8), 1325-1340. https://doi.org/10.1007/s12369-023-00993-3. Online publication date: 2-Apr-2023
  • (2023) Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning. Applied Intelligence 53(19), 22615-22634. https://doi.org/10.1007/s10489-023-04714-1. Online publication date: 30-Jun-2023
  • (2021) Gazing at Social Interactions Between Foraging and Decision Theory. Frontiers in Neurorobotics 15. https://doi.org/10.3389/fnbot.2021.639999. Online publication date: 30-Mar-2021
  • (2020) A Biologically Motivated, Proto-Object-Based Audiovisual Saliency Model. AI 1(4), 487-509. https://doi.org/10.3390/ai1040030. Online publication date: 3-Nov-2020
  • (2020) On Gaze Deployment to Audio-Visual Cues of Social Interactions. IEEE Access 8, 161630-161654. https://doi.org/10.1109/ACCESS.2020.3021211. Online publication date: 2020
