DOI: 10.1145/3379156.3391337

Deep Audio-Visual Saliency: Baseline Model and Data

Published: 02 June 2020

Abstract

This paper introduces a conceptually simple and effective Deep Audio-Visual Embedding for dynamic saliency prediction, dubbed “DAVE”, together with our efforts towards building an Audio-Visual Eye-tracking corpus named “AVE”. Despite the strong relation between auditory and visual cues in guiding gaze during perception, existing video saliency models consider only visual cues and neglect the auditory information that is ubiquitous in dynamic scenes. Here, we propose a baseline deep audio-visual saliency model for multi-modal saliency prediction in the wild; as a baseline, the model is intentionally kept simple. A video-only model is also developed on the same architecture to assess the effectiveness of the audio-visual model on a fair basis. We demonstrate that the audio-visual saliency model outperforms the video-only saliency models. The data and code are available at https://hrtavakoli.github.io/AVE/ and https://github.com/hrtavakoli/DAVE
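
The abstract describes a two-stream design: a video branch and an audio branch built on the same backbone, whose features are combined to predict a saliency map. As a rough illustration only (not the authors' implementation; the layer sizes, the concatenation-based fusion, and the input shapes below are assumptions made for this sketch), a minimal PyTorch model of that idea could look like the following; the real DAVE code is in the linked repository.

# Minimal two-stream audio-visual saliency sketch (illustrative assumptions only).
import torch
import torch.nn as nn
import torch.nn.functional as F


class Encoder3D(nn.Module):
    """Tiny 3D-convolutional encoder reused for both modalities."""

    def __init__(self, in_channels, feat_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_channels, 32, kernel_size=3, stride=(1, 2, 2), padding=1),
            nn.ReLU(inplace=True),
            nn.Conv3d(32, feat_dim, kernel_size=3, stride=2, padding=1),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.net(x)  # (B, feat_dim, T', H', W')


class AudioVisualSaliency(nn.Module):
    """Fuses a video clip and an audio representation into one saliency map."""

    def __init__(self, feat_dim=64):
        super().__init__()
        self.video_enc = Encoder3D(in_channels=3, feat_dim=feat_dim)  # RGB frames
        self.audio_enc = Encoder3D(in_channels=1, feat_dim=feat_dim)  # spectrogram "clip" (assumed input)
        self.decoder = nn.Sequential(
            nn.Conv2d(2 * feat_dim, 64, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 1, kernel_size=1),  # single-channel saliency logits
        )

    def forward(self, video, audio):
        v = self.video_enc(video).mean(dim=2)          # average over time -> (B, C, H', W')
        a = self.audio_enc(audio).mean(dim=(2, 3, 4))  # global audio embedding -> (B, C)
        a = a[:, :, None, None].expand(-1, -1, v.shape[2], v.shape[3])
        fused = torch.cat([v, a], dim=1)               # concatenate the two modalities
        sal = torch.sigmoid(self.decoder(fused))       # saliency map in [0, 1]
        return F.interpolate(sal, scale_factor=4, mode="bilinear", align_corners=False)


if __name__ == "__main__":
    model = AudioVisualSaliency()
    clip = torch.randn(2, 3, 16, 64, 64)  # (batch, RGB, frames, height, width)
    spec = torch.randn(2, 1, 16, 64, 64)  # hypothetical log-mel "clip", same layout
    print(model(clip, spec).shape)        # torch.Size([2, 1, 64, 64])

A video-only baseline in the spirit of the paper's comparison would simply drop the audio branch and decode the video features alone, keeping the rest of the architecture unchanged.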

Published In

ETRA '20 Short Papers: ACM Symposium on Eye Tracking Research and Applications
June 2020
305 pages
ISBN: 9781450371346
DOI: 10.1145/3379156

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 June 2020

Author Tags

  1. Audio-Visual Saliency
  2. Deep Learning
  3. Dynamic Visual Attention

Qualifiers

  • Short-paper
  • Research
  • Refereed limited

Conference

ETRA '20

Acceptance Rates

Overall Acceptance Rate 69 of 137 submissions, 50%

Cited By

  • (2024) Multi-Modal Gaze Following in Conversational Scenarios. 2024 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 1175-1184. https://doi.org/10.1109/WACV57701.2024.00122. Online publication date: 3-Jan-2024
  • (2024) Unified Audio-Visual Saliency Model for Omnidirectional Videos With Spatial Audio. IEEE Transactions on Multimedia 26, 764-775. https://doi.org/10.1109/TMM.2023.3271022. Online publication date: 2024
  • (2023) Multi-modal cognitive computing. SCIENTIA SINICA Informationis 53(1), 1. https://doi.org/10.1360/SSI-2022-0226. Online publication date: 11-Jan-2023
  • (2023) A Trained Humanoid Robot can Perform Human-Like Crossmodal Social Attention and Conflict Resolution. International Journal of Social Robotics 15(8), 1325-1340. https://doi.org/10.1007/s12369-023-00993-3. Online publication date: 2-Apr-2023
  • (2023) Audio-visual aligned saliency model for omnidirectional video with implicit neural representation learning. Applied Intelligence 53(19), 22615-22634. https://doi.org/10.1007/s10489-023-04714-1. Online publication date: 30-Jun-2023
  • (2021) Gazing at Social Interactions Between Foraging and Decision Theory. Frontiers in Neurorobotics 15. https://doi.org/10.3389/fnbot.2021.639999. Online publication date: 30-Mar-2021
  • (2020) A Biologically Motivated, Proto-Object-Based Audiovisual Saliency Model. AI 1(4), 487-509. https://doi.org/10.3390/ai1040030. Online publication date: 3-Nov-2020
  • (2020) On Gaze Deployment to Audio-Visual Cues of Social Interactions. IEEE Access 8, 161630-161654. https://doi.org/10.1109/ACCESS.2020.3021211. Online publication date: 2020
