Audio-Visual Salient Object Detection

Cheng, Shuaiyang; Song, Liang; Tang, Jingjing; Guo, Shihui

doi:10.1007/978-3-030-84529-2_43

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12837)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

This paper studies audio-visual salient object detection. Salient object detection aims to detect and mark the objects in a visual scene that attract the most human attention. Traditionally, visual salient object detection uses only images or video frames, without modeling human multi-modal perception, which includes the interaction between vision and hearing. To improve visual salient object detection, we incorporate the audio modality into the traditional task by applying a two-stream audio-visual deep learning network. To this end, we also build an audio-visual salient object detection dataset, called AVSOD, based on an existing dataset. To verify the effectiveness of the audio modality, we compare the performance of the deep learning model with and without it. The experimental results demonstrate that the audio modality usefully supplements visual salient object detection, and they also verify the effectiveness of the proposed dataset.
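
The abstract does not describe the network's internals, so the following is only a minimal sketch of one generic way a two-stream model could combine modalities, and of the with/without-audio comparison. It is not the authors' architecture. The names `predict_saliency`, `w_visual`, and `use_audio`, the dot-product fusion, and all feature values are assumptions made purely for illustration.

```python
from math import exp


def sigmoid(x: float) -> float:
    """Squash a raw score into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-x))


def dot(u, v) -> float:
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def predict_saliency(visual_feats, audio_feat, w_visual, use_audio=True):
    """Return one saliency value per spatial location.

    visual_feats: per-location feature vectors (hypothetical visual stream output)
    audio_feat:   one feature vector for the clip (hypothetical audio stream output)
    w_visual:     weights of a hypothetical visual-only scoring layer
    use_audio:    when False, the audio stream is ignored (visual-only baseline)
    """
    scores = []
    for feat in visual_feats:
        score = dot(w_visual, feat)  # visual-only evidence
        if use_audio:
            # Assumed fusion: boost locations whose visual features
            # correlate with the audio embedding.
            score += dot(feat, audio_feat)
        scores.append(sigmoid(score))
    return scores


# Toy data: two locations that look equally salient from vision alone.
visual_feats = [[1.0, 0.0], [0.0, 1.0]]
audio_feat = [2.0, 0.0]  # the sound matches the first location
w_visual = [0.5, 0.5]

visual_only = predict_saliency(visual_feats, audio_feat, w_visual, use_audio=False)
audio_visual = predict_saliency(visual_feats, audio_feat, w_visual, use_audio=True)
```

In this toy setup the visual-only baseline scores both locations identically, while the audio-visual variant ranks the sounding location higher. That is the kind of supplementary effect the comparison in the abstract is designed to measure.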

This work was supported by the Fundamental Research Funds for the Central Universities of China under Grant No. 20720190028, and the National Natural Science Foundation of China under Grant 62072383.

Author information

Authors and Affiliations

School of Informatics, Xiamen University, Xiamen, 361005, China
Shuaiyang Cheng, Liang Song, Jingjing Tang & Shihui Guo

Corresponding author

Correspondence to Liang Song.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
De-Shuang Huang
University of Ulsan, Ulsan, Korea (Republic of)
Kang-Hyun Jo
Shenzhen University, Shenzhen, China
Jianqiang Li
Far Eastern Branch of the Russian Academy of Sciences, Vladivostok, Russia
Valeriya Gribova
Department of Computer Science, Liverpool John Moores University, Liverpool, UK
Abir Hussain

About this paper

Cite this paper

Cheng, S., Song, L., Tang, J., Guo, S. (2021). Audio-Visual Salient Object Detection. In: Huang, DS., Jo, KH., Li, J., Gribova, V., Hussain, A. (eds) Intelligent Computing Theories and Application. ICIC 2021. Lecture Notes in Computer Science, vol 12837. Springer, Cham. https://doi.org/10.1007/978-3-030-84529-2_43

DOI: https://doi.org/10.1007/978-3-030-84529-2_43
Published: 09 August 2021
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-84528-5
Online ISBN: 978-3-030-84529-2
eBook Packages: Computer Science, Computer Science (R0)
