Audio-Visual Salient Object Detection

  • Conference paper
Intelligent Computing Theories and Application (ICIC 2021)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 12837)


Abstract

This paper studies audio-visual salient object detection. Salient object detection aims to detect and mark the objects that attract the most human attention in a visual scene. Traditional visual salient object detection uses only images or video frames and does not model human multi-modal perception, which includes the interaction between vision and hearing. To improve visual salient object detection, we incorporate the audio modality into the task by applying a two-stream audio-visual deep learning network. To this end, we also build an audio-visual salient object detection dataset, called AVSOD, based on an existing dataset. To verify the effectiveness of the audio modality, we compare the performance of the deep learning model with and without it. The experimental results demonstrate that the audio modality supplements visual salient object detection well, and they also verify the effectiveness of the proposed dataset.

This work was supported by the Fundamental Research Funds for the Central Universities of China under Grant No. 20720190028, and the National Natural Science Foundation of China under Grant 62072383.



Author information

Corresponding author

Correspondence to Liang Song.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Cheng, S., Song, L., Tang, J., Guo, S. (2021). Audio-Visual Salient Object Detection. In: Huang, DS., Jo, KH., Li, J., Gribova, V., Hussain, A. (eds) Intelligent Computing Theories and Application. ICIC 2021. Lecture Notes in Computer Science, vol 12837. Springer, Cham. https://doi.org/10.1007/978-3-030-84529-2_43

  • DOI: https://doi.org/10.1007/978-3-030-84529-2_43

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-84528-5

  • Online ISBN: 978-3-030-84529-2

  • eBook Packages: Computer Science, Computer Science (R0)
