
Dynamic interactive learning network for audio-visual event localization

Published in Applied Intelligence (2023)

Abstract

Audio-visual event (AVE) localization aims to detect whether an event exists in each video segment and to predict its category. An event is recognized as an AVE only when it is both audible and visible. However, the information from the auditory and visual modalities is sometimes asymmetrical across a video sequence, which leads to incorrect predictions. To address this challenge, we introduce a dynamic interactive learning network that dynamically explores intra- and inter-modal relationships, conditioned on the other modality, for better AVE localization. Specifically, our approach employs a dynamic fusion attention module over intra- and inter-modal features, which enables each modality to attend more to the regions the other modality deems informative and less to the regions the other modality regards as noise. In addition, we introduce an audio-visual difference loss that reduces the distance between the auditory and visual representations. Extensive experiments on the AVE dataset demonstrate the superior performance of our method. The source code will be available at https://github.com/hanliang/DILN.
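To make the two ideas in the abstract concrete, the sketch below illustrates how one modality's segment features can be re-weighted by attention scores computed from the other modality, and how a simple difference loss can pull the audio and visual representations closer. This is a minimal illustration under assumed shapes and names (CrossModalGate, audio_visual_difference_loss, a 256-dimensional feature space, and mean squared error as the distance are all our assumptions), not the authors' implementation; see the repository linked above for that.

```python
# Hypothetical sketch of (1) cross-modal attention, where each modality
# re-weights its own segment features using cues from the other modality,
# and (2) an audio-visual difference loss that reduces the distance between
# the two modality representations. Names and dimensions are illustrative.

import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalGate(nn.Module):
    """Gate one modality's segment features with attention computed
    from the other modality."""

    def __init__(self, dim: int = 256):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)

    def forward(self, own: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # own, other: (batch, segments, dim)
        q = self.query(other)                     # cues from the other modality
        k = self.key(own)
        attn = torch.softmax(
            torch.bmm(q, k.transpose(1, 2)) / k.size(-1) ** 0.5, dim=-1
        )                                         # (batch, segments, segments)
        return torch.bmm(attn, own)               # re-weighted own features


def audio_visual_difference_loss(a: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Pull audio and visual segment representations together
    (MSE here; the paper may use a different distance)."""
    return F.mse_loss(a, v)


if __name__ == "__main__":
    B, T, D = 2, 10, 256                          # batch, segments, feature dim
    audio, video = torch.randn(B, T, D), torch.randn(B, T, D)
    gate = CrossModalGate(D)
    audio_attended = gate(audio, video)           # audio guided by video
    video_attended = gate(video, audio)           # video guided by audio
    loss = audio_visual_difference_loss(audio_attended, video_attended)
    print(loss.item())
```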


Availability of data and materials

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Acknowledgements

This work was supported by the National Natural Science Foundation of China (Grant Nos. 62102159 and 62272178), the Humanities and Social Science Fund of the Ministry of Education of China (Grant No. 21YJC870002), the Fundamental Research Funds for the Central Universities (Grant No. CCNU22QN017), the Knowledge Innovation Program of Wuhan-Shuguang Project (Grant No. 2022010801020287), and the Natural Science Foundation of Hubei Province (Grant No. 2023AFB1018). The authors gratefully acknowledge financial support from the China Scholarship Council (CSC).

Author information


Contributions

Jincai Chen: Conceptualization, Funding acquisition. Han Liang: Methodology, Software, Writing (original draft and editing). Ruili Wang: Resources, Writing (review). Jiangfeng Zeng: Project administration, Funding acquisition, Writing (review). Ping Lu: Writing (review).

Corresponding author

Correspondence to Jiangfeng Zeng.

Ethics declarations

Competing interests

We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Ethical and informed consent for data used

The dataset used in this study is openly available at https://sites.google.com/view/audiovisualresearch and does not raise ethical concerns.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Chen, J., Liang, H., Wang, R. et al. Dynamic interactive learning network for audio-visual event localization. Appl Intell 53, 30431–30442 (2023). https://doi.org/10.1007/s10489-023-05146-7

