Abstract
Audio-visual event (AVE) localization aims to detect whether an event occurs in each video segment and to predict its category; an event is recognized as an AVE only when it is both audible and visible. However, the auditory and visual information in a video sequence is sometimes asymmetric, which leads to incorrect predictions. To address this challenge, we introduce a dynamic interactive learning network (DILN) that dynamically explores intra- and inter-modal relationships, conditioning each modality on the other for better AVE localization. Specifically, our approach employs a dynamic intra- and inter-modal fusion attention module that lets each modality attend more to the regions the other modality deems informative and less to the regions the other modality regards as noise. In addition, we introduce an audio-visual difference loss that reduces the distance between the auditory and visual representations. Extensive experiments on the AVE dataset demonstrate the superior performance of the proposed method. The source code will be available at https://github.com/hanliang/DILN.
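To make the two components in the abstract concrete, the following is a rough, hypothetical PyTorch sketch (not the authors' released implementation, which is promised at the GitHub URL above) of how a gated cross-modal fusion, where each modality re-weights its own features based on the other, and a representation-alignment loss could be written. All class, function, and variable names here are our own assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCrossModalFusion(nn.Module):
    """Sketch: each modality re-weights its own segment features using
    gates computed from the other modality, so regions the other modality
    finds informative are emphasized and noisy regions are suppressed."""
    def __init__(self, dim):
        super().__init__()
        self.audio_gate = nn.Linear(dim, dim)   # visual features -> gates for audio
        self.visual_gate = nn.Linear(dim, dim)  # audio features -> gates for visual

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim) segment-level features
        a_att = torch.sigmoid(self.audio_gate(visual))
        v_att = torch.sigmoid(self.visual_gate(audio))
        # Residual gating: keep the original signal, add the re-weighted one.
        audio_out = audio + audio * a_att
        visual_out = visual + visual * v_att
        return audio_out, visual_out

def audio_visual_difference_loss(audio, visual):
    """Pulls the two modality representations closer by penalizing the
    mean squared distance between their L2-normalized features."""
    return F.mse_loss(F.normalize(audio, dim=-1),
                      F.normalize(visual, dim=-1))

# Usage on dummy inputs: 4 videos, 10 one-second segments, 128-dim features.
fusion = DynamicCrossModalFusion(dim=128)
a = torch.randn(4, 10, 128)
v = torch.randn(4, 10, 128)
a_out, v_out = fusion(a, v)
loss = audio_visual_difference_loss(a_out, v_out)
```

The residual connection in the forward pass is one plausible design choice: it keeps each modality's original information intact while the sigmoid gates only modulate emphasis, which avoids a modality being fully silenced when the other modality is uninformative.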
Availability of data and materials
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants No. 62102159 and No. 62272178, the Humanities and Social Science Fund of the Ministry of Education of China under Grant No. 21YJC870002, the Fundamental Research Funds for the Central Universities under Grant No. CCNU22QN017, the Knowledge Innovation Program of Wuhan-Shuguang Project under Grant No. 2022010801020287, and the Natural Science Foundation of Hubei Province under Grant No. 2023AFB1018. The authors gratefully acknowledge financial support from the China Scholarship Council (CSC).
Author information
Contributions
Jincai Chen: Conceptualization, Funding acquisition. Han Liang: Methodology, Software, Writing (original draft and editing). Ruili Wang: Resources, Writing (review). Jiangfeng Zeng: Project administration, Funding acquisition, Writing (review). Ping Lu: Writing (review).
Ethics declarations
Competing interests
We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and informed consent for data used
The dataset used in this study is openly available at https://sites.google.com/view/audiovisualresearch and does not involve ethical concerns.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, J., Liang, H., Wang, R. et al. Dynamic interactive learning network for audio-visual event localization. Appl Intell 53, 30431–30442 (2023). https://doi.org/10.1007/s10489-023-05146-7