Abstract
Audio-visual event (AVE) localization aims to detect whether an event occurs in each video segment and to predict its category; an event is recognized as an AVE only when it is both audible and visible. However, the auditory and visual information in a video sequence is sometimes asymmetric, which leads to incorrect predictions. To address this challenge, we introduce a dynamic interactive learning network (DILN) that dynamically explores intra- and inter-modal relationships, conditioning each modality on the other for better AVE localization. Specifically, our approach employs a dynamic intra- and inter-modal fusion attention module that lets each modality attend more to the regions the other modality deems informative and less to the regions the other modality regards as noise. In addition, we introduce an audio-visual difference loss that reduces the distance between the auditory and visual representations. Extensive experiments on the AVE dataset demonstrate the superior performance of the proposed method. The source code will be available at https://github.com/hanliang/DILN.
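To make the two components in the abstract concrete, the following is a rough, hypothetical PyTorch sketch (not the authors' released implementation, which is promised at the GitHub URL above) of how a gated cross-modal fusion, where each modality re-weights its own features based on the other, and a representation-alignment loss could be written. All class, function, and variable names here are our own assumptions for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicCrossModalFusion(nn.Module):
    """Sketch: each modality re-weights its own segment features using
    gates computed from the other modality, so regions the other modality
    finds informative are emphasized and noisy regions are suppressed."""
    def __init__(self, dim):
        super().__init__()
        self.audio_gate = nn.Linear(dim, dim)   # visual features -> gates for audio
        self.visual_gate = nn.Linear(dim, dim)  # audio features -> gates for visual

    def forward(self, audio, visual):
        # audio, visual: (batch, time, dim) segment-level features
        a_att = torch.sigmoid(self.audio_gate(visual))
        v_att = torch.sigmoid(self.visual_gate(audio))
        # Residual gating: keep the original signal, add the re-weighted one.
        audio_out = audio + audio * a_att
        visual_out = visual + visual * v_att
        return audio_out, visual_out

def audio_visual_difference_loss(audio, visual):
    """Pulls the two modality representations closer by penalizing the
    mean squared distance between their L2-normalized features."""
    return F.mse_loss(F.normalize(audio, dim=-1),
                      F.normalize(visual, dim=-1))

# Usage on dummy inputs: 4 videos, 10 one-second segments, 128-dim features.
fusion = DynamicCrossModalFusion(dim=128)
a = torch.randn(4, 10, 128)
v = torch.randn(4, 10, 128)
a_out, v_out = fusion(a, v)
loss = audio_visual_difference_loss(a_out, v_out)
```

The residual connection in the forward pass is one plausible design choice: it keeps each modality's original information intact while the sigmoid gates only modulate emphasis, which avoids a modality being fully silenced when the other modality is uninformative.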
Availability of data and materials
The data that support the findings of this study are available from the corresponding author upon reasonable request.
Acknowledgements
This work was supported by the National Natural Science Foundation of China under Grants No. 62102159 and No. 62272178, the Humanities and Social Science Fund of the Ministry of Education of China under Grant No. 21YJC870002, the Fundamental Research Funds for the Central Universities under Grant No. CCNU22QN017, the Knowledge Innovation Program of Wuhan-Shuguang Project under Grant No. 2022010801020287, and the Natural Science Foundation of Hubei Province under Grant No. 2023AFB1018. The authors gratefully acknowledge financial support from the China Scholarship Council (CSC).
Author information
Contributions
Jincai Chen: Conceptualization, Funding acquisition. Han Liang: Methodology, Software, Writing (original draft and editing). Ruili Wang: Resources, Writing (review). Jiangfeng Zeng: Project administration, Funding acquisition, Writing (review). Ping Lu: Writing (review).
Ethics declarations
Competing interests
We declare that we have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical and informed consent for data used
The dataset used in this study is openly available at https://sites.google.com/view/audiovisualresearch and does not involve ethical concerns.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Chen, J., Liang, H., Wang, R. et al. Dynamic interactive learning network for audio-visual event localization. Appl Intell 53, 30431–30442 (2023). https://doi.org/10.1007/s10489-023-05146-7