Abstract
The audio-visual event localization (AVE) task aims to localize audio-visual events, i.e., events whose signals occur in both the audio and visual modalities. Existing approaches primarily emphasize multimodal (audio-visual fused) feature processing to capture high-level event semantics, while overlooking the potential of unimodal (audio-only or visual-only) features for distinguishing unimodal event segments, in which the event signal appears in only one modality. To overcome this limitation, we propose the Unimodal-Multimodal Collaborative Enhancement (UMCE) framework for audio-visual event localization. The framework consists of several key steps. First, the audio and visual features are enhanced by the multimodal features and then adaptively fused to further enhance the multimodal features. Simultaneously, the unimodal features collaborate with the multimodal features to filter out unimodal event segments. Finally, a dual interaction mechanism exchanges information between the segment and video levels so that both collaboratively emphasize event content, and the resulting video-level features are used for event classification. Experimental results demonstrate that UMCE significantly outperforms state-of-the-art methods in both supervised and weakly supervised AVE settings.
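For readers who prefer code to prose, the following is a minimal PyTorch sketch of the data flow the abstract outlines. Everything concrete here is an illustrative assumption on our part: the module names, feature dimension, gating-based fusion, sigmoid filter, and attention pooling are placeholders, not the authors' architecture; in particular, the dual segment-video interaction mechanism is reduced to simple attention pooling. The 28 output classes follow the event categories of the standard AVE dataset.

```python
import torch
import torch.nn as nn


class UMCESketch(nn.Module):
    """Hypothetical sketch of the UMCE data flow, not the paper's model."""

    def __init__(self, dim: int = 256, num_classes: int = 28):
        super().__init__()
        # Cross-modal attention: multimodal context enhances each unimodal stream.
        self.audio_enhance = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.visual_enhance = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Adaptive fusion: a learned gate weights the two enhanced streams.
        self.fusion_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Unimodal-event filter: scores how likely each segment is a genuine
        # audio-visual event (vs. an event present in only one modality).
        self.av_score = nn.Sequential(nn.Linear(3 * dim, 1), nn.Sigmoid())
        # Segment-to-video attention pooling: a stand-in for the paper's
        # dual segment/video interaction mechanism.
        self.video_attn = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, visual, multimodal):
        # audio, visual, multimodal: (batch, segments, dim) segment features.
        a, _ = self.audio_enhance(audio, multimodal, multimodal)
        v, _ = self.visual_enhance(visual, multimodal, multimodal)
        # Enhanced unimodal features are adaptively fused back into the
        # multimodal stream.
        g = self.fusion_gate(torch.cat([a, v], dim=-1))
        m = multimodal + g * a + (1.0 - g) * v
        # Unimodal and multimodal features collaborate to suppress segments
        # whose event signal appears in only one modality.
        keep = self.av_score(torch.cat([a, v, m], dim=-1))  # (batch, segments, 1)
        m = keep * m
        # Attention-pool segments into a video-level feature for classification.
        w = torch.softmax(self.video_attn(m), dim=1)
        video_feat = (w * m).sum(dim=1)
        return self.classifier(video_feat), keep.squeeze(-1)


# Usage with random stand-in features:
B, T, D = 2, 10, 256
model = UMCESketch(dim=D)
logits, segment_scores = model(
    torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D)
)
```

Under the usual AVE protocols, per-segment scores such as `keep` would receive segment-level supervision in the fully supervised setting, while only the video-level logits are supervised in the weakly supervised one.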
Supported by National Science Foundation for Young Scientists of China (62206315), China Postdoctoral Science Foundation (2022M713574) and Fundamental Research Funds for the Central Universities, Sun Yat-sen University (23ptpy112).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Tian, H., Meng, J., Yao, Y., Zheng, W. (2024). Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14430. Springer, Singapore. https://doi.org/10.1007/978-981-99-8537-1_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8536-4
Online ISBN: 978-981-99-8537-1
eBook Packages: Computer Science, Computer Science (R0)