Abstract
The audio-visual event localization (AVE) task aims to localize audio-visual events, i.e., events whose signals occur in both the audio and visual modalities. Existing approaches primarily emphasize multimodal (audio-visual fused) feature processing to capture high-level event semantics, while overlooking the potential of unimodal (audio-only or visual-only) features for distinguishing unimodal event segments, in which the event signal appears in only one modality. To overcome this limitation, we propose the Unimodal-Multimodal Collaborative Enhancement (UMCE) framework for audio-visual event localization. The framework consists of several key steps. First, the audio and visual features are enhanced by the multimodal features and then adaptively fused to further enhance the multimodal features. Simultaneously, the unimodal features collaborate with the multimodal features to filter out unimodal event segments. Finally, a dual interaction mechanism exchanges information between the segment and video levels so that both collaboratively emphasize event content, and the resulting video-level features are used for event classification. Experimental results demonstrate that UMCE significantly outperforms state-of-the-art methods in both supervised and weakly supervised AVE settings.
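For readers who prefer code to prose, the following is a minimal PyTorch sketch of the data flow the abstract outlines. Everything concrete here is an illustrative assumption on our part: the module names, feature dimension, gating-based fusion, sigmoid filter, and attention pooling are placeholders, not the authors' architecture; in particular, the dual segment-video interaction mechanism is reduced to simple attention pooling. The 28 output classes follow the event categories of the standard AVE dataset.

```python
import torch
import torch.nn as nn


class UMCESketch(nn.Module):
    """Hypothetical sketch of the UMCE data flow, not the paper's model."""

    def __init__(self, dim: int = 256, num_classes: int = 28):
        super().__init__()
        # Cross-modal attention: multimodal context enhances each unimodal stream.
        self.audio_enhance = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.visual_enhance = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Adaptive fusion: a learned gate weights the two enhanced streams.
        self.fusion_gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Unimodal-event filter: scores how likely each segment is a genuine
        # audio-visual event (vs. an event present in only one modality).
        self.av_score = nn.Sequential(nn.Linear(3 * dim, 1), nn.Sigmoid())
        # Segment-to-video attention pooling: a stand-in for the paper's
        # dual segment/video interaction mechanism.
        self.video_attn = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, visual, multimodal):
        # audio, visual, multimodal: (batch, segments, dim) segment features.
        a, _ = self.audio_enhance(audio, multimodal, multimodal)
        v, _ = self.visual_enhance(visual, multimodal, multimodal)
        # Enhanced unimodal features are adaptively fused back into the
        # multimodal stream.
        g = self.fusion_gate(torch.cat([a, v], dim=-1))
        m = multimodal + g * a + (1.0 - g) * v
        # Unimodal and multimodal features collaborate to suppress segments
        # whose event signal appears in only one modality.
        keep = self.av_score(torch.cat([a, v, m], dim=-1))  # (batch, segments, 1)
        m = keep * m
        # Attention-pool segments into a video-level feature for classification.
        w = torch.softmax(self.video_attn(m), dim=1)
        video_feat = (w * m).sum(dim=1)
        return self.classifier(video_feat), keep.squeeze(-1)


# Usage with random stand-in features:
B, T, D = 2, 10, 256
model = UMCESketch(dim=D)
logits, segment_scores = model(
    torch.randn(B, T, D), torch.randn(B, T, D), torch.randn(B, T, D)
)
```

Under the usual AVE protocols, per-segment scores such as `keep` would receive segment-level supervision in the fully supervised setting, while only the video-level logits are supervised in the weakly supervised one.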
Supported by National Science Foundation for Young Scientists of China (62206315), China Postdoctoral Science Foundation (2022M713574) and Fundamental Research Funds for the Central Universities, Sun Yat-sen University (23ptpy112).
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.
Cite this paper
Tian, H., Meng, J., Yao, Y., Zheng, W. (2024). Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14430. Springer, Singapore. https://doi.org/10.1007/978-981-99-8537-1_17
Publisher Name: Springer, Singapore
Print ISBN: 978-981-99-8536-4
Online ISBN: 978-981-99-8537-1
eBook Packages: Computer Science, Computer Science (R0)