
Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization

Conference paper

Pattern Recognition and Computer Vision (PRCV 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14430)


Abstract

The audio-visual event localization (AVE) task focuses on localizing events whose signals occur in both the audio and visual modalities. Existing approaches primarily emphasize multimodal (i.e., fused audio-visual) feature processing to capture high-level event semantics, while overlooking the potential of unimodal (i.e., audio-only or visual-only) features for distinguishing unimodal event segments, in which the event signal appears in only one modality. To overcome this limitation, we propose the Unimodal-Multimodal Collaborative Enhancement (UMCE) framework for audio-visual event localization. The framework proceeds in several steps. First, the audio and visual features are enhanced by the multimodal features and then adaptively fused to further enhance the multimodal features. Simultaneously, the unimodal features collaborate with the multimodal features to filter out unimodal event segments. Finally, a dual interaction mechanism exchanges information between the segment and video levels, which jointly emphasize event content, and the resulting video features are used for event classification. Experimental results demonstrate that our UMCE framework significantly outperforms state-of-the-art methods in both the supervised and weakly supervised AVE settings.

Supported by the National Science Foundation for Young Scientists of China (62206315), the China Postdoctoral Science Foundation (2022M713574), and the Fundamental Research Funds for the Central Universities, Sun Yat-sen University (23ptpy112).
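To make the pipeline described in the abstract concrete, the following is a minimal PyTorch sketch of that flow. Everything in it is an assumption for illustration: the class name UMCESketch, the cross-attention enhancement, the gated fusion, the uni_filter head, and the attention-style pooling are plausible stand-ins for the steps the abstract names, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class UMCESketch(nn.Module):
    """Illustrative skeleton of the UMCE flow; not the authors' code."""

    def __init__(self, dim=256, num_classes=28):
        super().__init__()
        # Multimodal features enhance each unimodal stream (cross-attention).
        self.enhance_a = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.enhance_v = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        # Adaptive (gated) fusion of the enhanced unimodal streams.
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())
        # Unimodal-event filter: flags segments where only one modality fires.
        self.uni_filter = nn.Linear(2 * dim, 1)
        # Segment-level relevance scores and the video-level event classifier.
        self.seg_score = nn.Linear(dim, 1)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, a, v, m):
        # a, v, m: (batch, segments, dim) audio, visual, multimodal features.
        a_e, _ = self.enhance_a(a, m, m)           # multimodal -> audio enhancement
        v_e, _ = self.enhance_v(v, m, m)           # multimodal -> visual enhancement
        g = self.gate(torch.cat([a_e, v_e], -1))   # adaptive fusion weights
        m_e = m + g * a_e + (1.0 - g) * v_e        # further-enhanced multimodal features
        # Unimodal-multimodal collaboration flags likely unimodal-only segments.
        uni = torch.sigmoid(self.uni_filter(torch.cat([a_e * m_e, v_e * m_e], -1)))
        # Segment/video dual interaction: segment relevance, down-weighted for
        # unimodal segments, pools the segments into one video-level feature.
        w = torch.softmax(self.seg_score(m_e), dim=1) * (1.0 - uni)
        video = (w * m_e).sum(dim=1)               # video-level event feature
        return self.classifier(video), uni.squeeze(-1)
```

For instance, with 10-second videos split into ten one-second segments, a, v, and m would each be (batch, 10, 256) tensors; the first output gives video-level event logits and the second per-segment unimodal-event probabilities usable for localization.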



Author information


Correspondence to Jingke Meng.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper


Cite this paper

Tian, H., Meng, J., Yao, Y., Zheng, W. (2024). Unimodal-Multimodal Collaborative Enhancement for Audio-Visual Event Localization. In: Liu, Q., et al. Pattern Recognition and Computer Vision. PRCV 2023. Lecture Notes in Computer Science, vol 14430. Springer, Singapore. https://doi.org/10.1007/978-981-99-8537-1_17


  • DOI: https://doi.org/10.1007/978-981-99-8537-1_17

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-99-8536-4

  • Online ISBN: 978-981-99-8537-1

  • eBook Packages: Computer Science, Computer Science (R0)
