GLTCM: Global-Local Temporal and Cross-Modal Network for Audio-Visual Event Localization

  • Conference paper
Image and Graphics (ICIG 2023)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14356)

Abstract

With the explosive growth of online video, video understanding has become an active research topic, and Audio-Visual Event Localization (AVE) is a step toward solving more challenging, higher-level semantic video understanding problems. Existing AVE methods do not fully exploit local temporal information and do not adequately model cross-modal fusion relationships at different scales. In this paper, we propose a Global-Local Temporal and Cross-Modal Network (GLTCM) for the supervised and weakly-supervised audio-visual event localization tasks. It consists of a feature extraction module, a global-local temporal module, a cross-modality module, and a localization module. The global-local temporal module models the temporal relationships across the entire video and among surrounding segments, the cross-modality module models the cross-modal information of multi-modal features, and the localization module is based on multi-task learning. We evaluate the proposed method on both the supervised and weakly-supervised audio-visual event localization tasks. The experimental results demonstrate that our method is competitive on the public AVE dataset.
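The abstract does not give the internals of the modules, so the following is only an illustrative sketch of the general idea, not the authors' implementation. It assumes (hypothetically) that global temporal modeling is self-attention over all segments, that local temporal modeling is the same attention restricted to a window of neighboring segments, and that cross-modal modeling is attention from one modality to the other. The names `attend`, `global_local_temporal`, and `cross_modal`, the additive fusion, and the window size are all assumptions made for illustration. The feature extraction and multi-task localization modules are not covered.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def attend(queries, keys, values, window=None):
    """Scaled dot-product attention over per-segment feature vectors.

    If `window` is given, the query at segment t only attends to keys at
    segments s with |t - s| <= window (assumed form of "local" modeling).
    """
    dim = len(keys[0])
    out = []
    for t, q in enumerate(queries):
        idx = [s for s in range(len(keys))
               if window is None or abs(t - s) <= window]
        weights = softmax([dot(q, keys[s]) / math.sqrt(dim) for s in idx])
        out.append([sum(w * values[s][d] for w, s in zip(weights, idx))
                    for d in range(len(values[0]))])
    return out

def global_local_temporal(feats, window=1):
    # Assumption: sum a global view (all segments) and a local view (neighbors).
    global_view = attend(feats, feats, feats)
    local_view = attend(feats, feats, feats, window=window)
    return [[g + l for g, l in zip(g_row, l_row)]
            for g_row, l_row in zip(global_view, local_view)]

def cross_modal(audio, visual):
    # Assumption: each modality queries the other; the two results are summed.
    audio_to_visual = attend(audio, visual, visual)
    visual_to_audio = attend(visual, audio, audio)
    return [[a + v for a, v in zip(a_row, v_row)]
            for a_row, v_row in zip(audio_to_visual, visual_to_audio)]

# Toy input: 3 segments, 2-dimensional features per modality (made-up values).
audio = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
visual = [[0.5, 0.5], [1.0, 0.0], [0.0, 2.0]]
fused = cross_modal(global_local_temporal(audio), global_local_temporal(visual))
print(len(fused), len(fused[0]))
```

In this sketch the output keeps one fused feature vector per segment, which a per-segment classifier could then consume. With `window=0` the local branch reduces to the identity, which is a convenient sanity check.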

Author information

Corresponding author

Correspondence to Xiaoyu Wu.

Copyright information

© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Wu, X., Qiu, J., Yue, Q. (2023). GLTCM: Global-Local Temporal and Cross-Modal Network for Audio-Visual Event Localization. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14356. Springer, Cham. https://doi.org/10.1007/978-3-031-46308-2_17

  • DOI: https://doi.org/10.1007/978-3-031-46308-2_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-46307-5

  • Online ISBN: 978-3-031-46308-2

  • eBook Packages: Computer Science, Computer Science (R0)
