Abstract
With the explosive growth of online video, video understanding has become an active research topic, and audio-visual event localization (AVE) is a step toward more challenging, higher-level semantic video understanding problems. Existing AVE methods do not fully exploit local temporal information and do not adequately model cross-modal fusion relationships at different scales. In this paper, we propose a Global-Local Temporal and Cross-Modal Network (GLTCM) for the supervised and weakly-supervised audio-visual event localization tasks. It is composed of a feature extraction module, a global-local temporal module, a cross-modality module, and a localization module. The global-local temporal module models temporal relationships across the entire sequence of segments and among surrounding segments, the cross-modality module models the cross-modal information of multi-modal features, and the localization module is based on multi-task learning. We evaluate the proposed method on both the supervised and weakly-supervised audio-visual event localization tasks. The experimental results demonstrate that our method is competitive on the public AVE dataset.
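To make the idea of combining global and local temporal context with cross-modal fusion concrete, the following is a minimal NumPy sketch. It is not the GLTCM implementation: the abstract does not specify the layers, so the self-attention used for global context, the windowed mean used for local context, the residual cross-attention, and every function name, the `radius` parameter, and the feature shapes (10 segments, 16 dimensions) are assumptions made purely for illustration.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def global_temporal(feats):
    # Assumed stand-in for global context: scaled dot-product
    # self-attention, so every segment attends to all T segments.
    d = feats.shape[-1]
    scores = feats @ feats.T / np.sqrt(d)      # (T, T)
    return softmax(scores, axis=-1) @ feats    # (T, d)


def local_temporal(feats, radius=1):
    # Assumed stand-in for local context: mean over each segment and
    # its neighbours within `radius`, with the window clipped at the edges.
    T = feats.shape[0]
    out = np.empty_like(feats)
    for t in range(T):
        lo, hi = max(0, t - radius), min(T, t + radius + 1)
        out[t] = feats[lo:hi].mean(axis=0)
    return out


def cross_modal(query, context):
    # Each segment of one modality attends over the segments of the
    # other modality; the result is added back as a residual.
    d = query.shape[-1]
    attn = softmax(query @ context.T / np.sqrt(d), axis=-1)
    return query + attn @ context


def global_local_cross_modal(audio, visual, radius=1):
    # Sum global and local context per modality, then fuse both
    # cross-modal directions by averaging (an arbitrary choice here).
    a = global_temporal(audio) + local_temporal(audio, radius)
    v = global_temporal(visual) + local_temporal(visual, radius)
    return 0.5 * (cross_modal(a, v) + cross_modal(v, a))


# Hypothetical inputs: 10 segments with 16-dimensional features each.
rng = np.random.default_rng(0)
audio = rng.standard_normal((10, 16))
visual = rng.standard_normal((10, 16))
fused = global_local_cross_modal(audio, visual)
print(fused.shape)
```

In a full model, the fused per-segment features would feed a localization head; the multi-task design mentioned in the abstract is not sketched here because its details are not given.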
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Wu, X., Qiu, J., Yue, Q. (2023). GLTCM: Global-Local Temporal and Cross-Modal Network for Audio-Visual Event Localization. In: Lu, H., et al. Image and Graphics. ICIG 2023. Lecture Notes in Computer Science, vol 14356. Springer, Cham. https://doi.org/10.1007/978-3-031-46308-2_17
DOI: https://doi.org/10.1007/978-3-031-46308-2_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-46307-5
Online ISBN: 978-3-031-46308-2
eBook Packages: Computer Science (R0)