DOI: 10.1145/3581783.3611947

Research article

Multi-Modal and Multi-Scale Temporal Fusion Architecture Search for Audio-Visual Video Parsing

Published: 27 October 2023

ABSTRACT

The weakly supervised audio-visual video parsing (AVVP) task aims to parse a video into a set of modality-wise events (i.e., audible, visible, or both), recognize the categories of these events, and localize their temporal boundaries. Given the prevalence of both audio-visual synchronous and asynchronous content in multi-modal videos, it is crucial to capture and integrate the contextual events occurring at different moments and temporal scales. Although some researchers have made preliminary attempts at modeling event semantics with various temporal lengths, they mostly perform only a late fusion of multi-scale features across modalities. A comprehensive cross-modal and multi-scale temporal fusion strategy remains largely unexplored in the literature. To address this gap, we propose a novel framework named Audio-Visual Fusion Architecture Search (AVFAS) that can automatically find the optimal multi-scale temporal fusion strategy within and between modalities. Our framework generates a set of audio and visual features with distinct temporal scales and employs three modality-wise modules to search for multi-scale feature selection and fusion strategies, jointly modeling modality-specific discriminative information. Furthermore, to enhance the alignment of audio-visual asynchrony, we introduce a Position- and Length-Adaptive Temporal Attention (PLATA) mechanism for cross-modal feature fusion. Extensive quantitative and qualitative experimental results demonstrate the effectiveness and efficiency of our framework.
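The paper's full method is not reproduced on this page, but the two ingredients the abstract names, generating features at distinct temporal scales and searching over how to fuse them, can be sketched in a minimal form. The snippet below is an illustration only, not the authors' implementation: the function names, the moving-average smoothing as the multi-scale operator, and the softmax relaxation over architecture weights (in the spirit of DARTS-style differentiable search) are all assumptions.

```python
import numpy as np

def multi_scale_features(x, scales=(1, 2, 4)):
    """Produce temporally smoothed copies of a feature sequence.

    x: (T, D) array of per-snippet features (audio or visual).
    Each scale k applies a length-preserving moving average over a
    window of k snippets, so all scales can later be fused elementwise.
    (Illustrative choice; the paper may use different multi-scale ops.)
    """
    T, _ = x.shape
    outs = []
    for k in scales:
        pad = k // 2
        # edge-pad so the moving-average output keeps length T
        xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)), mode="edge")
        smoothed = np.stack([xp[t:t + k].mean(axis=0) for t in range(T)])
        outs.append(smoothed)
    return outs  # list of (T, D) arrays, one per temporal scale

def soft_scale_fusion(features, alpha):
    """DARTS-style soft selection over scales.

    alpha holds one learnable logit per scale; a softmax turns them into
    mixing weights. During search the weights are trained jointly with
    the network; afterwards the highest-weight scale(s) can be kept as
    the discovered fusion strategy.
    """
    w = np.exp(alpha - alpha.max())  # numerically stable softmax
    w = w / w.sum()
    return sum(wi * fi for wi, fi in zip(w, features))
```

As a usage sanity check, feeding a (T, D) sequence through `multi_scale_features` yields one (T, D) array per scale, and pushing one logit much higher than the others makes `soft_scale_fusion` collapse to that single scale.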

Supplemental Material

1120-video.mp4 (mp4, 36.1 MB)


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023, 9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783
Copyright © 2023 ACM

Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 995 of 4,171 submissions (24%)

