ABSTRACT
The weakly supervised audio-visual video parsing (AVVP) task aims to parse a video into a set of modality-wise events (i.e., audible, visible, or both), recognize the categories of these events, and localize their temporal boundaries. Given the prevalence of both synchronous and asynchronous audio-visual content in multi-modal videos, it is crucial to capture and integrate contextual events occurring at different moments and temporal scales. Although some researchers have made preliminary attempts at modeling event semantics of various temporal lengths, they mostly perform only a late fusion of multi-scale features across modalities. A comprehensive cross-modal, multi-scale temporal fusion strategy remains largely unexplored in the literature. To address this gap, we propose a novel framework named Audio-Visual Fusion Architecture Search (AVFAS) that automatically finds the optimal multi-scale temporal fusion strategy within and between modalities. Our framework generates a set of audio and visual features at distinct temporal scales and employs three modality-wise modules to search for multi-scale feature selection and fusion strategies, jointly modeling modality-specific discriminative information. Furthermore, to better align asynchronous audio-visual content, we introduce a Position- and Length-Adaptive Temporal Attention (PLATA) mechanism for cross-modal feature fusion. Extensive quantitative and qualitative experimental results demonstrate the effectiveness and efficiency of our framework.
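The two core ideas in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption-laden toy, not the paper's implementation: it shows (1) generating temporal features at several scales and fusing them with a DARTS-style continuous relaxation (softmax over learnable architecture logits), and (2) a simplified "position- and length-adaptive" attention in which each query attends within a soft Gaussian window whose center and width would, in the full model, be predicted per query. All function names, the choice of scales, and the Gaussian window form are illustrative.

```python
import numpy as np

def multi_scale_features(x, scales=(1, 2, 4)):
    """Average-pool along time at each scale, then upsample back to length T.
    Returns one (T, D) array per temporal scale (illustrative choice of scales)."""
    T, D = x.shape
    outs = []
    for s in scales:
        pad = (-T) % s  # pad by repeating the last frame so T is divisible by s
        xp = np.concatenate([x, np.repeat(x[-1:], pad, axis=0)], axis=0)
        pooled = xp.reshape(-1, s, D).mean(axis=1)   # (ceil(T/s), D)
        outs.append(np.repeat(pooled, s, axis=0)[:T])  # back to (T, D)
    return outs

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def darts_fuse(branches, alpha):
    """DARTS-style continuous relaxation: fused = sum_i softmax(alpha)_i * branch_i.
    In an architecture search, alpha would be optimized jointly with the weights."""
    w = softmax(alpha)
    return sum(wi * b for wi, b in zip(w, branches))

def windowed_cross_attention(q, k, v, centers, widths):
    """Toy position- and length-adaptive attention: scaled dot-product scores
    plus a per-query Gaussian positional prior over key positions."""
    scores = q @ k.T / np.sqrt(q.shape[1])                       # (Tq, Tk)
    pos = np.arange(k.shape[0])
    prior = -((pos[None, :] - centers[:, None]) ** 2) / (2 * widths[:, None] ** 2)
    attn = np.apply_along_axis(softmax, 1, scores + prior)       # row-wise softmax
    return attn @ v

rng = np.random.default_rng(0)
x = rng.standard_normal((10, 8))       # 10 time steps, 8-dim features
alpha = np.zeros(3)                    # learnable architecture logits, one per scale
fused = darts_fuse(multi_scale_features(x), alpha)
out = windowed_cross_attention(fused, x, x,
                               centers=np.arange(10, dtype=float),
                               widths=np.full(10, 2.0))
print(fused.shape, out.shape)
```

A narrow width concentrates attention near the predicted center (useful for synchronous events), while a wide width lets a query gather context from temporally distant frames, which is the intuition behind handling audio-visual asynchrony.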
Index Terms
- Multi-Modal and Multi-Scale Temporal Fusion Architecture Search for Audio-Visual Video Parsing