Bio-Inspired Audiovisual Multi-Representation Integration via Self-Supervised Learning

ABSTRACT
Audiovisual self-supervised representation learning has made significant strides across a variety of audiovisual tasks. However, existing methods mostly model a single type of correspondence between the audio and visual modalities, ignoring the complex relationships between them, and are therefore unable to perform cross-modal understanding in natural audiovisual scenes. Biological studies have shown that human learning is shaped by multi-layered synchronization of perception. Inspired by this, we propose to exploit the naturally existing relationships between the audio and visual modalities to learn audiovisual representations under multi-layered perceptual integration. First, we introduce an audiovisual multi-representation pretext task that integrates semantic consistency, temporal alignment, and spatial correspondence. Second, we propose a self-supervised audiovisual multi-representation learning approach that simultaneously learns the perceptual relationships between the visual and audio modalities at the semantic, temporal, and spatial levels. To establish fine-grained correspondence between visual objects and sounds, we propose an audiovisual object detection module that detects potential sounding objects by combining unsupervised knowledge at multiple levels. In addition, we propose a modality-wise loss and a task-wise loss to learn a subspace-orthogonal representation space that makes the relationships among representations more discriminative. Finally, experimental results demonstrate that jointly understanding the semantic, temporal, and spatial correspondence between the audio and visual modalities enables the model to perform better on downstream tasks such as sound separation, sound spatialization, and audiovisual segmentation.
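To make the training objective concrete, below is a minimal PyTorch sketch of how the three correspondence levels and the subspace-orthogonality constraint described above could be combined. The abstract does not specify the paper's actual loss formulations, so everything here is an assumption: the symmetric InfoNCE contrastive loss standing in for each correspondence objective, the Frobenius-style cross-correlation penalty standing in for the modality-wise/task-wise orthogonality losses, and the names (`nce_loss`, `orthogonality_penalty`, `lambda_orth`) are all hypothetical.

```python
import torch
import torch.nn.functional as F


def nce_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/visual embeddings.

    a, v: (B, D) tensors whose rows at the same index form a positive pair.
    """
    a = F.normalize(a, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = a @ v.t() / temperature  # (B, B) cross-modal similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def orthogonality_penalty(z1, z2):
    """Soft penalty pushing two task subspaces toward orthogonality
    (squared entries of the D x D cross-correlation between them)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    return (z1.t() @ z2).pow(2).mean()


def multi_representation_loss(sem_a, sem_v, tmp_a, tmp_v, spa_a, spa_v,
                              lambda_orth=0.1):
    """Hypothetical combined objective: one contrastive term per level
    (semantic, temporal, spatial) plus pairwise orthogonality between
    the task-specific audio subspaces."""
    task_losses = (nce_loss(sem_a, sem_v)
                   + nce_loss(tmp_a, tmp_v)
                   + nce_loss(spa_a, spa_v))
    orth = (orthogonality_penalty(sem_a, tmp_a)
            + orthogonality_penalty(tmp_a, spa_a)
            + orthogonality_penalty(sem_a, spa_a))
    return task_losses + lambda_orth * orth
```

The design intuition follows the abstract: each contrastive term ties one level of audiovisual correspondence, while the orthogonality terms keep the semantic, temporal, and spatial subspaces from collapsing into one another, so the learned relations stay discriminative.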