DOI: 10.1145/3581783.3612428
Research Article

Bio-Inspired Audiovisual Multi-Representation Integration via Self-Supervised Learning

Published: 27 October 2023

ABSTRACT

Audiovisual self-supervised representation learning has made significant strides across a range of audiovisual tasks. However, existing methods mostly model a single type of correspondence between the audio and visual modalities, ignoring the complex relationships between them, and therefore cannot perform cross-modal understanding in more natural audiovisual scenes. Biological studies have shown that human learning is shaped by multi-layered perceptual synchronization. Motivated by this, we propose to exploit the relationships that naturally exist between the audio and visual modalities and to learn audiovisual representations under multi-layered perceptual integration. First, we introduce an audiovisual multi-representation pretext task that integrates semantic consistency, temporal alignment, and spatial correspondence. Second, we propose a self-supervised audiovisual multi-representation learning approach that simultaneously learns the perceptual relationships between the visual and audio modalities at the semantic, temporal, and spatial levels. To establish fine-grained correspondence between visual objects and sounds, we further propose an audiovisual object detection module that detects potential sounding objects by combining unsupervised knowledge at multiple levels. In addition, we propose a modality-wise loss and a task-wise loss that learn a subspace-orthogonal representation space, making the relations among representations more discriminative. Finally, experimental results demonstrate that jointly understanding the semantic, temporal, and spatial correspondence between the audio and visual modalities enables the model to perform better on downstream tasks such as sound separation, sound spatialization, and audiovisual segmentation.
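The abstract names three pretext objectives (semantic consistency, temporal alignment, spatial correspondence) plus modality-wise and task-wise losses that encourage a subspace-orthogonal representation space. The PyTorch sketch below shows one plausible way such a combined objective could be wired together; it is not the authors' code, and the function names, feature shapes, the InfoNCE form of the contrastive terms, and the cross-correlation orthogonality penalty are all illustrative assumptions.

```python
# Minimal sketch of a multi-level audiovisual objective, assuming
# InfoNCE-style contrastive terms and a Frobenius-norm orthogonality
# penalty between the semantic / temporal / spatial subspaces.
import torch
import torch.nn.functional as F


def info_nce(a, v, temperature=0.07):
    """Symmetric InfoNCE between matched audio/visual embeddings.

    a, v: (B, D) L2-normalized embeddings; row i of `a` pairs with
    row i of `v` (positive); all other rows serve as negatives.
    """
    logits = a @ v.t() / temperature                      # (B, B) similarities
    labels = torch.arange(a.size(0), device=a.device)     # diagonal = positives
    return 0.5 * (F.cross_entropy(logits, labels) +
                  F.cross_entropy(logits.t(), labels))


def subspace_orthogonality(z_sem, z_tmp, z_spa):
    """Penalty pushing the three task subspaces toward mutual
    orthogonality (one plausible reading of the task-wise loss)."""
    loss = 0.0
    for x, y in [(z_sem, z_tmp), (z_sem, z_spa), (z_tmp, z_spa)]:
        # mean squared cross-correlation between the two subspaces
        loss = loss + (x.t() @ y).pow(2).mean()
    return loss


def multi_representation_loss(feats_a, feats_v, w=(1.0, 1.0, 1.0, 0.1)):
    """feats_a / feats_v: dicts with 'semantic', 'temporal', 'spatial'
    entries, each a (B, D) tensor from the corresponding head."""
    za = {k: F.normalize(t, dim=-1) for k, t in feats_a.items()}
    zv = {k: F.normalize(t, dim=-1) for k, t in feats_v.items()}

    l_sem = info_nce(za['semantic'], zv['semantic'])  # clip-level pairing
    l_tmp = info_nce(za['temporal'], zv['temporal'])  # in-sync vs. shifted
    l_spa = info_nce(za['spatial'],  zv['spatial'])   # e.g. left/right cues
    l_ortho = (subspace_orthogonality(za['semantic'], za['temporal'], za['spatial']) +
               subspace_orthogonality(zv['semantic'], zv['temporal'], zv['spatial']))

    return w[0] * l_sem + w[1] * l_tmp + w[2] * l_spa + w[3] * l_ortho
```

For instance, with a batch of 8 clips and 128-d heads, feats_a = {'semantic': torch.randn(8, 128), 'temporal': torch.randn(8, 128), 'spatial': torch.randn(8, 128)} (and likewise for feats_v) yields a single scalar loss suitable for backpropagation; the weights w would need tuning per downstream task.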


Published in

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN: 9798400701085
DOI: 10.1145/3581783

Copyright © 2023 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall Acceptance Rate: 995 of 4,171 submissions, 24%
