Bio-Inspired Audiovisual Multi-Representation Integration via Self-Supervised Learning

ABSTRACT
Audiovisual self-supervised representation learning has made significant strides across a variety of audiovisual tasks. However, existing methods mostly model a single type of correspondence between the audio and visual modalities, ignoring the complex relationships between them, and are therefore unable to perform cross-modal understanding in natural audiovisual scenes. Biological studies have shown that human learning is shaped by multi-layered synchronization of perception. Inspired by this, we propose to exploit the naturally existing relationships between the audio and visual modalities to learn audiovisual representations under multi-layered perceptual integration. First, we introduce an audiovisual multi-representation pretext task that integrates semantic consistency, temporal alignment, and spatial correspondence. Second, we propose a self-supervised audiovisual multi-representation learning approach that simultaneously learns the perceptual relationships between the visual and audio modalities at the semantic, temporal, and spatial levels. To establish fine-grained correspondence between visual objects and sounds, we propose an audiovisual object detection module that detects potential sounding objects by combining unsupervised knowledge at multiple levels. In addition, we propose a modality-wise loss and a task-wise loss to learn a subspace-orthogonal representation space that makes the relationships among representations more discriminative. Finally, experimental results demonstrate that jointly understanding the semantic, temporal, and spatial correspondence between the audio and visual modalities enables the model to perform better on downstream tasks such as sound separation, sound spatialization, and audiovisual segmentation.
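To make the training objective concrete, below is a minimal PyTorch sketch of how the three correspondence levels and the subspace-orthogonality constraint described above could be combined. The abstract does not specify the paper's actual loss formulations, so everything here is an assumption: the symmetric InfoNCE contrastive loss standing in for each correspondence objective, the Frobenius-style cross-correlation penalty standing in for the modality-wise/task-wise orthogonality losses, and the names (`nce_loss`, `orthogonality_penalty`, `lambda_orth`) are all hypothetical.

```python
import torch
import torch.nn.functional as F


def nce_loss(a, v, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired audio/visual embeddings.

    a, v: (B, D) tensors whose rows at the same index form a positive pair.
    """
    a = F.normalize(a, dim=-1)
    v = F.normalize(v, dim=-1)
    logits = a @ v.t() / temperature  # (B, B) cross-modal similarity matrix
    targets = torch.arange(a.size(0), device=a.device)
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))


def orthogonality_penalty(z1, z2):
    """Soft penalty pushing two task subspaces toward orthogonality
    (squared entries of the D x D cross-correlation between them)."""
    z1 = F.normalize(z1, dim=-1)
    z2 = F.normalize(z2, dim=-1)
    return (z1.t() @ z2).pow(2).mean()


def multi_representation_loss(sem_a, sem_v, tmp_a, tmp_v, spa_a, spa_v,
                              lambda_orth=0.1):
    """Hypothetical combined objective: one contrastive term per level
    (semantic, temporal, spatial) plus pairwise orthogonality between
    the task-specific audio subspaces."""
    task_losses = (nce_loss(sem_a, sem_v)
                   + nce_loss(tmp_a, tmp_v)
                   + nce_loss(spa_a, spa_v))
    orth = (orthogonality_penalty(sem_a, tmp_a)
            + orthogonality_penalty(tmp_a, spa_a)
            + orthogonality_penalty(sem_a, spa_a))
    return task_losses + lambda_orth * orth
```

The design intuition follows the abstract: each contrastive term ties one level of audiovisual correspondence, while the orthogonality terms keep the semantic, temporal, and spatial subspaces from collapsing into one another, so the learned relations stay discriminative.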