DOI: 10.1145/3581783.3613805

SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding

Published: 27 October 2023

Abstract

Audio and vision are key senses for high-level cognition, and their strong correlation makes audio-visual coding a crucial component of many multimodal tasks. However, audio-visual coding faces two challenges. First, the heterogeneity of multimodal data often leads to misalignment between the cross-modal features of the same sample, which degrades representation quality. Second, most self-supervised learning frameworks are built on instance-level semantics, and the pseudo labels they generate introduce additional classification noise. To address these challenges, we propose a Supervised Cross-modal Contrastive Learning Framework for Audio-Visual Coding (SCLAV). Our framework includes an audio-visual coding network composed of an inter-modal attention interaction module and an intra-modal self-integration module, which exploit complementary information across modalities and hidden information within each modality for better representations. In addition, we introduce a supervised cross-modal contrastive loss that minimizes the distance between the audio and visual features of the same instance, using the weak labels of the multimodal data to eliminate feature-oriented classification noise. Extensive experiments on the AVE and XD-Violence datasets demonstrate that SCLAV outperforms state-of-the-art methods, even with limited computational resources.
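
The coding network named in the abstract pairs cross-modal attention with within-modality self-attention. The PyTorch sketch below shows one plausible rendering of those two modules; the class names, shapes, and hyperparameters are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: two attention modules in the spirit of SCLAV's
# inter-modal attention interaction and intra-modal self-integration modules.
# Assumed input shape per modality: (batch, time, dim) segment features.
import torch
import torch.nn as nn

class InterModalAttention(nn.Module):
    """Cross-attention: one modality queries the other for complementary cues."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x attends to the other modality; the residual keeps x's own signal.
        out, _ = self.attn(query=x, key=other, value=other)
        return self.norm(x + out)

class IntraModalSelfIntegration(nn.Module):
    """Self-attention within one modality to integrate its hidden information."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)
```

Under these assumptions, audio and visual segment features would be cross-attended in both directions, self-integrated per modality, and temporally pooled into the instance-level embeddings consumed by the loss sketched next.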
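The supervised cross-modal contrastive loss can likewise be sketched from its description: same-instance audio-visual pairs are positives, and weak labels extend the positive set to same-class pairs in the batch, replacing noisy pseudo labels. This is a minimal SupCon-style formulation under assumed shapes and temperature, not the authors' reference implementation.

```python
# Illustrative sketch only: a symmetric supervised cross-modal contrastive loss.
import torch
import torch.nn.functional as F

def sup_cross_modal_contrastive_loss(audio: torch.Tensor,
                                     visual: torch.Tensor,
                                     labels: torch.Tensor,
                                     tau: float = 0.07) -> torch.Tensor:
    """audio, visual: (B, D) instance embeddings; labels: (B,) weak labels."""
    a = F.normalize(audio, dim=1)                  # unit-norm audio features
    v = F.normalize(visual, dim=1)                 # unit-norm visual features
    logits = a @ v.t() / tau                       # (B, B) audio-to-visual sims

    # Positive mask: same instance (diagonal) or same weak label in the batch.
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()

    # SupCon-style objective, averaged over both retrieval directions.
    log_p_av = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_p_va = logits.t() - torch.logsumexp(logits.t(), dim=1, keepdim=True)
    loss_av = -(pos * log_p_av).sum(dim=1) / pos.sum(dim=1)
    loss_va = -(pos * log_p_va).sum(dim=1) / pos.sum(dim=1)
    return 0.5 * (loss_av.mean() + loss_va.mean())
```

Because the diagonal is always a positive, this formulation reduces to plain cross-modal instance discrimination when every sample in the batch carries a distinct label.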

Cited By

  • OpenAVE: Moving towards Open Set Audio-Visual Event Localization. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7503-7512. https://doi.org/10.1145/3664647.3681232
  • Event Traffic Forecasting with Sparse Multimodal Data. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 8855-8864. https://doi.org/10.1145/3664647.3680706

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023, 9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. audio-visual coding
    2. contrastive learning
    3. multi-modal fusion
    4. supervised cross-modal contrastive loss

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
