DOI: 10.1145/3581783.3613805

SCLAV: Supervised Cross-modal Contrastive Learning for Audio-Visual Coding

Published: 27 October 2023

Abstract

Audio and vision are key senses for high-level cognition, and their strong correlation makes audio-visual coding a crucial component of many multimodal tasks. However, audio-visual coding faces two challenges. First, the heterogeneity of multimodal data often leads to misalignment between the cross-modal features of the same sample, which degrades representation quality. Second, most self-supervised learning frameworks are built on instance-level semantics, and the pseudo labels they generate introduce additional classification noise. To address these challenges, we propose a Supervised Cross-modal Contrastive Learning Framework for Audio-Visual Coding (SCLAV). Our framework includes an audio-visual coding network composed of an inter-modal attention interaction module and an intra-modal self-integration module, which exploit complementary information across modalities and hidden information within each modality for better representations. In addition, we introduce a supervised cross-modal contrastive loss that minimizes the distance between the audio and visual features of the same instance, using the weak labels of the multimodal data to eliminate feature-oriented classification noise. Extensive experiments on the AVE and XD-Violence datasets demonstrate that SCLAV outperforms state-of-the-art methods, even with limited computational resources.
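
The coding network named in the abstract pairs cross-modal attention with within-modality self-attention. The PyTorch sketch below shows one plausible rendering of those two modules; the class names, shapes, and hyperparameters are illustrative assumptions, not the paper's exact architecture.

```python
# Illustrative sketch only: two attention modules in the spirit of SCLAV's
# inter-modal attention interaction and intra-modal self-integration modules.
# Assumed input shape per modality: (batch, time, dim) segment features.
import torch
import torch.nn as nn

class InterModalAttention(nn.Module):
    """Cross-attention: one modality queries the other for complementary cues."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor, other: torch.Tensor) -> torch.Tensor:
        # x attends to the other modality; the residual keeps x's own signal.
        out, _ = self.attn(query=x, key=other, value=other)
        return self.norm(x + out)

class IntraModalSelfIntegration(nn.Module):
    """Self-attention within one modality to integrate its hidden information."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out, _ = self.attn(x, x, x)
        return self.norm(x + out)
```

Under these assumptions, audio and visual segment features would be cross-attended in both directions, self-integrated per modality, and temporally pooled into the instance-level embeddings consumed by the loss sketched next.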
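The supervised cross-modal contrastive loss can likewise be sketched from its description: same-instance audio-visual pairs are positives, and weak labels extend the positive set to same-class pairs in the batch, replacing noisy pseudo labels. This is a minimal SupCon-style formulation under assumed shapes and temperature, not the authors' reference implementation.

```python
# Illustrative sketch only: a symmetric supervised cross-modal contrastive loss.
import torch
import torch.nn.functional as F

def sup_cross_modal_contrastive_loss(audio: torch.Tensor,
                                     visual: torch.Tensor,
                                     labels: torch.Tensor,
                                     tau: float = 0.07) -> torch.Tensor:
    """audio, visual: (B, D) instance embeddings; labels: (B,) weak labels."""
    a = F.normalize(audio, dim=1)                  # unit-norm audio features
    v = F.normalize(visual, dim=1)                 # unit-norm visual features
    logits = a @ v.t() / tau                       # (B, B) audio-to-visual sims

    # Positive mask: same instance (diagonal) or same weak label in the batch.
    pos = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()

    # SupCon-style objective, averaged over both retrieval directions.
    log_p_av = logits - torch.logsumexp(logits, dim=1, keepdim=True)
    log_p_va = logits.t() - torch.logsumexp(logits.t(), dim=1, keepdim=True)
    loss_av = -(pos * log_p_av).sum(dim=1) / pos.sum(dim=1)
    loss_va = -(pos * log_p_va).sum(dim=1) / pos.sum(dim=1)
    return 0.5 * (loss_av.mean() + loss_va.mean())
```

Because the diagonal is always a positive, this formulation reduces to plain cross-modal instance discrimination when every sample in the batch carries a distinct label.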

Cited By

  • OpenAVE: Moving towards Open Set Audio-Visual Event Localization. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 7503-7512. https://doi.org/10.1145/3664647.3681232
  • Event Traffic Forecasting with Sparse Multimodal Data. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 8855-8864. https://doi.org/10.1145/3664647.3680706

    Published In

    MM '23: Proceedings of the 31st ACM International Conference on Multimedia
    October 2023, 9913 pages
    ISBN: 9798400701085
    DOI: 10.1145/3581783

    Publisher

    Association for Computing Machinery, New York, NY, United States

    Author Tags

    1. audio-visual coding
    2. contrastive learning
    3. multi-modal fusion
    4. supervised cross-modal contrastive loss

    Qualifiers

    • Research-article

    Conference

    MM '23: The 31st ACM International Conference on Multimedia
    October 29 - November 3, 2023
    Ottawa, ON, Canada

    Acceptance Rates

    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%
