DOI: 10.1145/3539618.3592025

Multi-grained Representation Learning for Cross-modal Retrieval

Published: 18 July 2023

Abstract

Audio-text retrieval aims to learn a cross-modal similarity function between audio and text, so that a given audio clip or text query can retrieve similar text or audio from a candidate set. Recent audio-text retrieval models aggregate multi-modal features into a single-grained representation. However, a single-grained representation struggles when one audio clip is described by multiple texts at different levels of granularity, because the association patterns between audio and text are complex. We therefore propose an adaptive aggregation strategy that automatically finds the optimal pooling function for aggregating features into a comprehensive representation, so that the model learns informative multi-grained representations. We further perform multi-grained contrastive learning to capture the complex correlations between audio and text at different granularities. Meanwhile, text-guided token interaction is used to reduce the influence of redundant audio clips. We evaluate the proposed method on two audio-text retrieval benchmark datasets, AudioCaps and Clotho, and achieve state-of-the-art results in both text-to-audio and audio-to-text retrieval. Our findings emphasize the importance of learning multi-modal, multi-grained representations.
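
To make the aggregation idea concrete, the sketch below shows one way such an adaptive aggregation could be realized in PyTorch: a softmax over learnable logits mixes several candidate pooling functions (mean, max, and generalized-mean pooling) applied to frame-level audio features. This is a minimal illustration under assumed module and tensor names, not the authors' implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AdaptivePooling(nn.Module):
        """Mix candidate pooling functions with learnable weights (illustrative sketch)."""
        def __init__(self, dim: int, p_init: float = 3.0):
            super().__init__()
            self.mix_logits = nn.Parameter(torch.zeros(3))   # one logit per pooling candidate
            self.p = nn.Parameter(torch.tensor(p_init))      # generalized-mean exponent
            self.proj = nn.Linear(dim, dim)

        def forward(self, tokens: torch.Tensor) -> torch.Tensor:
            # tokens: (batch, num_frames, dim) frame-level audio features
            mean_pool = tokens.mean(dim=1)
            max_pool = tokens.max(dim=1).values
            gem_pool = tokens.clamp(min=1e-6).pow(self.p).mean(dim=1).pow(1.0 / self.p)
            weights = F.softmax(self.mix_logits, dim=0)      # convex combination of candidates
            pooled = weights[0] * mean_pool + weights[1] * max_pool + weights[2] * gem_pool
            return self.proj(pooled)                         # (batch, dim) clip-level embedding

The text-guided token interaction can likewise be sketched as a single cross-attention step in which the sentence embedding queries the audio frame tokens, so that frames unrelated to the query contribute little to the pooled audio vector; again, the names and shapes here are assumptions for illustration, not the paper's exact module.

    class TextGuidedPooling(nn.Module):
        """Pool audio frames with attention weights conditioned on the text query (illustrative sketch)."""
        def __init__(self, dim: int):
            super().__init__()
            self.q = nn.Linear(dim, dim)   # text embedding -> query
            self.k = nn.Linear(dim, dim)   # audio frames  -> keys
            self.v = nn.Linear(dim, dim)   # audio frames  -> values
            self.scale = dim ** -0.5

        def forward(self, text_emb: torch.Tensor, audio_tokens: torch.Tensor) -> torch.Tensor:
            # text_emb: (batch, dim); audio_tokens: (batch, num_frames, dim)
            q = self.q(text_emb).unsqueeze(1)                              # (B, 1, D)
            k, v = self.k(audio_tokens), self.v(audio_tokens)              # (B, T, D)
            attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)   # (B, 1, T)
            return (attn @ v).squeeze(1)                                   # (B, D) text-conditioned audio embedding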

Supplemental Material

MP4 File
Multi-grained Representation Learning for Cross-modal Retrieval

Information

Published In

SIGIR '23: Proceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval
July 2023
3567 pages
ISBN: 9781450394086
DOI: 10.1145/3539618
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. audio-text retrieval
  2. multi-grained representation
  3. multi-modal

Qualifiers

  • Short-paper

Conference

SIGIR '23

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 126
  • Downloads (last 6 weeks): 12
Reflects downloads up to 05 Mar 2025

Cited By

  • (2025) Audio meets text: a loss-enhanced journey with manifold mixup and re-ranking. Knowledge and Information Systems, 67(3), 2195-2231. https://doi.org/10.1007/s10115-024-02283-4. Online publication date: 1 March 2025.
  • (2024) Modal-Enhanced Semantic Modeling for Fine-Grained 3D Human Motion Retrieval. Proceedings of the 32nd ACM International Conference on Multimedia, 10114-10123. https://doi.org/10.1145/3664647.3681625. Online publication date: 28 October 2024.
  • (2024) Multi-grained Correspondence Learning of Audio-language Models for Few-shot Audio Recognition. Proceedings of the 32nd ACM International Conference on Multimedia, 9244-9252. https://doi.org/10.1145/3664647.3681389. Online publication date: 28 October 2024.
  • (2024) MSKR: Advancing Multi-modal Structured Knowledge Representation with Synergistic Hard Negative Samples. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 3207-3216. https://doi.org/10.1145/3627673.3679680. Online publication date: 21 October 2024.
  • (2024) Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE, 112(11), 1716-1754. https://doi.org/10.1109/JPROC.2024.3525147. Online publication date: November 2024.
  • (2024) Multiscale Matching Driven by Cross-Modal Similarity Consistency for Audio-Text Retrieval. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 11581-11585. https://doi.org/10.1109/ICASSP48485.2024.10446302. Online publication date: 14 April 2024.
  • (2024) Bridging the gap: multi-granularity representation learning for text-based vehicle retrieval. Complex & Intelligent Systems, 11(1). https://doi.org/10.1007/s40747-024-01614-w. Online publication date: 13 November 2024.
  • (2023) CenterDA: Center-Aware Unsupervised Domain Adaptation Regularized by Class Diversity for Distracted Driver Recognition. 2023 IEEE 26th International Conference on Intelligent Transportation Systems (ITSC), 1092-1097. https://doi.org/10.1109/ITSC57777.2023.10422204. Online publication date: 24 September 2023.
