DOI: 10.1145/3581783.3612411

Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval

Published: 27 October 2023

Abstract

With its powerful representation ability and high retrieval efficiency, deep cross-modal hashing (DCMH) has become an emerging fast similarity search technique. Prior studies primarily focus on exploring pairwise similarities across modalities, but fail to comprehensively capture the multi-grained semantic correlations during intra- and inter-modal interaction. To tackle this issue, this paper proposes a novel Multi-granularity Interactive Transformer Hashing (MITH) network, which hierarchically considers both coarse- and fine-grained similarity measurements across different modalities in one unified transformer-based framework. To the best of our knowledge, this is the first attempt at multi-granularity transformer-based cross-modal hashing. Specifically, a well-designed distilled intra-modal interaction module excavates modality-specific concept knowledge through global-local knowledge distillation, guided by implicit conceptual category-level representations. Moreover, we construct a contrastive inter-modal alignment module that mines modality-independent semantic concept correspondences with instance- and token-wise contrastive learning, respectively. This collaborative learning paradigm jointly alleviates the heterogeneity and semantic gaps among different modalities from a multi-granularity perspective, yielding discriminative modality-invariant hash codes. Extensive experiments on multiple representative cross-modal datasets demonstrate the consistent superiority of MITH over existing state-of-the-art baselines. The code is available at https://github.com/DarrenZZhang/MITH.
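For readers who want a concrete picture of the generic building blocks the abstract names, the sketch below shows a symmetric instance-wise contrastive (InfoNCE) objective over paired image/text embeddings, followed by sign-based binarization into hash codes. This is a minimal illustration of the standard technique, not MITH's exact loss; all function names and the temperature value are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def instance_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired image/text embeddings.

        img_emb, txt_emb: (B, D) continuous hash-layer outputs for B
        image-text pairs; matched pairs share the same row index.
        NOTE: an illustrative stand-in, not the paper's exact objective.
        """
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature          # (B, B) similarity matrix
        targets = torch.arange(img.size(0), device=img.device)
        # Pull each image toward its paired text and vice versa,
        # pushing it away from all other in-batch samples.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

    def binarize(continuous_codes):
        # At retrieval time, continuous outputs are quantized to +/-1 hash codes.
        return torch.sign(continuous_codes)

Per the abstract, the token-wise variant applies the same contrastive idea at the level of local transformer tokens rather than whole instances.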

Supplemental Material

MP4 File
This presentation video accompanies the paper "Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval," accepted at ACM MM 2023. The presentation first gives an overview of cross-modal hashing and the motivation for MITH, a network that hierarchically considers both coarse- and fine-grained similarity measurements across modalities in one unified transformer-based framework. It then elaborates on the method, covering feature extraction, distilled intra-modal interaction, contrastive inter-modal alignment, and cross-modal hash learning, and closes with extensive experiments on multiple benchmark datasets that demonstrate the method's superiority. The code and data are publicly available at https://github.com/DarrenZZhang/MITH.
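As background on why binary codes make cross-modal search fast: once images and texts are mapped into a shared Hamming space, retrieval reduces to Hamming-distance ranking, as the sketch below illustrates with unpacked +/-1 codes. In practice codes are bit-packed and compared with XOR/popcount; this is generic background, not code from the paper.

    import numpy as np

    def hamming_rank(query_code, db_codes):
        """Rank database items by Hamming distance to a query.

        query_code: (K,) array of +/-1 bits (e.g., a text query's hash code).
        db_codes:   (N, K) array of +/-1 bits (e.g., image hash codes).
        """
        # For +/-1 codes, Hamming distance = (K - dot_product) / 2.
        K = query_code.shape[0]
        dists = (K - db_codes @ query_code) / 2
        return np.argsort(dists)  # indices of nearest database items first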




Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. contrastive learning
  2. cross-modal hashing
  3. cross-modal retrieval
  4. knowledge distillation
  5. multi-granularity
  6. transformer

Qualifiers

  • Research-article

Funding Sources

  • Guangdong Natural Science Foundation
  • National Key Research and Development Program of China
  • Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
  • Shenzhen Science and Technology Program
  • National Natural Science Foundation of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 404
  • Downloads (last 6 weeks): 26

Reflects downloads up to 05 Mar 2025

Cited By

  • Enhancing cross-modal retrieval via visual-textual prompt hashing. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (2024), 623-631. DOI: 10.24963/ijcai.2024/69
  • Privacy-Enhanced Prototype-Based Federated Cross-Modal Hashing for Cross-Modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 9 (2024), 1-19. DOI: 10.1145/3674507
  • Stay Focused is All You Need for Adversarial Robustness. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 6482-6491. DOI: 10.1145/3664647.3681676
  • FedCAFE: Federated Cross-Modal Hashing with Adaptive Feature Enhancement. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 9670-9679. DOI: 10.1145/3664647.3681319
  • Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 5260-5269. DOI: 10.1145/3664647.3680731
  • Contrastive Multi-Bit Collaborative Learning for Deep Cross-Modal Hashing. IEEE Transactions on Knowledge and Data Engineering 36, 11 (2024), 5835-5848. DOI: 10.1109/TKDE.2024.3419577
  • Deep Hierarchy-Aware Proxy Hashing With Self-Paced Learning for Cross-Modal Retrieval. IEEE Transactions on Knowledge and Data Engineering 36, 11 (2024), 5926-5939. DOI: 10.1109/TKDE.2024.3401050
  • Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE 112, 11 (2024), 1716-1754. DOI: 10.1109/JPROC.2024.3525147
  • A Multi-View Double Alignment Hashing Network with Weighted Contrastive Learning. In 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687739
  • CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access 12 (2024), 33283-33295. DOI: 10.1109/ACCESS.2024.3370158
