research-article

Bit-aware Semantic Transformer Hashing for Multi-modal Retrieval

Authors:

Zhiyong Cheng

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

Pages 982 - 991

https://doi.org/10.1145/3477495.3531947

Published: 07 July 2022

Abstract

Multi-modal hashing learns binary hash codes that offer extremely low storage cost and high retrieval speed, and therefore supports efficient multi-modal retrieval. However, most existing methods still suffer from three important problems: 1) shallow learning limits their semantic representation capability; 2) mandatory feature-level multi-modal fusion ignores the heterogeneous semantic gaps between modalities; and 3) direct, coarse pairwise semantic preservation cannot effectively capture fine-grained semantic correlations. To solve these problems, we propose a Bit-aware Semantic Transformer Hashing (BSTH) framework that excavates bit-wise semantic concepts and simultaneously aligns the heterogeneous modalities for multi-modal hash learning at the concept level. Specifically, the bit-wise implicit semantic concepts are learned with a transformer in a self-attention manner, which achieves implicit semantic alignment at the fine-grained concept level and reduces the heterogeneous modality gaps. Concept-level multi-modal fusion is then performed to enhance the semantic representation capability of each implicit concept, and the fused concept representations are encoded into the corresponding hash bits via bit-wise hash functions. Further, to supervise the bit-aware transformer module, a label prototype learning module is developed to learn prototype embeddings for all categories; these embeddings capture the explicit category-level semantic correlations by considering co-occurrence priors. Experiments on three widely tested multi-modal retrieval datasets demonstrate the superiority of the proposed method from various aspects.
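To make the storage and speed argument concrete, the following is a minimal sketch of generic hash-based retrieval, not the BSTH model itself. It assumes that real-valued representations are binarized by thresholding at zero (a common convention, not confirmed by the abstract) and that items are ranked by Hamming distance. The function names `binarize` and `hamming_rank` and the toy 8-bit features are hypothetical and chosen only for illustration.

```python
import numpy as np


def binarize(features: np.ndarray) -> np.ndarray:
    """Threshold real-valued features at zero to obtain {0, 1} hash bits.

    Assumption: sign-style thresholding; the paper's bit-wise hash
    functions may differ.
    """
    return (features > 0).astype(np.uint8)


def hamming_rank(query_code: np.ndarray, db_codes: np.ndarray):
    """Rank database items by Hamming distance to the query code."""
    # Hamming distance = number of bit positions that differ.
    distances = np.count_nonzero(db_codes != query_code, axis=1)
    # A stable sort keeps the original order among equally distant items.
    order = np.argsort(distances, kind="stable")
    return order, distances


# Hypothetical fused representations for three database items (8 bits each).
db_features = np.array([
    [0.9, -0.2, 0.4, -0.7, 0.1, -0.3, 0.8, -0.5],
    [-0.6, 0.3, -0.1, 0.5, -0.9, 0.2, -0.4, 0.7],
    [0.8, -0.1, 0.3, -0.6, -0.2, -0.4, 0.9, -0.3],
])
query_features = np.array([0.7, -0.3, 0.5, -0.8, 0.2, -0.1, 0.6, -0.4])

db_codes = binarize(db_features)
query_code = binarize(query_features)
order, distances = hamming_rank(query_code, db_codes)

# Packing 8 bits into one byte illustrates the low storage cost.
packed = np.packbits(db_codes, axis=1)
```

In this toy setting each item occupies a single byte after packing, and ranking needs only bit comparisons, which is the property the abstract refers to as low storage cost and high retrieval speed.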

Supplementary Material

MP4 File (SIGIR22-fp0683.mp4)

Presentation video.


Index Terms

Bit-aware Semantic Transformer Hashing for Multi-modal Retrieval
1. Information systems
  1. Information retrieval
    1. Specialized information retrieval
      1. Multimedia and multimodal retrieval

Information & Contributors

Information

Published In

SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 2022

3569 pages

ISBN:9781450387323

DOI:10.1145/3477495

General Chairs: Enrique Amigo (UNED), Pablo Castells (UAM and Amazon), Julio Gonzalo (UNED)

Program Chairs: Ben Carterette (Spotify), J. Shane Culpepper (RMIT University), Gabriella Kazai (Waseda University)

Copyright © 2022 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Sponsors

SIGIR: ACM Special Interest Group on Information Retrieval

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Funding Sources

National Natural Science Foundation of China
Natural Science Foundation of Shandong, China
Youth Innovation Project of Shandong Universities, China
Major Fundamental Research Project of Shandong, China

Conference

SIGIR '22

Sponsor:

SIGIR

SIGIR '22: The 45th International ACM SIGIR Conference on Research and Development in Information Retrieval

July 11 - 15, 2022

Madrid, Spain

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
