DOI: 10.1145/3477495.3531947

Bit-aware Semantic Transformer Hashing for Multi-modal Retrieval

Published: 07 July 2022

Abstract

Multi-modal hashing learns binary hash codes that have very low storage cost and support fast retrieval, which makes it well suited to efficient multi-modal retrieval. However, most existing methods still suffer from three problems: 1) shallow learning limits their semantic representation capability; 2) mandatory feature-level multi-modal fusion ignores the semantic gaps between heterogeneous modalities; 3) direct, coarse pairwise semantic preservation cannot effectively capture fine-grained semantic correlations. To address these problems, we propose a Bit-aware Semantic Transformer Hashing (BSTH) framework that excavates bit-wise semantic concepts and simultaneously aligns the heterogeneous modalities, so that multi-modal hash learning is performed at the concept level. Specifically, the bit-wise implicit semantic concepts are learned by a transformer through self-attention, which achieves implicit semantic alignment at the fine-grained concept level and reduces the heterogeneous modality gaps. Concept-level multi-modal fusion is then performed to enhance the semantic representation capability of each implicit concept, and the fused concept representations are encoded into the corresponding hash bits by bit-wise hash functions. Further, to supervise the bit-aware transformer module, a label prototype learning module learns prototype embeddings for all categories; these embeddings capture explicit category-level semantic correlations by taking co-occurrence priors into account. Experiments on three widely used multi-modal retrieval datasets demonstrate the superiority of the proposed method from various aspects.
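
The pipeline described above (one concept representation per hash bit, fusion across modalities at the concept level, one hash function per bit, then retrieval over binary codes) can be illustrated with a small sketch. This is not the authors' implementation. It assumes mean fusion across modalities and a linear hash function per bit followed by a sign test; the concept vectors are hard-coded toy values standing in for transformer outputs, and all names (`fuse_concepts`, `bitwise_hash`, `hamming`, `rank_by_hamming`) are hypothetical.

```python
from typing import List, Sequence

Vector = List[float]


def fuse_concepts(per_modality: Sequence[Sequence[Vector]]) -> List[Vector]:
    """Fuse the k-th concept vector across modalities.

    Assumption: plain averaging; the abstract does not specify the fusion rule.
    per_modality[m][k] is the k-th concept vector of modality m.
    """
    n_modalities = len(per_modality)
    n_bits = len(per_modality[0])
    fused = []
    for k in range(n_bits):
        dim = len(per_modality[0][k])
        fused.append([
            sum(per_modality[m][k][d] for m in range(n_modalities)) / n_modalities
            for d in range(dim)
        ])
    return fused


def bitwise_hash(fused: Sequence[Vector], weights: Sequence[Vector]) -> int:
    """Map each fused concept to one bit with its own linear function (assumed form).

    The sign of each score gives one bit; bits are packed into a Python int,
    with the first concept as the most significant bit.
    """
    code = 0
    for concept, w in zip(fused, weights):
        score = sum(c * wi for c, wi in zip(concept, w))
        code = (code << 1) | (1 if score >= 0 else 0)
    return code


def hamming(a: int, b: int) -> int:
    """Hamming distance between two packed codes: popcount of the XOR."""
    return bin(a ^ b).count("1")


def rank_by_hamming(query: int, database: Sequence[int]) -> List[int]:
    """Return database indices sorted by Hamming distance to the query (stable)."""
    return sorted(range(len(database)), key=lambda i: hamming(query, database[i]))


# Toy input: 2 modalities, 4 concepts (one per hash bit), 2-dimensional vectors.
image_concepts = [[1.0, 0.0], [0.0, -1.0], [0.5, 0.5], [-1.0, 1.0]]
text_concepts = [[0.5, 0.5], [0.0, -0.5], [-0.2, 0.1], [-0.5, 0.5]]
# One hypothetical weight vector per bit (these would be learned in practice).
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]

query_code = bitwise_hash(fuse_concepts([image_concepts, text_concepts]), weights)
database_codes = [0b1010, 0b0101, 0b1000, 0b1110]
ranking = rank_by_hamming(query_code, database_codes)
print(format(query_code, "04b"), ranking)
```

Packing the bits into an integer is what keeps storage small and comparison fast: each distance computation is a single XOR followed by a bit count, regardless of how the codes were learned.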

Supplementary Material

MP4 File (SIGIR22-fp0683.mp4)
Presentation video.

    Published In

    SIGIR '22: Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval
    July 2022
    3569 pages
    ISBN: 9781450387323
    DOI: 10.1145/3477495
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from ACM.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. concept-aware
    2. fine-grained semantic
    3. hashing technology
    4. multi-modal retrieval
    5. transformer

    Qualifiers

    • Research-article

    Conference

    SIGIR '22

    Acceptance Rates

    Overall Acceptance Rate 792 of 3,983 submissions, 20%

    Article Metrics

    • Downloads (last 12 months): 126
    • Downloads (last 6 weeks): 8
    Reflects downloads up to 28 Feb 2025

    Cited By

    • (2025) CLIP Multi-modal Hashing for Multimedia Retrieval. MultiMedia Modeling, 195-205. DOI: 10.1007/978-981-96-2054-8_15. Online publication date: 3-Jan-2025
    • (2024) Hashing-Based Multi-Modal Semantic Communication. 2024 IEEE Wireless Communications and Networking Conference (WCNC), 1-6. DOI: 10.1109/WCNC57260.2024.10570632. Online publication date: 21-Apr-2024
    • (2024) Similarity Transitivity Broken-Aware Multi-Modal Hashing. IEEE Transactions on Knowledge and Data Engineering 36:11, 7003-7014. DOI: 10.1109/TKDE.2024.3396492. Online publication date: Nov-2024
    • (2024) Cross-Domain Transfer Hashing for Efficient Cross-Modal Retrieval. IEEE Transactions on Circuits and Systems for Video Technology 34:10, 9664-9677. DOI: 10.1109/TCSVT.2024.3374791. Online publication date: Oct-2024
    • (2024) Boosted Curriculum Multi-View Hashing for Multimedia Retrieval. IEEE Signal Processing Letters 31, 2065-2069. DOI: 10.1109/LSP.2024.3440968. Online publication date: 2024
    • (2024) A Multi-View Double Alignment Hashing Network with Weighted Contrastive Learning. 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687739. Online publication date: 15-Jul-2024
    • (2024) Adaptive Loss-aware Modulation for Multimedia Retrieval. 2024 IEEE International Conference on Data Mining (ICDM), 649-658. DOI: 10.1109/ICDM59182.2024.00072. Online publication date: 9-Dec-2024
    • (2024) Adaptive Confidence Multi-View Hashing for Multimedia Retrieval. ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 7900-7904. DOI: 10.1109/ICASSP48485.2024.10447517. Online publication date: 14-Apr-2024
    • (2024) Supervised Semantic-Embedded Hashing for Multimedia Retrieval. Knowledge-Based Systems 299, 112023. DOI: 10.1016/j.knosys.2024.112023. Online publication date: Sep-2024
    • (2024) Fast metric multi-view hashing for multimedia retrieval. Information Fusion 103:C. DOI: 10.1016/j.inffus.2023.102130. Online publication date: 4-Mar-2024
