DOI: 10.1145/3581783.3612411

Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval

Published: 27 October 2023

Abstract

With its powerful representation ability and high retrieval efficiency, deep cross-modal hashing (DCMH) has become an emerging fast similarity search technique. Prior studies primarily focus on exploring pairwise similarities across modalities, but fail to comprehensively capture the multi-grained semantic correlations during intra- and inter-modal interaction. To tackle this issue, this paper proposes a novel Multi-granularity Interactive Transformer Hashing (MITH) network, which hierarchically considers both coarse- and fine-grained similarity measurements across different modalities in one unified transformer-based framework. To the best of our knowledge, this is the first attempt at multi-granularity transformer-based cross-modal hashing. Specifically, a well-designed distilled intra-modal interaction module excavates modality-specific concept knowledge through global-local knowledge distillation, guided by implicit conceptual category-level representations. Moreover, we construct a contrastive inter-modal alignment module that mines modality-independent semantic concept correspondences with instance- and token-wise contrastive learning, respectively. This collaborative learning paradigm jointly alleviates the heterogeneity and semantic gaps among different modalities from a multi-granularity perspective, yielding discriminative modality-invariant hash codes. Extensive experiments on multiple representative cross-modal datasets demonstrate the consistent superiority of MITH over existing state-of-the-art baselines. The code is available at https://github.com/DarrenZZhang/MITH.
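For readers who want a concrete picture of the generic building blocks the abstract names, the sketch below shows a symmetric instance-wise contrastive (InfoNCE) objective over paired image/text embeddings, followed by sign-based binarization into hash codes. This is a minimal illustration of the standard technique, not MITH's exact loss; all function names and the temperature value are assumptions for illustration.

    import torch
    import torch.nn.functional as F

    def instance_contrastive_loss(img_emb, txt_emb, temperature=0.07):
        """Symmetric InfoNCE over a batch of paired image/text embeddings.

        img_emb, txt_emb: (B, D) continuous hash-layer outputs for B
        image-text pairs; matched pairs share the same row index.
        NOTE: an illustrative stand-in, not the paper's exact objective.
        """
        img = F.normalize(img_emb, dim=-1)
        txt = F.normalize(txt_emb, dim=-1)
        logits = img @ txt.t() / temperature          # (B, B) similarity matrix
        targets = torch.arange(img.size(0), device=img.device)
        # Pull each image toward its paired text and vice versa,
        # pushing it away from all other in-batch samples.
        loss_i2t = F.cross_entropy(logits, targets)
        loss_t2i = F.cross_entropy(logits.t(), targets)
        return (loss_i2t + loss_t2i) / 2

    def binarize(continuous_codes):
        # At retrieval time, continuous outputs are quantized to +/-1 hash codes.
        return torch.sign(continuous_codes)

Per the abstract, the token-wise variant applies the same contrastive idea at the level of local transformer tokens rather than whole instances.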

Supplemental Material

MP4 File
This presentation video accompanies the paper "Multi-Granularity Interactive Transformer Hashing for Cross-modal Retrieval," accepted at ACM MM 2023. The presentation first gives an overview of cross-modal hashing and the motivation for MITH, a network that hierarchically considers both coarse- and fine-grained similarity measurements across modalities in one unified transformer-based framework. It then elaborates on the method, covering feature extraction, distilled intra-modal interaction, contrastive inter-modal alignment, and cross-modal hash learning, and closes with extensive experiments on multiple benchmark datasets that demonstrate the method's superiority. The code and data are publicly available at https://github.com/DarrenZZhang/MITH.
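As background on why binary codes make cross-modal search fast: once images and texts are mapped into a shared Hamming space, retrieval reduces to Hamming-distance ranking, as the sketch below illustrates with unpacked +/-1 codes. In practice codes are bit-packed and compared with XOR/popcount; this is generic background, not code from the paper.

    import numpy as np

    def hamming_rank(query_code, db_codes):
        """Rank database items by Hamming distance to a query.

        query_code: (K,) array of +/-1 bits (e.g., a text query's hash code).
        db_codes:   (N, K) array of +/-1 bits (e.g., image hash codes).
        """
        # For +/-1 codes, Hamming distance = (K - dot_product) / 2.
        K = query_code.shape[0]
        dists = (K - db_codes @ query_code) / 2
        return np.argsort(dists)  # indices of nearest database items first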




Published In

MM '23: Proceedings of the 31st ACM International Conference on Multimedia
October 2023
9913 pages
ISBN:9798400701085
DOI:10.1145/3581783
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. contrastive learning
  2. cross-modal hashing
  3. cross-modal retrieval
  4. knowledge distillation
  5. multi-granularity
  6. transformer

Qualifiers

  • Research-article

Funding Sources

  • Guangdong Natural Science Foundation
  • National Key Research and Development Program of China
  • Guangdong Provincial Key Laboratory of Novel Security Intelligence Technologies
  • Shenzhen Science and Technology Program
  • National Natural Science Foundation of China

Conference

MM '23: The 31st ACM International Conference on Multimedia
October 29 - November 3, 2023
Ottawa, ON, Canada

Acceptance Rates

Overall acceptance rate: 2,145 of 8,556 submissions (25%)


Bibliometrics & Citations

Article Metrics

  • Downloads (last 12 months): 404
  • Downloads (last 6 weeks): 26

Reflects downloads up to 05 Mar 2025

Cited By

  • Enhancing cross-modal retrieval via visual-textual prompt hashing. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence (2024), 623-631. DOI: 10.24963/ijcai.2024/69
  • Privacy-Enhanced Prototype-Based Federated Cross-Modal Hashing for Cross-Modal Retrieval. ACM Transactions on Multimedia Computing, Communications, and Applications 20, 9 (2024), 1-19. DOI: 10.1145/3674507
  • Stay Focused is All You Need for Adversarial Robustness. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 6482-6491. DOI: 10.1145/3664647.3681676
  • FedCAFE: Federated Cross-Modal Hashing with Adaptive Feature Enhancement. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 9670-9679. DOI: 10.1145/3664647.3681319
  • Reversed in Time: A Novel Temporal-Emphasized Benchmark for Cross-Modal Video-Text Retrieval. In Proceedings of the 32nd ACM International Conference on Multimedia (2024), 5260-5269. DOI: 10.1145/3664647.3680731
  • Contrastive Multi-Bit Collaborative Learning for Deep Cross-Modal Hashing. IEEE Transactions on Knowledge and Data Engineering 36, 11 (2024), 5835-5848. DOI: 10.1109/TKDE.2024.3419577
  • Deep Hierarchy-Aware Proxy Hashing With Self-Paced Learning for Cross-Modal Retrieval. IEEE Transactions on Knowledge and Data Engineering 36, 11 (2024), 5926-5939. DOI: 10.1109/TKDE.2024.3401050
  • Cross-Modal Retrieval: A Systematic Review of Methods and Future Directions. Proceedings of the IEEE 112, 11 (2024), 1716-1754. DOI: 10.1109/JPROC.2024.3525147
  • A Multi-View Double Alignment Hashing Network with Weighted Contrastive Learning. In 2024 IEEE International Conference on Multimedia and Expo (ICME), 1-6. DOI: 10.1109/ICME57554.2024.10687739
  • CREAMY: Cross-Modal Recipe Retrieval By Avoiding Matching Imperfectly. IEEE Access 12 (2024), 33283-33295. DOI: 10.1109/ACCESS.2024.3370158
