DOI: 10.1145/3664647.3681606
Research article, MM '24 Conference Proceedings

Similarity Preserving Transformer Cross-Modal Hashing for Video-Text Retrieval

Published: 28 October 2024

Abstract

As social networks grow exponentially, the demand for video retrieval using natural language is increasing. Cross-modal hashing, which encodes multi-modal data into compact hash codes, has been widely used in large-scale image-text retrieval, primarily for its computational and storage efficiency. When applied to video-text retrieval, existing unsupervised cross-modal hashing methods extract frame- or word-level features individually and thus ignore long-term dependencies. In addition, effectively exploiting the multi-modal structure is a considerable challenge owing to the complex nature of video and text. To address these issues, we propose Similarity Preserving Transformer Cross-Modal Hashing (SPTCH), a new unsupervised deep cross-modal hashing method for video-text retrieval. SPTCH encodes video and text with a bidirectional transformer encoder that exploits their long-term dependencies. SPTCH constructs a multi-modal collaborative graph to model correlations among multi-modal data and applies semantic aggregation by employing a Graph Convolutional Network (GCN) on this graph. SPTCH designs an unsupervised multi-modal contrastive loss and a neighborhood reconstruction loss to effectively leverage the inter- and intra-modal similarity structure among videos and texts. Empirical results on three video benchmark datasets show that the proposed SPTCH generally outperforms state-of-the-art methods in video-text retrieval.
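The building blocks named in the abstract (GCN-based semantic aggregation on a graph, a multi-modal contrastive loss, and sign-based hash codes) can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the function names, tanh nonlinearity, temperature value, and dimensions are illustrative assumptions.

```python
import numpy as np

def gcn_layer(adj, feats, weight):
    # One graph convolution with symmetric normalization:
    # tanh( D^{-1/2} (A + I) D^{-1/2} X W )
    a_hat = adj + np.eye(adj.shape[0])
    d_inv_sqrt = 1.0 / np.sqrt(a_hat.sum(axis=1))
    norm = a_hat * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]
    return np.tanh(norm @ feats @ weight)

def contrastive_loss(video_emb, text_emb, temperature=0.1):
    # InfoNCE-style loss: each matched video-text pair is a positive,
    # all other pairs in the batch act as negatives.
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def binarize(embeddings):
    # Hash codes via the sign function (+1 / -1).
    return np.where(embeddings >= 0, 1.0, -1.0)
```

In the sketch, the loss shrinks as matched video and text embeddings align and grows as they diverge, which is the inter-modal similarity-preserving behavior the abstract describes.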

Supplemental Material

MP4 File: presentation video for ACM MM 2024



    Published In

    MM '24: Proceedings of the 32nd ACM International Conference on Multimedia
    October 2024
    11719 pages
    ISBN:9798400706868
    DOI:10.1145/3664647

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. contrastive learning
    2. hashing
    3. video-text retrieval


    Conference

MM '24: The 32nd ACM International Conference on Multimedia
October 28 - November 1, 2024
Melbourne VIC, Australia

    Acceptance Rates

    MM '24 Paper Acceptance Rate 1,150 of 4,385 submissions, 26%;
    Overall Acceptance Rate 2,145 of 8,556 submissions, 25%

