Multimedia retrieval by deep hashing with multilevel similarity learning
Introduction
With the pervasiveness of digital devices such as mobile phones, digital cameras and computers, recent years have witnessed a rapid growth in user-generated multimedia content including videos, images and texts. For example, Facebook users share posts with relevant pictures every day, and videos on YouTube often carry user-provided tags. With the explosive growth of multimedia data on the Internet, it is imperative to develop efficient techniques that support effective similarity search over large-scale multimedia data.
Due to its storage and computation efficiency, hashing has been widely studied for large-scale multimedia retrieval [1], [2], [3]. The core idea of hashing is to transform high-dimensional data into compact binary hash codes while preserving the data structure of the original feature space. In the compact Hamming space, the computation cost can be greatly reduced by calculating the Hamming distance with XOR bit operations. The most well-known data-independent hashing algorithm is Locality Sensitive Hashing (LSH) [4], which generates binary hash codes with random linear projections. However, LSH requires long hash codes to boost retrieval performance, which inevitably limits its practical application. To address the limitation of random linear projections, learning-based hashing methods have been proposed to learn data-dependent hash functions, which generate more discriminative hash codes and achieve superior performance. Representative methods include Kernel-based Supervised Hashing (KSH) [5], Supervised Discrete Hashing (SDH) [6], Deep Hashing Network (DHN) [7], Deep Hashing (DH) [8] and Deep Pairwise-Supervised Hashing (DPSH) [7].
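To illustrate why search in the Hamming space is so cheap, the following minimal Python sketch (not taken from the paper) computes the Hamming distance between two binary codes packed into integers using a single XOR and a population count; the 8-bit codes are hypothetical:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Number of differing bits between two hash codes packed into Python ints."""
    return bin(code_a ^ code_b).count("1")  # XOR flags differing bits; popcount sums them

# Two hypothetical 8-bit hash codes
print(hamming_distance(0b10110010, 0b10011010))  # -> 2
```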
Recently, numerous cross-modal hashing methods have been proposed, which aim to learn two sets of modality-specific hash functions by encoding the underlying correlations of multiple modalities. However, most existing cross-modal hashing methods learn binary hash codes based on linear projections; that is, they cannot capture the nonlinearity of multimedia data. Besides, hand-crafted features (e.g. GIST [9], SIFT [10] and BoW [11]) are used to learn binary hash codes in those cross-modal hashing methods. With each instance represented by hand-crafted features, the hash functions cannot well preserve the complex correlations embedded in the high-level semantic structure of multimedia data. Recent research has shown that deep learning has great potential for learning powerful feature representations and discriminative semantic information from multimedia data [12], [13]. Typically, Convolutional Neural Networks (CNNs) [14], [15], [16] achieve state-of-the-art performance in many computer vision tasks such as image classification [12], [17], [18], image captioning [19], [20], multi-view learning [21], [22], [23], [24], [25] and image retrieval [26], [27], [28], [29], [30]. Although CNNs have been successfully applied to single-modal retrieval such as image retrieval, their application to cross-modal retrieval has not been fully studied.
Inspired by the impressive feature representation ability of CNNs, we propose to take advantage of CNNs for nonlinear hash function learning as well as feature representation learning. In addition, since instances belonging to the same category exhibit huge visual and semantic variations, it is inaccurate to define the similarity relationships (such as similar and dissimilar) purely from the semantic label information. Different from most existing cross-modal hashing methods that treat the similarity relationship as one of two cases (i.e. similar or dissimilar), we propose to exploit multilevel similarities (e.g. rather similar, less similar, dissimilar and very dissimilar) to characterize the similarity relationships. In this work, we propose a novel deep supervised cross-modal hashing method which can be decomposed into binary hash code learning and deep hash function learning. The framework of the proposed method is illustrated in Fig. 1. In the binary code learning step, we first construct the multilevel similarity correlation by jointly exploring the local structure and the semantic label information. Meanwhile, the unified binary hash codes are learned by preserving this multilevel similarity correlation while also enforcing the bit balance and quantization error properties. In the hash function learning step, our goal is to simultaneously learn the feature representations and two sets of nonlinear hash functions with deep neural networks. Specifically, a hash layer and a classification layer are both added on top of the deep neural networks, and a well-designed loss is introduced to minimize the prediction errors of the feature representations as well as the errors between the unified binary codes and the outputs of the hash layer. We evaluate the proposed method on two widely-used multimodal datasets, and the experimental results demonstrate the superiority of the proposed method.
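The paper's exact network architectures, layer sizes and loss weights are not given in this excerpt. The PyTorch sketch below only illustrates the general idea of one modality-specific branch with a hash layer and a classification layer on top of backbone features; the class name, dimensions (feat_dim, code_len, num_classes) and the trade-off weight lambda_cls are all assumptions, not the authors' settings.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HashBranch(nn.Module):
    """One modality-specific branch: backbone features -> hash layer -> classification layer.
    Layer sizes are placeholders, not the architecture used in the paper."""
    def __init__(self, feat_dim=4096, code_len=32, num_classes=24):
        super().__init__()
        self.hash_layer = nn.Linear(feat_dim, code_len)      # relaxed (real-valued) hash codes
        self.classifier = nn.Linear(code_len, num_classes)   # predicts labels from the codes

    def forward(self, features):
        h = torch.tanh(self.hash_layer(features))            # tanh keeps outputs close to {-1, +1}
        logits = self.classifier(h)
        return h, logits

# Illustrative loss: pull the hash-layer outputs toward the pre-learned unified binary codes B
# and penalize label prediction errors (lambda_cls is a hypothetical trade-off weight).
def branch_loss(h, logits, B, labels, lambda_cls=1.0):
    code_loss = F.mse_loss(h, B)                                   # agreement with unified codes
    cls_loss = F.binary_cross_entropy_with_logits(logits, labels)  # multi-label prediction error
    return code_loss + lambda_cls * cls_loss
```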
The main contributions of the proposed method are summarized as follows:
- We propose a novel deep cross-modal hashing method for multimedia retrieval, which learns discriminative and compact binary hash codes with deep neural networks by exploring the multilevel semantic similarity correlations of multimedia data. Specifically, the multilevel semantic similarity is learned by exploiting the local structure and the semantic label information simultaneously.
- The proposed method is the first attempt to incorporate deep feature representation learning, hash function learning and multilevel semantic similarity learning into one unified framework.
- We evaluate the proposed deep multimodal hashing method on two real-world multimodal datasets. The experimental results on cross-modal retrieval tasks demonstrate the superiority of the proposed method.
This work is based on our preliminary work published in [31]. Compared to the original work, the major extensions include: (1) the Abstract, Introduction and Related Work are refined; (2) the optimization of the proposed method is presented in detail; (3) more experimental results are provided to demonstrate the effectiveness of the proposed method.
Related work
Multimodal retrieval focuses on the task of searching for relevant multimedia content including images, texts, videos, audio and so on. In particular, for cross-modal retrieval, there are two main retrieval tasks: the image-query-text task and the text-query-image task. For the image-query-text task, an image is used to search for relevant textual documents in the text database. For the text-query-image task, a textual query is used to retrieve images from the image database. In this section,
Mathematical notations
Throughout this paper, uppercase bold characters are used to denote matrices and lowercase bold characters are used to denote vectors. Let $\mathbf{X} = [\mathbf{x}_1, \ldots, \mathbf{x}_n] \in \mathbb{R}^{d_x \times n}$ denote a set of $d_x$-dimensional data points from the image modality. Similarly, we use $\mathbf{Y} = [\mathbf{y}_1, \ldots, \mathbf{y}_n] \in \mathbb{R}^{d_y \times n}$ to denote a set of $d_y$-dimensional data points from the text modality. Let $\mathbf{L} \in \{0,1\}^{C \times n}$ denote the label matrix, where $C$ is the number of categories. In addition, we have a fine-grained semantic similarity matrix
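As a rough illustration of how a multilevel (rather than binary) similarity could be derived from label information, the following numpy sketch grades pairwise similarity by label overlap. It is a simplification under assumed thresholds: the paper's scheme additionally exploits the local structure of the data, which this sketch omits, and the symbol names follow the notation above rather than the authors' implementation.

```python
import numpy as np

def multilevel_similarity(L_query, L_db):
    """Grade pairwise similarity from label overlap (a simplified stand-in for the paper's
    scheme, which additionally exploits the local structure of the data)."""
    shared = L_query @ L_db.T                                    # labels shared by each pair
    union = L_query.sum(1, keepdims=True) + L_db.sum(1) - shared
    overlap = shared / np.maximum(union, 1)                      # Jaccard-style overlap in [0, 1]
    S = np.zeros_like(overlap)
    S[overlap > 0] = 0.5                                         # "less similar": some shared labels
    S[overlap >= 0.5] = 1.0                                      # "rather similar": strong overlap
    return S                                                     # 0 stays "dissimilar"

# Three instances over four categories
L = np.array([[1, 0, 1, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1]], dtype=float)
print(multilevel_similarity(L, L))
```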
Experiment
In order to evaluate the effectiveness of the proposed method, we conduct experiments on two widely-used real-world multimodal datasets: MIRFLICKR25K [48] and NUS-WIDE [49].
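The excerpt does not state the evaluation protocol, but mean average precision (mAP) is the customary metric for cross-modal hashing benchmarks such as MIRFLICKR25K and NUS-WIDE. A minimal sketch, assuming retrieval lists ranked by Hamming distance and ground-truth relevance defined by shared labels:

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, relevance):
    """mAP with ranking by Hamming distance; relevance[i, j] = 1 if database item j
    shares at least one label with query i."""
    aps = []
    for i, q in enumerate(query_codes):
        dist = np.count_nonzero(q != db_codes, axis=1)    # Hamming distance to every database code
        order = np.argsort(dist)                          # rank database items by distance
        rel = relevance[i, order]
        if rel.sum() == 0:
            continue                                      # skip queries with no relevant items
        prec_at_k = np.cumsum(rel) / (np.arange(len(rel)) + 1)
        aps.append((prec_at_k * rel).sum() / rel.sum())   # average precision for this query
    return float(np.mean(aps))
```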
Conclusion
In this paper, we propose a novel cross-modal hashing method, named Deep Multimodal Hashing with Multilevel Similarity Learning (DHMSL), for similarity search over large-scale multimedia data. The proposed method integrates multilevel similarity learning, deep representation learning and hash function learning into one unified framework. Specifically, DHMSL is a two-stage hashing framework which learns the binary hash codes and the hash functions separately. In the binary hash code learning stage,
Acknowledgments
This work was partially supported by the National Key Research and Development Program of China under Grant 2017YFC0820601, the National Natural Science Foundation of China (Grant Nos. 61772275, 61732007 and 61720106004) and the Natural Science Foundation of Jiangsu Province (Grant BK20170033).
References (49)
- Deep learning in neural networks: an overview, Neural Networks (2015).
- Supervised deep hashing for scalable face image retrieval, Pattern Recogn. (2018).
- Weakly-supervised multimodal hashing for scalable social image retrieval, IEEE Trans. Circuits Syst. Video Technol. (2018).
- Learning binary hash codes for large-scale image search.
- Semantic neighbor graph hashing for multimodal retrieval, IEEE Trans. Image Process. (2018).
- Locality-sensitive hashing scheme based on p-stable distributions.
- Supervised hashing with kernels.
- Supervised discrete hashing.
- W.-J. Li, S. Wang, W.-C. Kang, Feature learning based deep supervised hashing with pairwise labels, arXiv preprint...
- Deep hashing for compact binary codes learning.
- Modeling the shape of the scene: a holistic representation of the spatial envelope, Int. J. Comput. Vision.
- Distinctive image features from scale-invariant keypoints, Int. J. Comput. Vision.
- Evaluating bag-of-visual-words representations in scene classification.
- Deep learning, Nature.
- ImageNet classification with deep convolutional neural networks.
- Large-scale video classification with convolutional neural networks.
- Learning and transferring mid-level image representations using convolutional neural networks.
- L1-norm distance linear discriminant analysis based on an effective iterative algorithm, IEEE Trans. Circuits Syst. Video Technol.
- Deep residual learning for image recognition.
- Deep visual-semantic alignments for generating image descriptions.
- Show and tell: a neural image caption generator.
- A framework of joint low-rank and sparse regression for image memorability prediction, IEEE Trans. Circuits Syst. Video Technol.
- Low-rank multi-view embedding learning for micro-video popularity prediction, IEEE Trans. Knowl. Data Eng.
- Urban water quality prediction based on multi-task multi-view learning.
Qiuli Liu received the M.S. degree in Artistic Design from Nanjing Normal University, Nanjing, Jiangsu, China, in 2012. She is currently a Ph.D. candidate in Software Engineering at the University of Electronic Science and Technology of China. Her research interests include multimedia theory and technology, deep learning and multimedia retrieval.
Lu Jin received the B.E. degree in Measuring and Control Technology and Instrumentations from Northeast University at Qinhuangdao, Hebei, China, in 2010. She is currently a Ph.D. candidate in Computer Science and Technology at Nanjing University of Science and Technology. From 2015 to 2017, she worked as a visiting scholar in the Department of Computer Science at the University of Central Florida. Her research interests include multimedia computing, deep learning and multimedia retrieval. She received the Best Student Paper Award at ICIMCS 2018.
Zechao Li is currently a Professor at Nanjing University of Science and Technology. He received the Ph.D. degree from the National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences in 2013, and the B.E. degree from the University of Science and Technology of China in 2008. His research interests include intelligent media analysis, computer vision, etc. He was selected for the Young Talent Program of the China Association for Science and Technology, and received the Excellent Doctoral Dissertation award of the Chinese Academy of Sciences and the Excellent Doctoral Theses award of the China Computer Federation.
Jinhui Tang is a Professor in the School of Computer Science and Engineering, Nanjing University of Science and Technology, China. He received his B.E. and Ph.D. degrees in July 2003 and July 2008, respectively, both from the University of Science and Technology of China. From 2008 to 2010, he worked as a research fellow in the School of Computing, National University of Singapore. His current research interests include large-scale multimedia search. He has authored over 150 journal and conference papers in these areas. Prof. Tang is a co-recipient of the Best Paper Awards at ACM MM 2007, PCM 2011 and ICIMCS 2011, and the Best Student Paper Award at MMM 2016.