Semantic consistency hashing for cross-modal retrieval
Introduction
With the rapid development of information technology and the Internet, a single webpage may contain text, audio, images, video and so on. Although these data are represented in different modalities, they are strongly semantically correlated. For example, Fig. 1 displays a number of documents collected from Wikipedia, each consisting of one image along with its surrounding text. An image and a text connected by a blue solid line form a pair with strong semantic correlation. Those connected by a blue dotted line are relevant to each other, i.e. they share the same semantic concept, while those connected by a red dotted line are irrelevant to each other. The task of cross-modal retrieval is to use a query from one modality to retrieve similar samples from a dataset of a different modality, with the returned samples ranked by their correlation with the query. However, with the explosive growth of multimedia on the Internet, storage cost and efficiency are two main challenges in large-scale retrieval.
Hashing methods, which map samples from a high-dimensional feature space to a low-dimensional binary Hamming space, have received much attention due to their efficiency and low memory cost [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. However, most existing hashing schemes work only on a single modality [1], [2], [3], [4], [5], [6]. Only a few works have addressed multi-modal retrieval so far [7], [9], [10], [11], [12]. Multi-modal hashing can generally be categorized into two types: multi-modal fusion hashing (MMFH) and cross-modal hashing (CMH). MMFH aims to generate better binary codes than single-modal hashing by taking advantage of the complementarity of the modalities [7], whereas CMH constructs a shared Hamming space in which similar samples can be retrieved across a heterogeneous cross-modal dataset [9], [10], [11], [12]. In this paper, we focus on CMH. The key to CMH is finding the correlation between different modalities in Hamming space. However, learning a low-dimensional Hamming space over a heterogeneous cross-modal dataset remains a challenging issue.
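The efficiency argument rests on the fact that, once samples are encoded as short binary codes, retrieval reduces to ranking by Hamming distance. The following is a minimal sketch of that step, assuming codes are stored as 0/1 NumPy arrays; the function name `hamming_rank` and the toy codes are illustrative and do not come from the paper.

```python
import numpy as np

def hamming_rank(query_code, database_codes):
    """Rank database items by Hamming distance to a query code.

    query_code: 0/1 array of shape (n_bits,)
    database_codes: 0/1 array of shape (n_items, n_bits)
    Returns the item indices in ranked order and their distances.
    """
    # Hamming distance = number of bit positions where the codes differ.
    distances = np.count_nonzero(database_codes != query_code, axis=1)
    # A stable sort keeps the original order among ties.
    order = np.argsort(distances, kind="stable")
    return order, distances[order]

# Toy database of four 4-bit codes (hypothetical values).
database_codes = np.array([
    [0, 0, 1, 1],
    [1, 0, 1, 0],
    [1, 1, 1, 0],
    [0, 1, 0, 1],
], dtype=np.uint8)
query_code = np.array([1, 0, 1, 0], dtype=np.uint8)

order, sorted_distances = hamming_rank(query_code, database_codes)
```

In a cross-modal setting the query code would come from one modality (say, text) and the database codes from another (say, images), which only makes sense if both are embedded in the same Hamming space.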
Many recent works have focused on this issue. For example, the Canonical Correlation Analysis (CCA) hashing method maps samples from different modalities to a low-dimensional Hamming space by maximizing the correlation between the modalities [14]. Multimodal latent binary embedding (MLBE) employs binary latent factors within a probabilistic model to learn hash codes [15]. Co-Regularized Hashing (CRH) learns a low-dimensional Hamming space by mapping the data far from zero for each bit while effectively preserving inter-modal similarity [8]. Multimodal NN hashing (MM-NNH), proposed in [16], learns a group of hashing functions that preserve intra-modal and inter-modal similarity. However, the above cross-modal hashing methods learn hashing functions directly and separately for each modality, which may degrade performance because the learned Hamming space is not semantically discriminative.
To address the above issue, a supervised scheme (SliM2) [17] was proposed to embed heterogeneous data into a semantic space by dictionary learning and sparse coding. In [18], the Latent Semantic Sparse Hashing (LSSH) algorithm was proposed to learn a semantic space by sparse coding and matrix factorization. Sparse coding is used to capture the salient structure of images, and matrix factorization is used to learn the latent concepts of text. Finally, a linear mapping matrix is learned to bridge the semantic spaces of text and image. Collective Matrix Factorization Hashing (CMFH) [19] projects samples to a common semantic space by collective matrix factorization, so that inter-modal semantic similarity is effectively preserved. The results of the above methods show that learning a semantic space is helpful for cross-modal retrieval. However, those methods preserve only inter-modal semantic consistency and ignore intra-modal semantic consistency. Inter-modal semantic consistency aims to preserve the global similarity structure, while intra-modal semantic consistency aims to preserve the local similarity structure of each modality in the learned low-dimensional semantic space. Moreover, recent studies have shown that real-world samples from a high-dimensional space actually lie on a low-dimensional manifold [20], [21]. Hence it is beneficial to introduce intra-modal semantic consistency into the cross-modal retrieval framework.
In this paper, we put forward a semantic consistency hashing (SCH) method for cross-modal retrieval. We aim to efficiently learn binary codes for different modalities by combining intra-modal and inter-modal semantic consistency in a single framework. To preserve inter-modal semantic consistency, an identical representation is learned by non-negative matrix factorization (NMF) for samples from different modalities. The main advantages of NMF are as follows: (1) A non-negative representation is consistent with the cognition of the human brain [22], [23]. (2) The non-negativity constraint induces sparsity, and a relatively sparse representation can resist noise to a certain extent [24], which is beneficial for learning the shared semantic space from noisy labels. To preserve intra-modal semantic consistency, a neighbor preserving algorithm is used to preserve the local similarity structure. This allows richer information in the data to be exploited to learn a better shared semantic space.
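The idea of learning one identical representation for samples from two modalities can be sketched as a joint NMF in which both data matrices share a single coefficient matrix. The code below is a simplified illustration under stated assumptions, not the proposed algorithm itself: it omits the neighbor preserving term entirely, and the weighted objective, the multiplicative updates, and the names `joint_nmf`, `X1`, `X2` and `lam` are illustrative choices that do not come from the paper.

```python
import numpy as np

def joint_nmf(X1, X2, n_latent, lam=0.5, n_iter=100, eps=1e-9, seed=0):
    """Factorize two non-negative matrices with one shared representation V.

    Assumed objective (illustrative, not taken from the paper):
        lam * ||X1 - U1 V||_F^2 + (1 - lam) * ||X2 - U2 V||_F^2
    X1: (d1, n_samples), X2: (d2, n_samples), columns are paired samples.
    """
    rng = np.random.default_rng(seed)
    n_samples = X1.shape[1]
    U1 = rng.random((X1.shape[0], n_latent))
    U2 = rng.random((X2.shape[0], n_latent))
    V = rng.random((n_latent, n_samples))
    losses = []
    for _ in range(n_iter):
        # Multiplicative updates keep every factor non-negative.
        U1 *= (X1 @ V.T) / (U1 @ V @ V.T + eps)
        U2 *= (X2 @ V.T) / (U2 @ V @ V.T + eps)
        numer = lam * (U1.T @ X1) + (1 - lam) * (U2.T @ X2)
        denom = (lam * (U1.T @ U1) + (1 - lam) * (U2.T @ U2)) @ V + eps
        V *= numer / denom
        loss = (lam * np.linalg.norm(X1 - U1 @ V) ** 2
                + (1 - lam) * np.linalg.norm(X2 - U2 @ V) ** 2)
        losses.append(loss)
    return U1, U2, V, losses

# Toy paired data: 30 samples with 20-d "image" and 10-d "text" features.
rng = np.random.default_rng(1)
X_image = rng.random((20, 30))
X_text = rng.random((10, 30))
U1, U2, V, losses = joint_nmf(X_image, X_text, n_latent=4)
```

Because each column of `V` represents an image and its paired text at once, inter-modal consistency is built into the factorization; preserving local neighborhood structure within each modality would require an additional term that this sketch leaves out.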
Our main contributions are as follows:
1. We propose a semantic consistency hashing method that effectively finds the semantic correlation between different modalities in the shared semantic space. Inter-modal semantic consistency is preserved by NMF, and intra-modal semantic consistency is preserved by a neighbor preserving algorithm in the same space.
2. We propose an efficient iterative optimization framework. In experiments, we find that satisfactory performance can be achieved in about iterations. Meanwhile, the training time complexity is reduced from traditional or higher (such as the methods proposed in [8], [15], and [19]) to O(N).
3. In terms of performance, SCH significantly outperforms existing approaches. Extensive experiments on two public datasets show that SCH outperforms the baseline algorithms by up to 16.48% in mean average precision.
Non-negative matrix factorization
NMF has been widely applied in many fields [25], [26], [27], [28], [29], and has attracted much attention owing to its theoretical interpretability and excellent performance. The algorithm of NMF is as follows.
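Given a matrix $M = [m_1, \dots, m_N] \in \mathbb{R}^{d \times N}$, where each $m_i$ is a d-dimensional vector denoting one sample, NMF seeks two non-negative matrices $U \in \mathbb{R}^{d \times P}$ and $V \in \mathbb{R}^{P \times N}$ (where P is the dimension of the latent space) satisfying $M \approx UV$. U and V can be obtained by minimizing the following objective function: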
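For reference, the standard NMF objective is the squared Frobenius reconstruction error under non-negativity constraints, and the multiplicative rules of Lee and Seung are one common way to minimize it. The form below is that textbook formulation, assumed here rather than confirmed as the exact expression used in this paper; $M$ denotes the $d \times N$ data matrix with columns $m_i$, $\odot$ is the element-wise product, and the fractions are element-wise divisions.

```latex
\[
\begin{aligned}
&\min_{U \ge 0,\; V \ge 0} \; \lVert M - UV \rVert_F^2, \\
&U \leftarrow U \odot \frac{M V^{\top}}{U V V^{\top}}, \qquad
 V \leftarrow V \odot \frac{U^{\top} M}{U^{\top} U V}.
\end{aligned}
\]
```

Starting from non-negative initial values, these updates keep $U$ and $V$ non-negative, and the objective does not increase from one iteration to the next.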
Semantic consistency hashing
In this section, we describe the SCH algorithm in detail. First, we present the main formulation of SCH. Then a three-step iterative algorithm is proposed to solve the optimization problem. Finally, hash code generation and a complexity analysis are presented.
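The details of hash code generation are not included in this section. As a rough illustration only, one common way to turn a real-valued latent representation into binary codes is to threshold each latent dimension at its mean; this rule is an assumption and may differ from the one used by SCH, and the name `binarize` is hypothetical.

```python
import numpy as np

def binarize(V):
    """Threshold each latent dimension at its mean over all samples.

    V: real-valued array of shape (n_bits, n_samples).
    Returns 0/1 codes of shape (n_samples, n_bits).
    NOTE: mean thresholding is an assumed, commonly used rule,
    not necessarily the one used in the paper.
    """
    thresholds = V.mean(axis=1, keepdims=True)
    return (V > thresholds).astype(np.uint8).T

# Toy representation: 2 latent dimensions, 3 samples (hypothetical values).
V = np.array([[0.1, 0.9, 0.4],
              [2.0, 1.0, 3.0]])
codes = binarize(V)
```

Thresholding at the mean (or median) tends to give each bit a roughly balanced split between 0 and 1, which is why it is a frequent default in hashing pipelines.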
Experiments
In this section, we evaluate the performance of the proposed algorithm on two public real-world datasets. First, the datasets and evaluation criteria used in the experiments are introduced, and then the parameter settings and tuning are presented. Finally, we compare the proposed method experimentally with several existing approaches, namely CCA [14], CRH [8], SCM [33], LSSH [18], CMFH [19] and STMH [10], and report the results.
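Mean average precision (mAP), the measure quoted in the contributions, can be computed from ranked relevance judgments as sketched below. The code follows the common definition of average precision over a full ranked list; the exact protocol used in the paper (for example, any cutoff on the number of returned samples) is not given in this section, and the function names are illustrative.

```python
import numpy as np

def average_precision(relevance):
    """Average precision for one query.

    relevance: sequence of 0/1 flags in ranked order (1 = relevant).
    """
    relevance = np.asarray(relevance, dtype=float)
    n_relevant = relevance.sum()
    if n_relevant == 0:
        return 0.0  # no relevant item was retrieved for this query
    ranks = np.arange(1, len(relevance) + 1)
    precision_at_k = np.cumsum(relevance) / ranks
    # Average the precision values at the ranks of the relevant items.
    return float((precision_at_k * relevance).sum() / n_relevant)

def mean_average_precision(ranked_relevance_lists):
    """Mean of the per-query average precision values."""
    return float(np.mean([average_precision(r) for r in ranked_relevance_lists]))

# Two toy queries with hypothetical relevance flags.
ap_example = average_precision([1, 0, 1, 0])
map_example = mean_average_precision([[1, 0, 1, 0], [0, 1, 0, 0]])
```

For the first toy query the relevant items sit at ranks 1 and 3, so the average precision is (1 + 2/3) / 2 = 5/6; the second query has a single relevant item at rank 2 and scores 1/2.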
In experiments, two cross-modal retrieval tasks are conducted: (1) retrieving
Conclusions
In this paper, we propose a Semantic Consistency Hashing (SCH) method for efficient similarity search over large-scale heterogeneous datasets. In particular, hashing functions are learned for the different modalities by leveraging both inter-modal and intra-modal semantic consistency. An iterative updating scheme is then applied to efficiently derive locally optimal solutions. The time complexity is reduced to O(N). In experiments, the results show significant improvement compared with the
Acknowledgment
This work is supported by the Foundation for Innovative Research Groups of the NSFC (Grant no. 71421001), National Natural Science Foundation of China (Grant nos. 61502073, 61172109), the Fundamental Research Funds for the Central Universities (No. DUT14QY03) and the Open Projects Program of National Laboratory of Pattern Recognition (No. 201407349).
Tao Yao received his Master's degree from Wuhan University of Technology, China, in 2006. He is currently pursuing his Ph.D. degree in the School of Information and Communication Engineering at Dalian University of Technology, China. Since 2006, he has worked as a Lecturer in the School of Information and Electrical Engineering at Ludong University, China. His research interests include multimedia retrieval, computer vision and machine learning.
References (36)
- et al., Large-scale image retrieval based on boosting iterative quantization hashing with query-adaptive reranking, Neurocomputing (2013)
- et al., Linear spectral hashing, Neurocomputing (2014)
- et al., Graph dual regularization non-negative matrix factorization for co-clustering, Pattern Recognit. (2012)
- et al., Two algorithms for orthogonal nonnegative matrix factorization with application to clustering, Neurocomputing (2014)
- B. Kulis, K. Grauman, Kernelized locality-sensitive hashing for scalable image search, In: IEEE International...
- et al., Scalable similarity search with topology preserving hashing, IEEE Trans. Image Process. (2014)
- et al., Batch-orthogonal locality-sensitive hashing for angular similarity, IEEE Trans. Pattern Anal. Mach. Intell. (2014)
- et al., Semi-supervised hashing with semantic confidence for large scale visual search, ACM Spec. Interest Group Inf. Retr. (2015)
- J.C. Caicedo, F.A. González, Multimodal fusion for image retrieval using matrix factorization, In: ACM International...
- et al., Co-regularized hashing for multimodal data, Neural Inf. Process. Syst. (2012)
- Multimedia search reranking: a literature survey, ACM Comput. Surv. (CSUR)
- Multimodal similarity-preserving hashing, IEEE Trans. Pattern Anal. Mach. Intell.
Xiangwei Kong received her Ph.D. degree in Management Science and Engineering from Dalian University of Technology, China, in 2003. From 2006 to 2007, she was a visiting researcher in Department of Computer Science at Purdue University, USA. She is currently a professor in the School of Information and Communication Engineering at Dalian University of Technology, China. Her research interests include digital image processing and recognition, multimedia information security, digital media forensics, image retrieval and mining, multisource information fusion, knowledge management and business intelligence.
Haiyan Fu received her Ph.D. from Dalian University of Technology, China, in 2014. She is currently an associate professor in the School of Information and Communication Engineering at Dalian University of Technology, China. Her research interests are in the areas of image retrieval and computer vision.
Qi Tian received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996 and the Ph.D. degree in electrical and computer engineering from the University of Illinois, Urbana-Champaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008–2009. His research interests include multimedia information retrieval and computer vision.