Neurocomputing

Volume 193, 12 June 2016, Pages 250-259

Semantic consistency hashing for cross-modal retrieval

https://doi.org/10.1016/j.neucom.2016.02.016

Abstract

The task of cross-modal retrieval is to query similar objects across modalities, such as using text to retrieve images and vice versa. However, most existing methods suffer from high computational complexity and storage cost in large-scale applications. Recently, hashing methods, which map high-dimensional data to compact binary codes, have attracted considerable attention due to their efficiency and low storage cost on large-scale datasets. In this paper, we propose a Semantic Consistency Hashing (SCH) method for cross-modal retrieval. SCH learns a shared semantic space that simultaneously takes both inter-modal and intra-modal semantic correlations into account. To preserve inter-modal semantic consistency, an identical representation is learned by non-negative matrix factorization for samples from different modalities. Meanwhile, a neighbor preserving algorithm is adopted to preserve the semantic consistency within each modality. In addition, an efficient optimization algorithm is proposed that reduces the training time complexity from the traditional O(N²) or higher to O(N). Extensive experiments on two public datasets demonstrate that the proposed approach significantly outperforms existing schemes.

Introduction

With the rapid development of information technology and the Internet, a single webpage may contain text, audio, images, video and so on. Although these data are represented in different modalities, they have strong semantic correlations. For example, Fig. 1 displays a number of documents collected from Wikipedia, each of which includes one image along with its surrounding text. Image–text pairs connected by a blue solid line have a strong semantic correlation; those connected by a blue dotted line are relevant to each other, i.e. they share the same semantic concept; and those connected by a red dotted line are irrelevant to each other. The task of cross-modal retrieval is to use one kind of media to retrieve similar samples from a dataset of a different modality, with the returned samples ranked by their correlation to the query. However, with the explosive growth of multimedia on the Internet, storage cost and efficiency are two main challenges in large-scale retrieval.

Hashing methods, which map samples from a high-dimensional feature space to a low-dimensional binary Hamming space, have received much attention due to their efficiency and low memory cost [1], [2], [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13]. However, most existing hashing schemes work only on a single modality [1], [2], [3], [4], [5], [6], and only a few works have addressed multi-modal retrieval so far [7], [9], [10], [11], [12]. Multi-modal hashing can generally be categorized into two types: multi-modal fusion hashing (MMFH) and cross-modal hashing (CMH). MMFH aims at generating better binary codes than single-modal hashing by exploiting the complementarity of the modalities [7], while CMH constructs a shared Hamming space in which similar samples can be retrieved from heterogeneous cross-modal datasets [9], [10], [11], [12]. In this paper, we focus on CMH. The key point of CMH is to find the correlation between different modalities in the Hamming space; however, how to learn a low-dimensional Hamming space over a heterogeneous cross-modal dataset remains a challenging issue.
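The efficiency argument for binary codes can be made concrete: once samples are encoded as compact codes, similarity search reduces to Hamming-distance ranking, which needs only bitwise XOR and bit counting. The following is a minimal NumPy sketch of that step; the packed-byte code layout and the toy data are illustrative assumptions, not part of the paper:

```python
import numpy as np

def hamming_distance(codes_db, code_query):
    """Hamming distances between one query code and a database of codes.

    codes_db: (N, n_bytes) uint8 array, each row a packed binary code.
    code_query: (n_bytes,) uint8 array.
    """
    # XOR yields a 1 bit wherever the codes disagree; counting those bits
    # gives the Hamming distance.
    xor = np.bitwise_xor(codes_db, code_query)
    return np.unpackbits(xor, axis=1).sum(axis=1)

# Toy example: 4 database items with 16-bit (2-byte) codes and one query.
rng = np.random.default_rng(0)
db = rng.integers(0, 256, size=(4, 2), dtype=np.uint8)
q = rng.integers(0, 256, size=2, dtype=np.uint8)
dist = hamming_distance(db, q)
print(dist)                # per-item Hamming distances
print(np.argsort(dist))    # ranking of database items by similarity
```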

Many recent works have focused on this issue. For example, the Canonical Correlation Analysis (CCA) hashing method maps samples from different modalities into a low-dimensional Hamming space by maximizing the correlation between the modalities [14]. Multimodal latent binary embedding (MLBE) employs binary latent factors in a probabilistic model to learn hash codes [15]. Co-Regularized Hashing (CRH) learns a low-dimensional Hamming space by mapping the data far from zero for each bit while effectively preserving inter-modal similarity [8]. Multimodal NN hashing (MM-NNH), proposed in [16], learns a group of hashing functions by preserving intra-modal and inter-modal similarity. However, the above cross-modal hashing methods learn hashing functions directly and separately for each modality, which may degrade performance because the learned Hamming space is not semantically discriminative.
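To make the CCA-style baseline concrete, the sketch below projects paired image and text features into a correlated subspace with scikit-learn's CCA and binarizes each projected dimension by median thresholding. The thresholding rule and the random stand-in features are illustrative assumptions, not necessarily the exact procedure of [14]:

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def cca_hash(X_img, X_txt, n_bits=16):
    """Toy CCA hashing: learn correlated projections for two modalities,
    then binarize at the per-dimension median of the projected data.
    (The thresholding choice is an assumption for illustration.)"""
    cca = CCA(n_components=n_bits)
    Z_img, Z_txt = cca.fit_transform(X_img, X_txt)   # paired training data
    B_img = (Z_img > np.median(Z_img, axis=0)).astype(np.uint8)
    B_txt = (Z_txt > np.median(Z_txt, axis=0)).astype(np.uint8)
    return cca, B_img, B_txt

# Paired image/text features (random stand-ins for visual and text descriptors).
rng = np.random.default_rng(0)
X_img = rng.standard_normal((200, 128))
X_txt = rng.standard_normal((200, 50))
cca, B_img, B_txt = cca_hash(X_img, X_txt, n_bits=8)
print(B_img.shape, B_txt.shape)   # (200, 8) (200, 8)
```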

To address this issue, a supervised scheme (SliM2) [17] embeds heterogeneous data into a semantic space by dictionary learning and sparse coding. In [18], the Latent Semantic Sparse Hashing (LSSH) algorithm learns a semantic space by sparse coding and matrix factorization: sparse coding captures the salient structure of text, matrix factorization learns the latent concepts of images, and finally a linear mapping matrix is learned to bridge the semantic spaces of text and image. Collective Matrix Factorization Hashing (CMFH) [19] projects samples into a common semantic space by collective matrix factorization, so that inter-modal semantic similarity is preserved effectively. The results of these methods show that learning a semantic space is helpful for cross-modal retrieval. However, they only consider preserving inter-modal semantic consistency and ignore intra-modal semantic consistency. Inter-modal semantic consistency aims at preserving the global similarity structure, while intra-modal semantic consistency aims at preserving the local similarity structure of each modality in the learned low-dimensional semantic space. Moreover, recent studies have shown that real-world high-dimensional data actually lie on a low-dimensional manifold [20], [21]. Hence, introducing intra-modal semantic consistency into the cross-modal retrieval framework is beneficial.

In this paper, we put forward a semantic consistency hashing method for cross-modal retrieval. We aim to efficiently learn binary codes for different modalities by jointly modeling intra-modal and inter-modal semantic consistency in a unified framework. To preserve inter-modal semantic consistency, an identical representation is learned by non-negative matrix factorization (NMF) for samples from different modalities. The main advantages of NMF are as follows: (1) non-negative representations are consistent with the cognition of the human brain [22], [23]; (2) the non-negativity constraint induces sparsity, and a relatively sparse representation can resist noise to a certain extent [24], which is beneficial for learning the shared semantic space in the presence of noisy labels. To preserve intra-modal semantic consistency, a neighbor preserving algorithm is utilized to preserve the local similarity structure, which allows richer information in the data to be exploited to learn a better shared semantic space.
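As a rough illustration of the neighbor preserving idea, the sketch below builds a k-NN affinity matrix for one modality and evaluates the usual graph-smoothness penalty Σ_ij W_ij ||v_i - v_j||², which is small when intra-modal neighbors remain close in the learned semantic space. The Gaussian weighting and the parameter choices are assumptions for illustration; the paper's exact regularizer may differ:

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_affinity(X, k=5, sigma=1.0):
    """Symmetric k-NN affinity matrix with Gaussian weights
    (an illustrative choice, not necessarily the paper's weighting)."""
    n = X.shape[0]
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    dist, idx = nn.kneighbors(X)          # first neighbor of each point is itself
    W = np.zeros((n, n))
    for i in range(n):
        for d, j in zip(dist[i, 1:], idx[i, 1:]):
            W[i, j] = W[j, i] = np.exp(-d ** 2 / (2 * sigma ** 2))
    return W

def neighbor_preserving_loss(V, W):
    """Sum_ij W_ij * ||v_i - v_j||^2: small when neighbors in the original
    modality stay close in the learned semantic space V (rows = samples)."""
    diff = V[:, None, :] - V[None, :, :]
    return float(np.sum(W * np.sum(diff ** 2, axis=2)))
```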

Our main contributions are as follows:

  • 1.

    We propose a semantic consistency hashing method to effectively find the semantic correlation between different modalities in the shared semantic space. Not only is inter-modal semantic consistency preserved by NMF, but intra-modal semantic consistency is also preserved by the neighbor preserving algorithm in the shared semantic space.

  • 2.

    We propose an efficient iterative optimization framework. In experiments, we find that satisfactory performance can be achieved in about 10–20 iterations. Meanwhile, the training time complexity is reduced from the traditional O(N²) or higher (such as the methods proposed in [8], [15], and [19]) to O(N).

  • 3.

    As for performance, SCH significantly outperforms existing approaches. Extensive experiments on two public datasets show that SCH outperforms the baseline algorithms by up to 16.48% in mean average precision.

The rest of this paper is organized as follows. Section 2 briefly introduces non-negative matrix factorization. Section 3 presents the components of the proposed method, including the formulation of SCH, the optimization procedure, hash code generation, and complexity analysis. Section 4 reports experimental results on two public datasets and gives the corresponding analysis. Finally, Section 5 provides concluding remarks.

Section snippets

Non-negative matrix factorization

NMF has been widely applied in many fields [25], [26], [27], [28], [29] and has attracted much attention owing to its clear theoretical interpretation and excellent performance. The NMF algorithm is as follows.

Given a non-negative matrix $M = [m_1, m_2, \ldots, m_N] \in \mathbb{R}_+^{d \times N}$, where each $m_i$ is a $d$-dimensional vector denoting one sample, NMF aims at obtaining two matrices $U \in \mathbb{R}_+^{d \times P}$ and $V \in \mathbb{R}_+^{P \times N}$ (where $P$ is the dimension of the latent space) satisfying $M \approx UV$. $U$ and $V$ can be obtained by minimizing the following objective function:
$$L = \mathop{\arg\min}_{U,V} \|M - UV\|_F^2 \quad \text{s.t.}\; U \ge 0,\; V \ge 0.$$
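For reference, the classical Lee–Seung multiplicative update rules minimize exactly this Frobenius-norm objective while keeping U and V non-negative. The sketch below is the plain NMF routine, not the paper's joint cross-modal optimization:

```python
import numpy as np

def nmf(M, P, n_iter=200, eps=1e-10, seed=0):
    """Minimize ||M - U V||_F^2 with U >= 0, V >= 0 via the classical
    multiplicative update rules (Lee & Seung)."""
    d, N = M.shape
    rng = np.random.default_rng(seed)
    U = rng.random((d, P))
    V = rng.random((P, N))
    for _ in range(n_iter):
        # V <- V * (U^T M) / (U^T U V); eps avoids division by zero.
        V *= (U.T @ M) / (U.T @ U @ V + eps)
        # U <- U * (M V^T) / (U V V^T)
        U *= (M @ V.T) / (U @ V @ V.T + eps)
    return U, V

# Toy check: factorize a random non-negative matrix.
M = np.abs(np.random.default_rng(1).standard_normal((20, 50)))
U, V = nmf(M, P=5)
print(np.linalg.norm(M - U @ V) / np.linalg.norm(M))  # relative reconstruction error
```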

Semantic consistency hashing

In this section, we describe the SCH algorithm in detail. First, we present the main formulation of SCH. Then a three-step iterative algorithm is proposed to optimize the problem. Finally, hash code generation and complexity analysis are presented.
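The snippet above only names the hash code generation step. As a generic illustration of how such codes are commonly produced, the learned semantic representation can be binarized by thresholding each dimension; the per-dimension mean threshold used below is an assumption, not necessarily SCH's exact quantization rule:

```python
import numpy as np

def generate_codes(V, threshold=None):
    """Binarize a learned semantic representation V (P x N, one column per
    sample) into P-bit codes by per-dimension thresholding.
    The per-dimension mean is an illustrative default."""
    if threshold is None:
        threshold = V.mean(axis=1, keepdims=True)
    B = (V > threshold).astype(np.uint8)      # P x N binary matrix
    return np.packbits(B.T, axis=1)           # one packed code per sample

V = np.abs(np.random.default_rng(2).standard_normal((16, 100)))  # 16-bit codes
codes = generate_codes(V)
print(codes.shape)  # (100, 2): 100 samples, 16 bits packed into 2 bytes each
```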

Experiments

In this section, we evaluate the performance of the proposed algorithm on two public real-world datasets. First, the datasets and evaluation criteria used in the experiments are introduced, and then the parameter settings and tuning are presented. Finally, we conduct experimental comparisons with several existing approaches, including CCA [14], CRH [8], SCM [33], LSSH [18], CMFH [19], and STMH [10], and demonstrate the results.
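The standard evaluation criterion in this setting is mean average precision (MAP) over Hamming-ranked results. A minimal sketch, assuming binary relevance defined by shared class labels, is given below:

```python
import numpy as np

def average_precision(relevance):
    """AP for one query, given a 0/1 relevance vector already sorted by the
    retrieval ranking."""
    relevance = np.asarray(relevance, dtype=float)
    if relevance.sum() == 0:
        return 0.0
    precision_at_k = np.cumsum(relevance) / (np.arange(len(relevance)) + 1)
    return float((precision_at_k * relevance).sum() / relevance.sum())

def mean_average_precision(distances, query_labels, db_labels):
    """MAP over all queries: rank database items by distance (e.g. Hamming),
    mark items sharing the query's label as relevant, and average the APs."""
    aps = []
    for i, d in enumerate(distances):
        order = np.argsort(d)
        rel = (db_labels[order] == query_labels[i]).astype(int)
        aps.append(average_precision(rel))
    return float(np.mean(aps))
```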

In experiments, two cross-modal retrieval tasks are conducted: (1) retrieving

Conclusions

In this paper, we propose a Semantic Consistency Hashing (SCH) method for efficient similarity search over large-scale heterogeneous datasets. In particular, by leveraging both inter-modal and intra-modal semantic consistency, hashing functions are learned for different modalities. An iterative updating scheme is then applied to efficiently derive locally optimal solutions, and the time complexity is reduced to O(N). In experiments, the results show significant improvement compared with the

Acknowledgment

This work is supported by the Foundation for Innovative Research Groups of the NSFC (Grant no. 71421001), National Natural Science Foundation of China (Grant nos. 61502073, 61172109), the Fundamental Research Funds for the Central Universities (No. DUT14QY03) and the Open Projects Program of National Laboratory of Pattern Recognition (No. 201407349).


References (36)

  • Z. Lin, G. Ding, M. Hu, J. Wang, Semantics-preserving hashing for cross-view retrieval, In: Proceedings of the IEEE...
  • D. Wang, X. Gao, X. Wang, L. He, Semantic topic multimodal hashing for cross-media retrieval, In: Proceedings of the...
  • B. Wu, Q. Yang, W.-S. Zheng, Y. Wang, J. Wang, Quantized correlation hashing for fast cross-modal search, In:...
  • J. Song, Y. Yang, Y. Yang, Z. Huang, H. Shen, Inter-media hashing for large-scale retrieval from heterogeneous data
  • T. Mei et al.

    Multimedia search reranking: a literature survey

    ACM Comput. Surv. (CSUR)

    (2014)
  • Y. Gong, S. Lazebnik, Iterative quantization: a procrustean approach to learning binary codes, In: IEEE Conference on...
  • Y. Zhen, D.-Y. Yeung, A probabilistic model for multimodal hash function learning, In: ACM Conference on Knowledge...
  • J. Masci et al.

    Multimodal similarity-preserving hashing

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2014)

    Tao Yao received his Master degree from Wuhan University of Technology, China, in 2006. Currently, he is seeking his Ph.D. degree in School of Information and Communication Engineering at Dalian University of Technology, China. From 2006 to now, he works as a Lecturer in School of Information and Electrical Engineering at Ludong University, China. His research interests include multimedia retrieval, computer vision and machine learning.

    Xiangwei Kong received her Ph.D. degree in Management Science and Engineering from Dalian University of Technology, China, in 2003. From 2006 to 2007, she was a visiting researcher in Department of Computer Science at Purdue University, USA. She is currently a professor in the School of Information and Communication Engineering at Dalian University of Technology, China. Her research interests include digital image processing and recognition, multimedia information security, digital media forensics, image retrieval and mining, multisource information fusion, knowledge management and business intelligence.

    Haiyan Fu received her Ph.D. from Dalian University of Technology, China, in 2014. She is currently an associate professor in the School of Information and Communication Engineering at Dalian University of Technology, China. Her research interests are in the areas of image retrieval and computer vision.

    Qi Tian received the B.E. degree in electronic engineering from Tsinghua University, China, in 1992, the M.S. degree in electrical and computer engineering from Drexel University in 1996 and the Ph.D. degree in electrical and computer engineering from the University of Illinois, Urbana-Champaign in 2002. He is currently a Professor in the Department of Computer Science at the University of Texas at San Antonio (UTSA). He took a one-year faculty leave at Microsoft Research Asia (MSRA) during 2008–2009. His research interests include multimedia information retrieval and computer vision.

