1 Introduction

The volume of multimedia data is growing explosively with the rapid development of information technology, which has created strong demand for hashing-based approximate nearest neighbor (ANN) search. The basic idea of hashing methods is to embed the original high-dimensional data into compact binary codes, so that Hamming distances can be computed quickly with hardware-accelerated bitwise XOR operations.
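The XOR-based distance computation can be illustrated with a minimal Python sketch. The helper names `pack_code` and `hamming_distance` are hypothetical and chosen only for this example; codes are assumed to be sequences of +1/-1 values, matching the code alphabet used later in the paper.

```python
def pack_code(code):
    """Pack a sequence of +1/-1 values into an integer, one bit per entry."""
    bits = 0
    for value in code:
        bits = (bits << 1) | (1 if value > 0 else 0)
    return bits


def hamming_distance(a, b):
    """Count differing bits: XOR marks the mismatches, then count the ones."""
    return bin(a ^ b).count("1")


query = pack_code([1, -1, 1, 1])
item = pack_code([-1, -1, 1, -1])
distance = hamming_distance(query, item)
```

For codes of length k with entries in {-1, 1}, the Hamming distance equals (k - a·b)/2, where a·b is the inner product of the two codes, which is why objectives on code inner products also control Hamming distances.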

Most previous hashing methods focus on single-modal data. One of the best-known is locality-sensitive hashing (LSH) [3], which projects data samples from the original feature space to the Hamming space while preserving their similarity as much as possible. To achieve better retrieval performance, various extensions of LSH have been proposed to learn more compact codes, such as PCA-based hashing [14], manifold-learning-based hashing [7], and kernel-learning-based hashing [5, 6, 13, 19]. Spectral Hashing (SH) [16] generates hash codes by thresholding a subset of the eigenvectors of the graph Laplacian constructed on the data samples. Co-Regularized Hashing (CRH) [22] presents a boosted co-regularization framework to learn the hash functions for each bit. Supervised Hashing with Pseudo Labels (SHPL) [10] uses the cluster centers of the training data to generate pseudo labels, which are used to make the hash codes more discriminative. Supervised Hashing with Kernels (KSH) [6] constructs the hash functions by optimizing the code inner products.

The aforementioned methods apply only to a single modality. However, most data in real-world applications come in multiple modalities. For instance, a web page may contain both images and text, and a YouTube video often has relevant tags. Consequently, more and more research interest has been devoted to cross-modal hashing. CMFH [2] was among the first to learn cross-modal hash functions using collective matrix factorization, and it aims to generate unified hash codes for each instance. In [20], Zhang and Li proposed an algorithm with linear time complexity to learn hash functions, which makes it suitable for large-scale data. Quantized Correlation Hashing (QCH) [17] jointly learns the binary codes and minimizes the quantization loss. Kernelized Cross-Modal Hashing for Multimedia Retrieval (KCH) [11] maps data from different modalities into a common kernel space by canonical correlation analysis. Notably, multi-kernel learning has emerged as an effective approach to cross-modal hashing, since multiple kernels can exploit the complementary properties of the individual kernels. In [24], Zhou et al. proposed a kernelized cross-modal hashing algorithm embedded in a boosting framework, but it uses only a single kernel. Boosting Multi-kernel Locality-Sensitive Hashing (BMKLSH) [18] uses multi-kernel learning to produce hash codes, and the experimental results show its superiority over KLSH [5], which is based on single-kernel learning.

Motivated by the success of multi-kernel learning, we propose a supervised cross-modal hashing approach based on multi-kernel learning, named Multiple Kernel with Semantic Correlation Hashing (MKSH). Unlike the existing single-kernel methods [4, 6, 21, 23], we learn multi-kernel hash functions. Moreover, unlike the existing multi-kernel hashing approaches [15], which simply assign the same weight to every kernel, we use an alternating optimization strategy to learn the kernel combination coefficients and the hash functions simultaneously, which can lead to higher retrieval accuracy. Our contributions are summarized as follows:

  • We propose a novel cross-modal hashing algorithm utilizing multi-kernel learning.

  • In order to find the optimal allocation of different kernels, we propose an iterative method to solve the objective function.

  • To further enhance the algorithm performance, we utilize a sequential strategy to learn hash functions.

2 Proposed Algorithm

In this section, we detail the procedure of our hashing approach. Let \(O=\{o_i\}_{i=1}^n\) denote a set of multi-view samples and let \(\mathbf X =\{x_i\}_{i=1}^n\) and \(\mathbf Y =\{y_j\}_{j=1}^n\) represent two different views of O, where \(x_i\in \mathfrak {R}^{d_x}\) and \(y_j\in \mathfrak {R}^{d_y}\). The goal of MKSH is to learn one hash function for each modality: \(f(x)=\left[ f_{(1)}(x), f_{(2)}(x),\ldots ,f_{(k)}(x)\right] :\mathfrak {R}^{d_x}\rightarrow \{-1,1\}^k\) and \(g(y)=\left[ g_{(1)}(y),g_{(2)}(y),\ldots ,g_{(k)}(y)\right] :\mathfrak {R}^{d_y}\rightarrow \{-1,1\}^k\), where k denotes the length of the hash codes.

2.1 Learning Hash Functions

We use multiple kernels to define the mapping function in each modality as:

$$\begin{aligned} \left\{ \begin{aligned} K(x_i)=\left[ \mu _1K_1(x_i^{(1)})+\mu _2K_2(x_i^{(2)})+\ldots +\mu _MK_M(x_i^{(M)})\right] \\ K(y_j)=\left[ \mu _1K_1(y_j^{(1)})+\mu _2K_2(y_j^{(2)})+\ldots +\mu _MK_M(y_j^{(M)})\right] \\ \end{aligned} \right. \end{aligned}$$
(1)

where M indicates the number of kernels, and \(K_M(x_i^{(M)})\) is defined as \(K_M(x_i^{(M)})=k_M(\bar{x_i},x_j)\) (\(K_M(y_j^{(M)})=k_M(\bar{y_i},y_j)\)), and \(\bar{x}\in \mathbf{X }\) (\(\bar{y}\in \mathbf Y \)) are landmarks. We can use clustering methods to obtain landmarks. Then we define a prediction function with kernel as follows:

$$\begin{aligned} p(x)=\sum _{j=1}^mK(x_j)w_j-b \end{aligned}$$
(2)

where m is the number of landmarks, \(b\in \mathfrak {R}\) is the bias, and \(w_j\in \mathfrak {R}\) is the coefficient of the jth landmark. As a fast alternative to the median, following [6], we set \(b=\frac{1}{n}\sum _{i=1}^n\sum _{j=1}^mK(x_j)w_j\), i.e., the mean of the uncentered prediction over the n training samples. Then we have:

$$\begin{aligned} \begin{aligned} p(x)&=\sum _{j=1}^m\left( K(x_j)-\frac{1}{n}\sum _{i=1}^nK(x_j)\right) w_j\\&=\mathbf W ^TK(x). \end{aligned} \end{aligned}$$
(3)
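The weighted kernel combination of Eq. (1) and the mean-centered prediction of Eq. (3) can be sketched with NumPy as follows. This is an illustration under several assumptions that are not taken from the paper: the base kernels are two Gaussian RBF kernels with hypothetical widths, the landmarks are drawn at random instead of by clustering, and the coefficient matrix `W` has one row per landmark and one column per bit.

```python
import numpy as np


def rbf_kernel(X, landmarks, width):
    """Gaussian RBF kernel between every sample and every landmark."""
    sq_dist = ((X[:, None, :] - landmarks[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-sq_dist / (2.0 * width ** 2))


def multi_kernel_features(X, landmarks, widths, mu):
    """Weighted sum of base kernels (Eq. 1), centered over the samples (Eq. 3)."""
    K = sum(w * rbf_kernel(X, landmarks, s) for w, s in zip(mu, widths))
    return K - K.mean(axis=0, keepdims=True)


rng = np.random.default_rng(0)
X = rng.normal(size=(8, 5))                           # n = 8 samples, d_x = 5
landmarks = X[rng.choice(8, size=3, replace=False)]   # m = 3 landmarks (random, an assumption)
mu = np.array([0.5, 0.5])                             # kernel weights on the simplex
K = multi_kernel_features(X, landmarks, widths=[0.5, 2.0], mu=mu)

W = rng.normal(size=(3, 4))                           # m x k coefficients, k = 4 bits (assumed shape)
P = K @ W                                             # row i holds p(x_i) for every bit
```

Because each column of `K` is centered, every column of `P` has zero mean over the training samples, which is the effect of the bias b. For an unseen query, the column means computed on the training set would have to be reused.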

The hashing functions are defined as follows:

$$\begin{aligned} \left\{ \begin{aligned} f(x)=\text {sgn}\left( \mathbf W _x^TK(x)\right) \\ g(y)=\text {sgn}\left( \mathbf W _y^TK(y)\right) \\ \end{aligned} \right. \end{aligned}$$
(4)

where \(\text {sgn}(u)\) is 1 if \(u>0\) and \(-1\) otherwise, and \(\mathbf W _x\) and \(\mathbf W _y\) are the projection matrices, with \(\mathbf W _x\in \mathfrak {R}^{d_x\times k}\). We use the cosine similarity between the semantic label vectors to construct the pairwise semantic similarity \(\tilde{\mathbf{S }}_{ij}=\left( l_i\cdot l_j\right) /\left( \Vert l_i\Vert _2\Vert l_j\Vert _2\right) \), where \(l_i\) and \(l_j\) are the label vectors of samples i and j. We store the normalized label information in a matrix \(\mathbf L \) with entries \(L_{ij}=l_{i,j}/\Vert l_i\Vert _2\), where \(L_{ij}\) denotes the element in the ith row and the jth column of \(\mathbf L \), so that \(\tilde{\mathbf{S }}=\mathbf L \mathbf L ^T\). Finally, we apply an element-wise linear transformation to \(\tilde{\mathbf{S }}\) to obtain the semantic similarity matrix \(\mathbf S \) as follows:

$$\begin{aligned} \mathbf S =2\mathbf L \mathbf L ^T-\mathbf 1 _n\mathbf 1 _n^T. \end{aligned}$$
(5)

where every entry of the semantic similarity matrix satisfies \(\mathbf S _{ij}\in [-1,1]\), and \(\mathbf 1 _n\) is an all-one column vector of length n. We then define the objective function, which minimizes the squared error between the code inner products and the semantic similarities:

$$\begin{aligned} \mathop {\min }\limits _{f,g}\sum _{i,j}\left( f(x_i)^Tg(y_j)-\mathbf S _{ij}\right) ^2 \end{aligned}$$
(6)

Eq. (6) can be rewritten as:

$$\begin{aligned} \mathop {\min }\limits _{\mathbf W _x,\mathbf W _y}\left\| \text {sgn}\left( K(x)\mathbf W _x\right) \text {sgn}\left( K(y)\mathbf W _y\right) ^T-\mathbf S \right\| _F^2. \end{aligned}$$
(7)
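As a small worked example of Eq. (5) and the squared-error objective of Eq. (6), the following NumPy sketch builds the similarity matrix from toy labels and evaluates the loss for given codes. The function names and the toy data are assumptions made for illustration, and every sample is assumed to carry at least one label so that the normalization is well defined.

```python
import numpy as np


def semantic_similarity(labels):
    """Eq. (5): S = 2 L L^T - 1 1^T, with L the row-normalized label matrix."""
    labels = np.asarray(labels, dtype=float)
    # Assumes no all-zero label row; otherwise the norm below would be zero.
    L = labels / np.linalg.norm(labels, axis=1, keepdims=True)
    n = L.shape[0]
    return 2.0 * L @ L.T - np.ones((n, n))


def hashing_loss(Fx, Gy, S):
    """Squared Frobenius error between code inner products and S (Eq. 6)."""
    return float(np.sum((Fx @ Gy.T - S) ** 2))


labels = [[1, 0], [1, 0], [0, 1]]     # toy one-hot labels for n = 3 samples
S = semantic_similarity(labels)
Fx = np.array([[1], [1], [-1]])       # 1-bit codes for modality x
Gy = np.array([[1], [1], [-1]])       # 1-bit codes for modality y
loss = hashing_loss(Fx, Gy, S)
```

In this toy case samples with the same label get similarity 1 and samples with different labels get -1, so one bit per sample is enough to reproduce S exactly and the loss is zero.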

2.2 Learning Projection Matrices

The problem in Eq. (7) is NP-hard. However, we can apply spectral relaxation to obtain a closed-form solution. Dropping the sign function and imposing orthogonality constraints, we rewrite Eq. (7) as follows:

$$\begin{aligned}&\min \limits _{\mathbf W _x,\mathbf W _y}\left\| K(x)\mathbf W _x\left( K(y)\mathbf W _y\right) ^T-\mathbf S \right\| _F^2\\&s.t.\quad {\left\{ \begin{array}{ll} \mathbf W _x^TK(x)^TK(x)\mathbf W _x=n\mathbf I _c \\ \mathbf W _y^TK(y)^TK(y)\mathbf W _y=n\mathbf I _c \end{array}\right. } \end{aligned}$$
(8)

Expanding the Frobenius norm and dropping the terms that are constant under the constraints, we have:

$$\begin{aligned}&\max \limits _{\mathbf W _x,\mathbf W _y}tr\left( \mathbf W _x^TK(x)^T\mathbf S K(y)\mathbf W _y\right) \\&s.t.\quad {\left\{ \begin{array}{ll} \mathbf W _x^TK(x)^TK(x)\mathbf W _x=n\mathbf I _c \\ \mathbf W _y^TK(y)^TK(y)\mathbf W _y=n\mathbf I _c \end{array}\right. } \end{aligned}$$
(9)

In Eq. (9), \(\mathbf I _c\) denotes the identity matrix of size \(c\times c\), and the term \(K(x)^T\mathbf S K(y)\) can be regarded as weighing the relationship between the two modalities. If we define \(C_{xy}=K(x)^T\mathbf S K(y)\), \(C_{xx}=K(x)^TK(x)\) and \(C_{yy}=K(y)^TK(y)\), so that the constraints read \(\mathbf W _x^TC_{xx}\mathbf W _x=n\mathbf I _c\) and \(\mathbf W _y^TC_{yy}\mathbf W _y=n\mathbf I _c\), then problem (9) can be viewed as a generalized eigenvalue problem. Consequently, the optimal \(\mathbf W _x\) and \(\mathbf W _y\) can be obtained by eigen-decomposition.
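One way to solve the relaxed problem of Eq. (9) numerically is to whiten both sides with the constraint matrices \(K(x)^TK(x)\) and \(K(y)^TK(y)\) and take a singular value decomposition, which is equivalent to the generalized eigenvalue formulation. The NumPy sketch below follows that route; the function names, the random toy data and the small ridge term `reg` (added so that the constraint matrices are invertible) are assumptions of this example and are not specified in the paper.

```python
import numpy as np


def inv_sqrt(C):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs / np.sqrt(vals)) @ vecs.T


def solve_projections(Kx, Ky, S, c, reg=1e-6):
    """Maximize tr(Wx^T Kx^T S Ky Wy) under the constraints of Eq. (9)."""
    n = Kx.shape[0]
    Cxx = Kx.T @ Kx + reg * np.eye(Kx.shape[1])   # ridge term: an assumption
    Cyy = Ky.T @ Ky + reg * np.eye(Ky.shape[1])
    Cxy = Kx.T @ S @ Ky
    Rx, Ry = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, _, Vt = np.linalg.svd(Rx @ Cxy @ Ry)       # whitened cross term
    Wx = np.sqrt(n) * Rx @ U[:, :c]               # top-c left directions
    Wy = np.sqrt(n) * Ry @ Vt[:c].T               # top-c right directions
    return Wx, Wy


rng = np.random.default_rng(0)
Kx = rng.normal(size=(20, 4))             # toy kernel features, modality x
Ky = rng.normal(size=(20, 3))             # toy kernel features, modality y
S = np.sign(rng.normal(size=(20, 20)))    # toy similarity matrix with +1/-1 entries
Wx, Wy = solve_projections(Kx, Ky, S, c=2)
```

By construction the returned matrices satisfy the two constraints of Eq. (9) up to the ridge term, and the objective value equals n times the sum of the c largest singular values of the whitened cross term.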

Prior work has verified experimentally that orthogonality constraints do not help to produce discriminative hash codes [14]. Following the idea in [20], we therefore use a sequential optimization strategy to learn the hash functions. Assuming that each projection depends on the preceding ones, we learn the hash functions one bit at a time by defining a residue. The residue matrix \(\mathbf V _t\) is given by:

$$\begin{aligned} \mathbf V _t=\mathbf S -\sum _{k=1}^{t-1}\text {sgn}\left( K(x)\mathbf W _x^{(k)}\right) \text {sgn}\left( K(y)\mathbf W _y^{(k)}\right) ^T \end{aligned}$$
(10)

Then \(\mathbf C _{xy}\) can be computed by:

$$\begin{aligned} \begin{aligned} \mathbf C _{xy}^t&=K(x)^T\mathbf V _tK(y)\\&=K(x)^T\mathbf S K(y)-\sum _{k=1}^{t-1}K(x)^T\text {sgn}\left( K(x)\mathbf W _x^{(k)}\right) \text {sgn}\left( K(y)\mathbf W _y^{(k)}\right) ^TK(y)\\&=\mathbf C _{xy}^{(t-1)}-K(x)^T\text {sgn}\left( K(x)\mathbf W _x^{(t-1)}\right) \text {sgn}\left( K(y)\mathbf W _y^{(t-1)}\right) ^TK(y) \end{aligned} \end{aligned}$$

We rewrite Eq. (8) as follows:

$$\begin{aligned} \mathop {\min }\limits _{\mathbf W _x,\mathbf W _y}\left\| \left( K(x)\mathbf W _x^{(t)}\right) \left( K(y)\mathbf W _y^{(t)}\right) ^T-\mathbf V _t\right\| _F^2 \end{aligned}$$
(11)

Solving Eq. (11) for each bit t yields the projections \(\mathbf W _x\) and \(\mathbf W _y\) of the two modalities.
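The sequential strategy can be sketched as a loop that learns one projection pair per bit and then removes that bit's contribution from \(\mathbf C _{xy}\), following Eq. (10) and the incremental update above. The sketch below is a simplified reading of the procedure, not the authors' implementation: each bit is obtained from the leading singular pair of the whitened cross term, and the ridge term `reg`, the function names and the toy data are assumptions.

```python
import numpy as np


def sgn(u):
    """Sign function of Eq. (4): +1 where u > 0, otherwise -1."""
    return np.where(u > 0, 1.0, -1.0)


def inv_sqrt(C):
    """Inverse square root of a symmetric positive definite matrix."""
    vals, vecs = np.linalg.eigh(C)
    return (vecs / np.sqrt(vals)) @ vecs.T


def learn_sequential(Kx, Ky, S, n_bits, reg=1e-6):
    """Learn one projection pair per bit against the residue of Eq. (10)."""
    n = Kx.shape[0]
    Rx = inv_sqrt(Kx.T @ Kx + reg * np.eye(Kx.shape[1]))  # ridge: an assumption
    Ry = inv_sqrt(Ky.T @ Ky + reg * np.eye(Ky.shape[1]))
    Cxy = Kx.T @ S @ Ky                      # C_xy for the first bit
    Wx, Wy = [], []
    for _ in range(n_bits):
        U, _, Vt = np.linalg.svd(Rx @ Cxy @ Ry)
        wx = np.sqrt(n) * Rx @ U[:, 0]       # leading direction only
        wy = np.sqrt(n) * Ry @ Vt[0]
        hx, hy = sgn(Kx @ wx), sgn(Ky @ wy)  # binary codes of this bit
        # Incremental update: subtract this bit's contribution from C_xy.
        Cxy = Cxy - np.outer(Kx.T @ hx, Ky.T @ hy)
        Wx.append(wx)
        Wy.append(wy)
    return np.column_stack(Wx), np.column_stack(Wy)


rng = np.random.default_rng(1)
Kx = rng.normal(size=(30, 4))
Ky = rng.normal(size=(30, 3))
S = np.sign(rng.normal(size=(30, 30)))
Wx, Wy = learn_sequential(Kx, Ky, S, n_bits=3)
codes_x = sgn(Kx @ Wx)                       # n x k codes for modality x
```

Updating \(\mathbf C _{xy}\) with an outer product of two short vectors avoids forming the n-by-n residue matrix explicitly, which keeps the per-bit cost low when n is large.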

2.3 Optimizing the Weights of Multiple Kernels

The objective function is written as:

$$\begin{aligned}&\mathcal {L}(\mathbf S , \mathbf W _x, \mathbf W _y, \mu )=\frac{1}{2}\mu ^TF\mu \\&s.t.\quad \sum \limits _{m=1}^M \mu _m=1, \quad \mu _m\geqslant 0 \end{aligned}$$
(12)

where \(F=tr\left( \mathbf W _x^TK(x)^T\mathbf S K(y)\mathbf W _y\right) \). When \(\mathbf W _x\) and \(\mathbf W _y\) are fixed, Eq. (12) can be regarded as a quadratic programming problem in \(\mu \), so the kernel weights and the projection matrices can be updated alternately.
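A simplex-constrained quadratic program of the form of Eq. (12) can be handled, for example, by projected gradient descent. The sketch below is only illustrative and rests on assumptions the text does not settle: F is taken to be a symmetric positive semidefinite M-by-M matrix given as nested lists, the objective is minimized, and the step size and iteration count are arbitrary. The function names are hypothetical.

```python
def project_to_simplex(v):
    """Euclidean projection of v onto {mu : sum(mu) = 1, mu >= 0}."""
    u = sorted(v, reverse=True)
    cumulative, theta = 0.0, 0.0
    for i, value in enumerate(u, start=1):
        cumulative += value
        t = (cumulative - 1.0) / i
        if value - t > 0:        # holds for a prefix of the sorted values
            theta = t            # keep the threshold of the last valid index
    return [max(x - theta, 0.0) for x in v]


def optimize_weights(F, steps=500, lr=0.1):
    """Projected gradient descent on 0.5 * mu^T F mu over the simplex."""
    M = len(F)
    mu = [1.0 / M] * M                       # start from uniform weights
    for _ in range(steps):
        # Gradient of 0.5 * mu^T F mu is F mu when F is symmetric.
        grad = [sum(F[i][j] * mu[j] for j in range(M)) for i in range(M)]
        mu = project_to_simplex([m - lr * g for m, g in zip(mu, grad)])
    return mu


F = [[1.0, 0.0], [0.0, 2.0]]                 # toy matrix, an assumption
mu = optimize_weights(F)
```

For this toy F the minimizer on the simplex is (2/3, 1/3), which shows how the learned weights can depart from the uniform allocation. A dedicated QP solver could be substituted for the loop.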

The overall algorithm is summarized in Algorithm 1.


3 Experiments

In this section, we conduct experiments on two benchmark datasets to verify the effectiveness of our approach.

3.1 Datasets

We use the Wiki dataset [8] and the NUS-WIDE dataset [1].

The Wiki dataset contains 2866 image-text pairs. Each image is represented by a 1000-dimensional bag-of-visual-words SIFT histogram, and each text is represented by an index vector of the top 5000 most frequent tags. There are 10 categories in the Wiki dataset.

The NUS-WIDE dataset contains 269648 images collected from Flickr. Following the experimental protocol in [12], we choose a subset comprising the 10 most frequently used classes. Each image is represented by a 500-dimensional bag-of-visual-words SIFT histogram, and each text is represented by a bag-of-words feature vector over the top 1000 most frequent tags. From this subset, we randomly choose 5000 image-text pairs as the training set, and randomly choose 1866 image-text pairs from the remaining documents as the test set. Table 1 shows the details of the evaluated datasets in our experiments.

Table 1. The details of the evaluated datasets

3.2 Experimental Setup

We perform two cross-modal retrieval tasks on each of the NUS-WIDE and Wiki datasets, i.e., ‘img to text’ and ‘text to img’. We compare MKSH with six state-of-the-art cross-modal hashing methods, i.e., LCMH [25], LSSH [23], SCM-Seq [20], CMFH [2], RCMH [9], and KSH-CV [24]. We use the mean Average Precision (mAP) to evaluate the retrieval performance. The average precision is defined as \(AP=\frac{1}{N}\sum _{i=1}^{R}P(i)\times \delta (i)\), where R is the number of retrieved documents, P(i) is the precision of the top i retrieved documents, and \(\delta (i)\) is an indicator function with \(\delta (i)=1\) if the i-th ranked result is a relevant instance and \(\delta (i)=0\) otherwise. N is the number of relevant instances in the training set. The mAP is the mean of the AP values over all queries.
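The AP formula above can be made concrete with a short Python sketch. One detail is an assumption of this example and differs from the wording of the definition: N is taken as the number of relevant results among the top R, which is a common implementation choice. The function names are hypothetical.

```python
def average_precision(relevance, R):
    """AP over the top R results; relevance[i] is 1 if rank i+1 is relevant."""
    top = relevance[:R]
    n_relevant = sum(top)            # N: relevant results in the top R (assumption)
    if n_relevant == 0:
        return 0.0
    hits, total = 0, 0.0
    for rank, delta in enumerate(top, start=1):
        hits += delta
        total += (hits / rank) * delta   # P(rank) * delta(rank)
    return total / n_relevant


def mean_average_precision(relevance_lists, R):
    """mAP: the mean of AP over all queries."""
    aps = [average_precision(r, R) for r in relevance_lists]
    return sum(aps) / len(aps)


ap = average_precision([1, 0, 1, 0], R=4)
```

In the usage line, the relevant results sit at ranks 1 and 3, so the two contributing precisions are 1 and 2/3 and the AP is their mean, 5/6.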

In our experiments, we use three kernel functions: the Gaussian RBF kernel \(K(x,y)=\exp (-\frac{\left\| x-y\right\| ^2}{2\varepsilon ^2})\), the sigmoid kernel \(K(x, y)=\tanh (\alpha x^Ty+c)\), and the exponential kernel \(K(x, y)=\exp (-\frac{\left\| x-y\right\| }{2\lambda ^2})\). We set \(R=50\).
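The three kernel functions translate directly into code. The following plain-Python sketch takes vectors as sequences of floats; the function names are hypothetical, and the default parameter values are those reported in Section 3.4.

```python
import math


def rbf_kernel(x, y, eps=0.6):
    """Gaussian RBF kernel: exp(-||x - y||^2 / (2 eps^2))."""
    sq = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq / (2.0 * eps ** 2))


def sigmoid_kernel(x, y, alpha=9.0, c=-0.1):
    """Sigmoid kernel: tanh(alpha * <x, y> + c)."""
    dot = sum(a * b for a, b in zip(x, y))
    return math.tanh(alpha * dot + c)


def exponential_kernel(x, y, lam=0.8):
    """Exponential kernel: exp(-||x - y|| / (2 lam^2))."""
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))
    return math.exp(-dist / (2.0 * lam ** 2))


x, y = [0.0, 0.0], [0.6, 0.0]
values = (rbf_kernel(x, y), sigmoid_kernel(x, y), exponential_kernel(x, y))
```

The RBF and exponential kernels differ only in whether the distance is squared, so the exponential kernel decays more slowly for distant points and more quickly for nearby ones.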

3.3 Experimental Results

We compare the mAP values of all the methods on the Wiki and NUS-WIDE datasets, with code lengths ranging from 16 to 64 bits. The detailed results are reported in Tables 2 and 3. Figure 1 shows the precision-recall curves of the two query tasks on the Wiki dataset. We also compare the performance of our method with multiple kernels and with a single kernel, and the results are plotted in Figs. 2 and 3.

Table 2. mAP results on Wiki dataset.

Fig. 1. PR-curves on the NUS-WIDE dataset varying code length

Table 3. mAP results on NUS-WIDE dataset.

Fig. 2. Compare mAP on multiple kernels and single kernels (Wiki)

Fig. 3. Compare mAP on multiple kernels and single kernels (NUS-WIDE)

Fig. 4. The effect of landmarks on MKSH

We can draw two conclusions from these experimental results. First, MKSH outperforms the compared methods. Second, MKSH keeps a consistent advantage as the hash codes become longer, which can be attributed to its sequential optimization strategy.

From Fig. 1 we also make two observations. First, MKSH outperforms the compared methods. Second, RCMH and LCMH perform poorly, which suggests that they are not well suited to large-scale cross-modal retrieval.

3.4 Parameters Sensitivity Study

According to our experimental study, the four kernel parameters \(\varepsilon \), \(\alpha \), c and \(\lambda \) have only a slight influence on the performance, so we set \(\varepsilon =0.6\), \(\alpha =9\), \(c=-0.1\) and \(\lambda =0.8\). The kernel matrix depends on the number of landmarks. Figure 4 shows the performance when varying the number of landmarks on the Wiki and NUS-WIDE datasets. We observe that the precision remains almost the same as the number of landmarks varies, so the number of landmarks is not a sensitive parameter.

4 Conclusions

In this paper, we have proposed a novel algorithm for cross-modal hashing named MKSH. Multi-kernel learning and a sequential optimization strategy are used to achieve better performance. Experimental results on the Wiki and the NUS-WIDE datasets show that our method outperforms several state-of-the-art methods.