1 Introduction

In the era of big data, the Internet has been inundated with millions of images exhibiting sophisticated appearance variations. Traditional linear search is clearly not a satisfactory choice for retrieving relevant images from large data sets because of its high computational cost. Owing to their efficiency in both computation and storage, hashing methods [2, 9, 13] have become a powerful approach to approximate nearest neighbor (ANN) search in this setting. They learn a set of projection functions that map the original high-dimensional features to lower-dimensional compact binary descriptors, under the constraint that similar samples are mapped to nearby binary codes. By computing the Hamming distance between two binary descriptors via XOR bitwise operations, one can quickly find the images similar to a query.
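
As a concrete illustration, the following minimal sketch (Python, not part of the original paper) compares two packed binary codes with a single XOR and a popcount, which is what makes Hamming ranking so cheap:

```python
def hamming_distance(code_a: int, code_b: int) -> int:
    """Hamming distance between two binary codes packed into integers."""
    return bin(code_a ^ code_b).count("1")

# Two hypothetical 8-bit codes that differ in exactly two bit positions.
print(hamming_distance(0b10110010, 0b10010011))  # -> 2
```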

Fig. 1. Pipeline of the proposed BMDH method. When an image is fed into the pre-trained deep network, it is eventually encoded into one binary vector. When a batch of m images passes through the network, the output of the fully-connected layer is an \(n\times m \) real-valued matrix, where n is the length of the generated binary codes. To learn effective yet discriminative binary codes, three restrictions are imposed at the top layer of the network: (1) similarity-preserving mapping, (2) maximum variance on all projected dimensions, (3) balanced variance on each projected dimension.

Generally speaking, hashing methods fall into two categories: data-dependent and data-independent. Data-independent methods do not exploit the data distribution and generate hash functions through random projections; as a result, they need relatively long binary codes to obtain comparable accuracy [2]. Data-dependent methods utilize the data distribution and thus achieve more accurate results. These methods can be further divided into three streams. The first stream consists of unsupervised methods, which use only unlabeled data to learn hash functions; representative methods include [2, 13]. The other two streams are supervised and semi-supervised methods, which both incorporate label information when generating hash codes; [6, 9] are typical examples. Compared with unsupervised methods, supervised methods usually need fewer bits to achieve competitive performance owing to the use of supervised information. In this paper, we focus on supervised methods.

Conventional hashing methods exploit hand-crafted features to learn hash codes, which do not necessarily preserve the semantic similarities of images during learning. In light of the outstanding performance of convolutional neural networks (CNNs) in vision tasks, some existing hashing methods use CNNs to capture the rich semantic information in images. However, these methods, including some deep hashing methods [8, 14], merely treat the CNN architecture as a feature extractor followed by a separate procedure for generating binary codes. That is, image representation and hash-function learning are two mutually independent stages. Consequently, the feature representation may not be tailored to the hashing procedure, which undermines retrieval performance. Therefore, other deep-learning-based hashing methods [4, 5, 7, 16, 17] have been proposed to simultaneously learn the semantic features and the binary codes, and they outperform the previous hashing methods.

Variance is a commonly analysed statistic, and the amount of information carried by each projected dimension is directly proportional to its variance. This suggests two considerations. On one hand, maximizing the total variance over all dimensions amplifies the information capacity of the binary codes. On the other hand, balancing the variance across dimensions allows each dimension to be encoded with the same number of bits. However, most existing hashing methods, including the current state-of-the-art method DPSH [5], have not taken these aspects into consideration. In this paper, a novel deep hashing method dubbed balanced and maximum variance deep hashing (BMDH) is proposed to perform simultaneous feature learning and hash-code learning. The pipeline of our method is shown in Fig. 1. Images whose supervised information is given by pairwise labels are used as the training inputs. The similarity matrix constructed from the label information indicates whether two images are similar or not. To acquire more effective hash codes, we impose three criteria on the top layer of the deep network to form a joint objective function. Stochastic gradient descent (SGD) is used to update all the parameters at once. To handle the discrete optimization problem, we relax the discrete outputs to continuous real values, while a regularizer pushes them toward the desired discrete values. Given a query image, we propagate it through the network and quantize the deep features into the desired binary codes.

The main contributions of the proposed method are as follows:

  • A joint objective function is formed. We impose three restrictions on the top layer of the network to constitute a joint objective function. The first is that the learned binary codes should preserve the local data structure of the original space, namely, two similar samples should be encoded into nearby binary codes. The second is that the total variance over all projected dimensions is maximized so as to capture more information, which benefits retrieval performance. The last is that the variance of each dimension should be as equal as possible so that every dimension can be encoded with the same number of bits.

  • An end-to-end model is constructed to perform feature representation and hash-function learning simultaneously. In this framework, the two stages can give feedback to each other. The optimal parameters are acquired using the standard back-propagation algorithm with stochastic gradient descent (SGD) despite the sophisticated objective function.

  • The important variance information of the projected dimensions is the focus of our method. To the best of our knowledge, this is the first deep hashing method in which maximizing the total variance over all projected dimensions and balancing the variance of each dimension are considered simultaneously.

The rest of the paper is organized as follows. In Sect. 2, we discuss the related works briefly. The proposed BMDH method is described in Sect. 3. The experimental results are presented in Sect. 4. Finally, Sect. 5 concludes the whole paper.

2 Related Work

Conventional Hashing: The earliest hashing research concentrated on data-independent methods, of which the LSH methods [1] are representative. The basic idea of LSH is that two adjacent data points in the original space are projected to nearby binary codes with high probability, while the probability of projecting two dissimilar data points into the same hash bucket is small. The hash functions are produced via random projections. However, LSH methods usually demand relatively long codes to achieve comparable performance and thus occupy more storage space. Data-dependent methods learn more effective binary codes from training data. Among them, unsupervised methods learn hash functions using unlabeled data; typical methods include Spectral Hashing (SH) [13] and Iterative Quantization (ITQ) [2]. Semi-supervised and supervised hashing methods generate binary codes by employing label information; representative methods include Supervised Discrete Hashing (SDH) [10], Supervised Hashing with Kernels (KSH) [9], Fast Supervised Hashing (FastH) [6], Latent Factor Hashing (LFH) [15] and Sequential Projection Learning for Hashing (SPLH) [11].

Deep Hashing: Hashing methods based on CNN architectures have been attracting more and more attention, especially since the outstanding performance of deep learning was demonstrated by Krizhevsky et al. on ImageNet. Among recent studies, CNNH [14] learns the hash functions and the feature representation from binary codes obtained from the pairwise labels, but it does not learn the two simultaneously. DLBH [7] learns binary codes in a point-wise manner by employing a hidden layer as features. [4, 16, 17] learn compact binary codes using triplet samples. [5, 8] learn binary codes with elaborately designed objective functions.

Balance and Maximize Variance: Variance is a significant statistic that, to some degree, reflects the amount of information. It is therefore necessary to consider the variance of the projected dimensions. Liong et al. [8] use the criterion of maximizing the variance of the learned binary vectors at the top layer of the network, but their model is not end-to-end. Isotropic Hashing (IsoHash) [3] learns projection functions that produce projected dimensions with isotropic variances. Iterative Quantization (ITQ) [2] seeks an optimized rotation of the PCA-projected data that minimizes the quantization error, which also effectively balances the variance of each dimension.

In summary, most existing hashing methods try to preserve the data structure of the original space, but they seldom consider both maximizing the total variance over all projected dimensions and balancing the variance of each dimension, or they take only one aspect into account. In this paper, we propose a novel deep architecture that learns similarity-preserving binary codes while simultaneously maximizing the total variance over all dimensions and balancing the variance of each dimension.

3 Approach

The purpose of hashing is to learn a series of projection functions h(x) that map u-dimensional real-valued features to v-dimensional binary codes \(b\in \{-1,1\}^v\) (\(u\gg v\)) while preserving the distance relationships of the data points in the original space. However, most existing hashing methods do not consider the variance of the projected dimensions, which is an important statistic. To learn compact and discriminative codes, we propose a novel hashing method called BMDH. Three criteria are imposed on the top layer to form a joint loss function: (1) the distance relationships between data points in the Euclidean space are effectively preserved in the corresponding Hamming space, (2) the total variance over all projected dimensions is maximized, (3) the variance of each dimension is as equal as possible. In the following, we first present the proposed model and then describe how to optimize it.

3.1 Notations

Suppose there are M training data points. \(\varvec{x}_i\in \mathbb {R}^d(1\le i\le M)\) is the ith data point, and the matrix form is \(\varvec{X}\in \mathbb {R}^{M\times d}\). \(\varvec{B}\in \{-1,1\}^{M\times n}=\{\varvec{b}_i\}_{i=1}^M\) denotes the binary code matrix, where \(\varvec{b}_i=sgn(h(\varvec{x}_i))\in \{-1,1\}^n\) denotes the n-bit binary vector of \(\varvec{x}_i\), with \(sgn(v)=1\) if \(v>0\) and \(-1\) otherwise. \(\varvec{S}=\{s_{ij}\}\) is the similarity matrix, where \(s_{ij}\in \{0,1\}\) denotes the similarity label between a pair of points: \(s_{ij}=1\) means the two data points \(\varvec{x}_i\) and \(\varvec{x}_j\) are similar and their Hamming distance should be low, while \(s_{ij}=0\) means they are dissimilar and their Hamming distance should be high. \(\varvec{\varTheta }\) denotes all the parameters of the feature learning part.
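
To make the notation concrete, here is a minimal sketch (Python with numpy; toy sizes and a stand-in linear hash function, whereas in the paper h(x) is realized by the deep network) of how X, B and S relate:

```python
import numpy as np

# Hypothetical toy setup illustrating the notation (not the paper's data).
M, d, n = 6, 16, 8                      # samples, feature dim, code length
X = np.random.randn(M, d)               # X in R^{M x d}
labels = np.array([0, 0, 1, 1, 2, 2])   # toy class labels

# Pairwise similarity matrix S: s_ij = 1 if the two samples share a label.
S = (labels[:, None] == labels[None, :]).astype(int)

# A stand-in linear hash function h(x) = x W; the paper learns h(x) end-to-end.
W = np.random.randn(d, n)
B = np.where(X @ W > 0, 1, -1)          # b_i = sgn(h(x_i)) in {-1, 1}^n
print(B.shape, S.shape)                 # (6, 8) (6, 6)
```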

3.2 Proposed Model

Corresponding to the three restrictions, the overall objective function is composed of three parts. The first part aims to preserve the similarities between data pairs; that is, the Hamming distance is minimized for similar pairs and maximized for dissimilar pairs. The second part maximizes the total variance over all projected dimensions to boost the information capacity of the binary codes. The last part seeks to balance the variance of each dimension so that every dimension is allocated the same number of bits.

$$\begin{aligned} \min _{\varvec{W}} J(\varvec{W})&=J_1(\varvec{W})-\lambda _1J_2(\varvec{W})+\lambda _2J_3(\varvec{W}) \nonumber \\&=-\sum _{s_{ij}\in \varvec{S}}\big (s_{ij}\varPhi _{ij}-\log (1+e^{\varPhi _{ij}})\big )+\rho \sum _{i=1}^m\Vert \varvec{b}_i-\varvec{g}_i\Vert _2^2\nonumber \\&\quad -\lambda _1\Big (\sum _{i=1}^n{\text {E}}(\Vert \varvec{w}_i^T\varvec{H}+\varvec{p}_i\Vert _2^2)-\sum \limits _{i=1}^n\Vert {\text {E}}(\varvec{w}_i^T\varvec{H}+\varvec{p}_i)\Vert _2^2\Big ) \\&\quad +\lambda _2\Big (\frac{1}{n}\sum _{i=1}^nu_i^2-\big [\frac{1}{n}\sum _{i=1}^nu_i\big ]^2\Big )\nonumber \end{aligned}$$
(1)

where \(\rho \) is the regularization parameter, and \(\lambda _1\) and \(\lambda _2\) balance the different objectives. \(\Vert \cdot \Vert _2\) is the L2-norm of a vector.

To construct an end-to-end model, we set:

$$\begin{aligned} \varvec{G}=\varvec{W}^T\varvec{H}+\varvec{Q} \end{aligned}$$
(2)

where \(\varvec{H}=\{\varvec{h}_i\}_{i=1}^m\in \mathbb {R}^{4096\times m} \) denotes the output matrix of the full7 layer for a batch of m data points, with \(\varvec{h}_i\in \mathbb {R}^{4096\times 1} \) the output vector of data point \(\varvec{x}_i\). \(\varvec{W}=\{\varvec{w}_i\}_{i=1}^n\in \mathbb {R}^{4096\times n}\) is the projection matrix of the full8 layer, where \(\varvec{w}_i\) is the ith projection vector, and \(\varvec{Q}\in \mathbb {R}^{n\times m}\) is the bias matrix. In other words, we connect the hash-code learning part to the feature learning part through a fully-connected layer, so that the weight \(\varvec{W}\) and bias \(\varvec{Q}\) of the hash-code layer are updated together with all the other parameters of the feature learning layers.
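
As an illustration, the sketch below (Python with numpy; the sizes and initialization are assumptions made for the example, while the names W, H, Q and G follow the paper's notation) wires Eq. (2) and the subsequent quantization step together:

```python
import numpy as np

m, n = 4, 12                            # toy batch size and code length
H = np.random.randn(4096, m)            # outputs of the full7 layer (assumed given)
W = 0.01 * np.random.randn(4096, n)     # projection matrix of the hash layer
q = np.zeros((n, 1))                    # bias vector, repeated over the batch
Q = np.tile(q, (1, m))                  # bias matrix Q in R^{n x m}

G = W.T @ H + Q                         # real-valued outputs, Eq. (2)
B = np.where(G > 0, 1, -1)              # quantized binary codes
print(G.shape, B.shape)                 # (12, 4) (12, 4)
```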

To better understand the joint objective function, we describe each of its parts in detail below.

3.3 Preserving the Similarities

The proposed model learns non-linear projections that map the input data points into binary codes and simultaneously preserve the distance relationship of points in the original space.

We define the likelihood of the similarity matrix as in [15]:

$$\begin{aligned} p(\varvec{S}|\varvec{B})=\prod \limits _{s_{ij}\in \varvec{S}}p(s_{ij}|\varvec{B}) \end{aligned}$$

while

$$\begin{aligned} p(s_{ij}|\varvec{B})= {\left\{ \begin{array}{ll} \frac{1}{1+e^{-\varGamma _{ij}}},&{} s_{ij}=1\\ \frac{e^{-\varGamma _{ij}}}{1+e^{-\varGamma _{ij}}},&{} s_{ij}=0 \end{array}\right. } \end{aligned}$$

where

$$\begin{aligned} \varGamma _{ij}=\frac{1}{2}\varvec{b}_i^T\varvec{b}_j \end{aligned}$$
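
In other words, the pairwise likelihood is a logistic function of the inner product of the two codes; a minimal sketch (Python, with hypothetical inputs) of \(p(s_{ij}|\varvec{B})\):

```python
import numpy as np

def pair_probability(b_i: np.ndarray, b_j: np.ndarray, s_ij: int) -> float:
    """p(s_ij | B) as a logistic function of Gamma_ij = 0.5 * b_i^T b_j."""
    gamma = 0.5 * float(b_i @ b_j)
    p_similar = 1.0 / (1.0 + np.exp(-gamma))
    return p_similar if s_ij == 1 else 1.0 - p_similar

# Identical 8-bit codes make the "similar" label very likely.
b = np.array([1, -1, 1, 1, -1, 1, -1, 1])
print(pair_probability(b, b, 1))  # close to 1
```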

Taking the negative log-likelihood of the observed similarity labels \(\varvec{S}\) as in [5], we get a primary model:

$$\begin{aligned} \min _{\varvec{W}}J_1(\varvec{W})&=-\log p(\varvec{S}|\varvec{B})=-\sum _{s_{ij}\in \varvec{S}}\log p(s_{ij}|\varvec{B}) \\&=-\sum _{s_{ij}\in \varvec{S}}\big [s_{ij}\log \frac{1}{1+e^{-\varGamma _{ij}}}+(1-s_{ij})\log \frac{e^{-\varGamma _{ij}}}{1+e^{-\varGamma _{ij}}}\big ] \\&=-\sum _{s_{ij}\in \varvec{S}}(s_{ij}\varGamma _{ij}-\log (1+e^{\varGamma _{ij}})) \end{aligned}$$

This model is reasonable: it makes similar pairs have a low Hamming distance and dissimilar pairs a high one. To handle the discrete optimization problem, we set:

$$\begin{aligned} \varvec{g}_i=\varvec{W}^T\varvec{h}_i+\varvec{q} \end{aligned}$$

where \(\varvec{q}\in \mathbb {R}^{n\times 1}\) is the bias vector.

Then, we relax the discrete values \(\{\varvec{b}_i\}_{i=1}^m\) to continuous real values \(\{\varvec{g}_i\}_{i=1}^m\) as in [5]. The final model can be reformulated as:

$$\begin{aligned} \min _{\varvec{W}}J_1(\varvec{W}) = -\sum _{s_{ij}\in \varvec{S}}(s_{ij}\varPhi _{ij}-\log (1+e^{\varPhi _{ij}}))+\rho \sum _{i=1}^m\Vert \varvec{b}_i-\varvec{g}_i\Vert _2^2 \end{aligned}$$
(3)

where \(\varPhi _{ij}=\frac{1}{2}\varvec{g}_i^T\varvec{g}_j\) and \(\rho \) is the regularization parameter that pushes the relaxed outputs toward the desired discrete values.
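
A compact sketch of Eq. (3) (Python with numpy; G, B and S are assumed to be given as in the notation above, and logaddexp is used only for numerical stability):

```python
import numpy as np

def similarity_loss(G: np.ndarray, B: np.ndarray, S: np.ndarray, rho: float) -> float:
    """Eq. (3): negative log-likelihood over pairs plus the quantization
    regularizer. G and B are n x m; S is the m x m similarity matrix."""
    Phi = 0.5 * (G.T @ G)                            # Phi_ij = 0.5 * g_i^T g_j
    nll = -(S * Phi - np.logaddexp(0.0, Phi)).sum()  # -sum(s_ij*Phi_ij - log(1+e^Phi_ij))
    reg = rho * ((B - G) ** 2).sum()                 # rho * sum_i ||b_i - g_i||_2^2
    return nll + reg
```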

3.4 Maximizing the Variance

The variances of the different projected dimensions are not identical. The larger the variance of a projected dimension, the more information the corresponding bits of the binary codes convey. To expand the information capacity of the binary codes, it is therefore natural to maximize the total variance over all dimensions. First, we compute the variance of each dimension; next, we sum the computed variances; last, we maximize this sum so that the information contained in the binary codes is maximized.

The maximum-variance view is also the basic idea of Principal Component Analysis, which minimizes the reconstruction error. Here, we set:

$$\begin{aligned} \varvec{z}_i=\varvec{w}_i^T\varvec{H}+\varvec{p}_i \end{aligned}$$

where \(\varvec{p}_i\in \mathbb {R}^{1\times m}\) is a bias vector.

Then, we get the following formulation:

$$\begin{aligned} \max _{\varvec{W}}J_2(\varvec{W})=\sum _{i=1}^n{\text {var}}(\varvec{z}_i) =\sum _{i=1}^n{\text {E}}(\Vert \varvec{w}_i^T\varvec{H}+\varvec{p}_i\Vert _2^2) -\sum \limits _{i=1}^n\Vert {\text {E}}(\varvec{w}_i^T\varvec{H}+\varvec{p}_i)\Vert _2^2 \end{aligned}$$
(4)
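
Since each \(\varvec{z}_i\) is a row of batch outputs, \(J_2\) is simply the sum of the per-dimension (biased, 1/m) variances; a minimal sketch, assuming Z is the \(n\times m\) matrix stacking the \(\varvec{z}_i\):

```python
import numpy as np

def total_variance(Z: np.ndarray) -> float:
    """Eq. (4): sum over the n projected dimensions of the per-dimension
    variance, estimated over the batch (Z is n x m, rows are z_i)."""
    return float(Z.var(axis=1).sum())   # var(z_i) = E(z_i^2) - [E(z_i)]^2 per row
```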

3.5 Balancing the Variance

Typically, different projected dimensions have different variances, and dimensions with larger variances carry more information, so it is unreasonable to allocate the same number of bits to every dimension. To solve this problem, we balance the variances of the projected dimensions, i.e., make them as equal as possible, so that the same number of bits can be employed for each.

Define \(u_i={\text {var}}(\varvec{z}_i)\) as the variance of each dimension and denote the variance vector by \(\varvec{u}=(u_1,u_2,\ldots ,u_n)\). To make the \(u_i (i=1,2,\ldots ,n)\) as equal as possible, we minimize \({\text {var}}(\varvec{u})\), the variance of \(\varvec{u}\), which gives the following formulation:

$$\begin{aligned} \min _{\varvec{W}}J_3(\varvec{W}) ={\text {var}}(\varvec{u})={\text {E}}(\varvec{u}^2)-[{\text {E}}(\varvec{u})]^2 =\frac{1}{n}\sum _{i=1}^nu_i^2-[\frac{1}{n}\sum _{i=1}^nu_i]^2 \end{aligned}$$
(5)

Once the variance of \(\varvec{u}\) is minimized, the variance on each projected dimension is balanced.
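
Continuing the sketch above (same assumed \(n\times m\) matrix Z), the balance term is just the variance of the vector of per-dimension variances:

```python
import numpy as np

def variance_balance_loss(Z: np.ndarray) -> float:
    """Eq. (5): the variance of the per-dimension variances u_i. Driving it
    to zero makes all projected dimensions carry equal variance."""
    u = Z.var(axis=1)          # u_i = var(z_i), one value per dimension
    return float(u.var())      # var(u) = E(u^2) - [E(u)]^2
```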

3.6 Optimization

The back-propagation algorithm with stochastic gradient descent (SGD) is used to obtain the optimal parameters. First, we compute the outputs of the network with the given parameters and quantize them:

$$\begin{aligned} \varvec{B}=sgn(\varvec{G})=sgn(\varvec{W}^T\varvec{H}+\varvec{Q}) \end{aligned}$$

Then, we derive the gradient of the joint objective function (1) with respect to \(\varvec{G}\). Since the objective function is composed of three parts, we compute the derivatives of the three parts separately.

  • The matrix form of Eq. (3) can be written as:

    $$\begin{aligned} J_1(\varvec{W})=-(\varvec{S}\varvec{\varPhi }-\log (1+e^{\varvec{\varPhi }}))+\rho \Vert \varvec{B}-\varvec{G}\Vert _2^2 \end{aligned}$$

where \(\varvec{\varPhi }=\frac{1}{2}\varvec{G}^T\varvec{G}\).

The derivative of \(J_1(\varvec{W})\) with respect to \(\varvec{G}\) is computed as:

$$\begin{aligned} \frac{\partial J_1(\varvec{W})}{\partial \varvec{G}}=\varvec{G}(\frac{1}{1+e^{-\varvec{\varPhi }}}-\varvec{S})-2\rho (\varvec{B}-\varvec{G}) \end{aligned}$$
  • Eq. (4) can be rewritten as:

    $$\begin{aligned} J_2(\varvec{W})&\!=\!\frac{1}{m}\sum _{i=1}^n\sum _{j=1}^m\Vert \varvec{w}_i^T\varvec{h}_j\!+\!p'_{ij}\Vert _2^2\!-\!\sum \limits _{i=1}^n\Vert {\text {E}}(\varvec{w}_i^T\varvec{H}\!+\!\varvec{p}_i)\Vert _2^2\nonumber \\&\!=\!\frac{1}{m}tr((\varvec{W}^T\varvec{H}\!+\!\varvec{Q})^T(\varvec{W}^T\varvec{H}\!+\!\varvec{Q}))\!-\!\sum \limits _{i=1}^n\Vert {\text {E}}(\varvec{z}_i)\Vert _2^2 \end{aligned}$$

where \(\varvec{h}_j\in \mathbb {R}^{4096\times 1}\) denotes the outputs of the full7 layer associated with the jth data point, \(p'_{ij}\) is the bias term related to the ith weight and the jth data point.

The derivative of \(J_2(\varvec{W})\) with respect to \(\varvec{G}\) is computed as:

$$\begin{aligned} \frac{\partial J_2(\varvec{W})}{\partial \varvec{G}}&=\frac{2}{m}(\varvec{W}^T\varvec{H}+\varvec{Q})-\frac{\partial \sum \limits _{i=1}^n\Vert {\text {E}}(\varvec{z}_i)\Vert _2^2}{\partial \varvec{G} } \\&=\frac{2}{m}\varvec{G}-\Big (\frac{\partial \Vert {\text {E}}(\varvec{z}_1)\Vert _2^2}{\partial \varvec{z}_1},\frac{\partial \Vert {\text {E}}(\varvec{z}_2)\Vert _2^2}{\partial \varvec{z}_2},\ldots ,\frac{\partial \Vert {\text {E}}(\varvec{z}_n)\Vert _2^2}{\partial \varvec{z}_n}\Big )^T \\&=\frac{2}{m}\varvec{G}-\frac{2}{m}\left[ \begin{array}{cccc} {\text {E}}(\varvec{z}_1) &{} {\text {E}}(\varvec{z}_1) &{} \cdots &{} {\text {E}}(\varvec{z}_1)\\ {\text {E}}(\varvec{z}_2) &{} {\text {E}}(\varvec{z}_2) &{} \cdots &{} {\text {E}}(\varvec{z}_2)\\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {\text {E}}(\varvec{z}_n) &{} {\text {E}}(\varvec{z}_n) &{} \cdots &{} {\text {E}}(\varvec{z}_n)\\ \end{array} \right] _{n\times m}\\&=\varvec{G}' \end{aligned}$$
  • Eq. (5) can be reformulated as:

    $$\begin{aligned} J_3(\varvec{W})=\frac{1}{n}\sum _{i=1}^n[{\text {var}}(\varvec{z}_i)]^2-[\frac{1}{n}\sum _{i=1}^n{\text {var}}(\varvec{z}_i)]^2 \end{aligned}$$

The derivative of \(J_3(\varvec{W})\) with respect to \(\varvec{G}\) is computed as:

$$\begin{aligned} \frac{\partial J_3(\varvec{W})}{\partial \varvec{G}}&\!=\!\frac{1}{n}\frac{\partial (\sum \limits _{i=1}^n[{\text {var}}(\varvec{z}_i)]^2)}{\partial \varvec{G}}\!-\!\frac{\partial [\frac{1}{n}\sum \limits _{i=1}^n{\text {var}}(\varvec{z}_i)]^2}{\partial \varvec{G}} \\&\!=\!\frac{2}{n}\!({\text {var}}(\varvec{z}_1)\frac{\partial {\text {var}}(\varvec{z}_1)}{\partial \varvec{z}_1}\!,\!{\text {var}}(\varvec{z}_2)\frac{\partial {\text {var}} (\varvec{z}_2)}{\partial \varvec{z}_2}\!,\!\ldots \!,\!{\text {var}}(\varvec{z}_n)\frac{\partial {\text {var}}(\varvec{z}_n)}{\partial \varvec{z}_n})^T\! \\&\quad \!-\!\frac{2}{n}\sum \limits _{i=1}^n{\text {var}}(\varvec{z}_i)\!\cdot \!\frac{1}{n}(\frac{\partial {\text {var}}(\varvec{z}_1)}{\partial \varvec{z}_1}\!,\!\frac{\partial {\text {var}} (\varvec{z}_2)}{\partial \varvec{z}_2}\!,\!\ldots \!,\!\frac{\partial {\text {var}}(\varvec{z}_n)}{\partial \varvec{z}_n})^T \\&\!=\!\frac{4}{mn}({\text {var}}(\varvec{z}_1)\cdot (\varvec{z}_1-{\text {E}}(\varvec{z}_1)),{\text {var}}(\varvec{z}_2)\cdot (\varvec{z}_2-{\text {E}}(\varvec{z}_2)) ,\ldots ,{\text {var}}(\varvec{z}_n)\!\cdot \!(\varvec{z}_n\!-\!{\text {E}}(\varvec{z}_n)))^T \\&\quad \!-\!\frac{4}{mn^2}\sum \limits _{i=1}^n{\text {var}}(\varvec{z}_i)\!\cdot \!(\varvec{z}_1\!-\!{\text {E}}(\varvec{z}_1),\varvec{z}_2\!-\!{\text {E}}(\varvec{z}_2),\!\ldots \!,\varvec{z}_n\!-\!{\text {E}}(\varvec{z}_n))^T \\&=\varvec{G}'' \end{aligned}$$
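
The closed forms \(\varvec{G}'\) and \(\varvec{G}''\) can be sanity-checked against finite differences; the short sketch below (Python with numpy, random toy sizes, biased 1/m variance as used throughout) is such a check and is not part of the original paper:

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 5, 7
G = rng.standard_normal((n, m))             # stands for the n x m layer output

def J2(G):
    return G.var(axis=1).sum()              # Eq. (4), biased (1/m) variances

def J3(G):
    u = G.var(axis=1)
    return u.var()                          # Eq. (5), variance of the variances

mean = G.mean(axis=1, keepdims=True)        # E(z_i) per row
u = G.var(axis=1)                           # var(z_i) per row
G_prime = (2.0 / m) * (G - mean)                                    # closed-form dJ2/dG
G_dprime = (4.0 / (m * n)) * (u - u.mean())[:, None] * (G - mean)   # closed-form dJ3/dG

def numeric_grad(f, G, eps=1e-6):
    g = np.zeros_like(G)
    for idx in np.ndindex(G.shape):
        Gp, Gm = G.copy(), G.copy()
        Gp[idx] += eps
        Gm[idx] -= eps
        g[idx] = (f(Gp) - f(Gm)) / (2 * eps)
    return g

print(np.allclose(G_prime, numeric_grad(J2, G), atol=1e-5))   # True
print(np.allclose(G_dprime, numeric_grad(J3, G), atol=1e-5))  # True
```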

In this way, the derivative of the joint loss function (1) with respect to \(\varvec{G}\) is the sum of the three parts:

$$\begin{aligned} \frac{\partial J(\varvec{W})}{\partial \varvec{G}}=\varvec{G}(\frac{1}{1+e^{-\varvec{\varPhi }}}-\varvec{S})-2\rho (\varvec{B}-\varvec{G})-\lambda _1\varvec{G}'+\lambda _2\varvec{G}'' \end{aligned}$$
(6)

Last, the derivatives with respect to the parameters \(\varvec{W}\), \(\varvec{Q}\) and \(\varvec{H}\) are computed as follows:

$$\begin{aligned} \frac{\partial J(\varvec{W})}{\partial \varvec{W}}=\varvec{H}\Big (\frac{\partial J(\varvec{W})}{\partial \varvec{G}}\Big )^T \end{aligned}$$
(7)
$$\begin{aligned} \frac{\partial J(\varvec{W})}{\partial \varvec{Q}}=\frac{\partial J(\varvec{W})}{\partial \varvec{G}} \end{aligned}$$
(8)
$$\begin{aligned} \frac{\partial J(\varvec{W})}{\partial \varvec{H}}=\varvec{W}\frac{\partial J(\varvec{W})}{\partial \varvec{G}} \end{aligned}$$
(9)

All the parameters are learned alternately; that is, we update one parameter while keeping the others fixed.

The proposed algorithm of BMDH is summarized as follows.

Algorithm 1. The learning procedure of BMDH.
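
Since the algorithm box is not reproduced here, the following self-contained sketch (Python with numpy) illustrates one plausible reading of the per-batch update: the CNN features H are replaced by a fixed random matrix, only the hash-layer parameters W and Q are updated, and the sizes, learning rate and iteration count are illustrative assumptions rather than the paper's settings.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, feat = 32, 12, 4096                  # batch size, code length, fc7 dimension
H = rng.standard_normal((feat, m))         # stand-in for full7 features of a batch
labels = rng.integers(0, 4, size=m)
S = (labels[:, None] == labels[None, :]).astype(float)

W = 0.001 * rng.standard_normal((feat, n)) # hash-layer projection matrix
Q = np.zeros((n, m))                       # hash-layer bias matrix
rho, lam1, lam2, lr = 10.0, 0.5, 0.1, 1e-4 # rho, lambda_1, lambda_2 as in Sect. 4.3

for _ in range(100):
    G = W.T @ H + Q                        # Eq. (2)
    B = np.where(G > 0, 1.0, -1.0)         # quantized codes
    Phi = 0.5 * (G.T @ G)
    mean = G.mean(axis=1, keepdims=True)   # E(z_i) for each projected dimension
    u = G.var(axis=1)                      # u_i = var(z_i)

    dJ1 = G @ (1.0 / (1.0 + np.exp(-Phi)) - S) - 2.0 * rho * (B - G)
    dJ2 = (2.0 / m) * (G - mean)                                   # G' in the text
    dJ3 = (4.0 / (m * n)) * (u - u.mean())[:, None] * (G - mean)   # G''
    dG = dJ1 - lam1 * dJ2 + lam2 * dJ3     # Eq. (6)

    W -= lr * (H @ dG.T)                   # Eq. (7)
    Q -= lr * dG                           # Eq. (8)
    dH = W @ dG                            # Eq. (9): would be back-propagated into the CNN
```

In the full model, dH is propagated back through the pre-trained CNN-F layers, so feature learning and hash-code learning are optimized jointly.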

4 Experiment

We conduct experiments on two widely used data sets: CIFAR-10 and NUS-WIDE. The mean Average Precision (mAP) is adopted as the evaluation metric. Several conventional hashing methods and deep hashing methods are compared with the proposed method.

4.1 Datasets

  • The CIFAR-10 data set consists of 60,000 \(32\times 32\) color images categorized into 10 classes, with 6,000 images per class.

  • The NUS-WIDE data set consists of nearly 270,000 color images collected from the web. It is a multi-label data set in which each image is annotated with one or more labels from 81 semantic concepts. Following [4, 5, 14], we only use images whose labels belong to the 21 most frequent concept tags; at least 5,000 images are associated with each of these labels.

4.2 Evaluation Metric

We use the mean Average Precision (mAP) as the evaluation metric, i.e., the mean of the average precision over all query samples. It is defined as:

$$\begin{aligned} mAP=\frac{1}{|Q|}\sum _{i=1}^{|Q|}\frac{1}{n_i}\sum _{j=1}^{n_i}Precision(R_{ij}) \end{aligned}$$
(10)

where \(q_i\in Q\) denotes a query sample and \(n_i\) is the number of samples in the data set that are similar to \(q_i\). The relevant samples are ordered as \(\{x_1,x_2,\ldots ,x_{n_i}\}\), and \(R_{ij}\) is the set of ranked retrieval results from the top result down to point \(x_j\).

The definition of Precision is as follows:

$$\begin{aligned} \text {Precision}\,=\,\frac{\text {the number of retrieved relevant points}}{\text {the number of all retrieved points}}. \end{aligned}$$
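
Putting the two definitions together, a compact sketch of the evaluation (Python with numpy; single-label case with ±1 codes, and the whole database is ranked, whereas the NUS-WIDE experiments below only consider the top returned neighbors):

```python
import numpy as np

def mean_average_precision(query_codes, db_codes, query_labels, db_labels):
    """A sketch of Eq. (10): mAP over Hamming-ranked retrieval lists.
    Codes are {-1, 1} matrices; labels are integer class ids."""
    n_bits = db_codes.shape[1]
    ap_values = []
    for q_code, q_label in zip(query_codes, query_labels):
        dist = 0.5 * (n_bits - db_codes @ q_code)   # Hamming distance to every item
        order = np.argsort(dist)                    # ranked retrieval list
        relevant = (db_labels[order] == q_label)
        if relevant.sum() == 0:
            continue
        ranks = np.arange(1, len(order) + 1)
        precision_at_hit = np.cumsum(relevant) / ranks  # Precision(R_ij) at each rank
        ap_values.append(precision_at_hit[relevant].mean())
    return float(np.mean(ap_values))
```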

4.3 Experimental Setting

For CIFAR-10, following [5], we randomly sample 1,000 images (100 images per class) as the query set. The remaining images are used as the database. For the unsupervised methods, the database images are used as the training images. For the supervised methods, the training set consists of 5,000 images (500 images per class) randomly sampled from the remaining images. We construct the pairwise similarity matrix \(\varvec{S}\) from the class labels, where two images sharing the same label are considered similar. For methods using hand-crafted features, we represent each image with a 512-dimensional GIST vector [9].

For NUS-WIDE, we randomly sample 2,100 images (100 images per class) from the selected 21 classes as the query set. For the unsupervised methods, all the remaining images from the 21 classes constitute the training set. For the supervised methods, 500 images per class are randomly sampled from the 21 classes to form the training set. The pairwise similarity matrix \(\varvec{S}\) is constructed from the class labels, where two images sharing at least one common label are considered similar. When calculating the mAP values, only the top 5,000 returned neighbors are used for evaluation. For methods using hand-crafted features, we represent each image with a 1134-dimensional feature vector composed of a 64-D color histogram, a 144-D color correlogram, a 73-D edge direction histogram, 128-D wavelet texture, 225-D block-wise color moments and 500-D SIFT features.

The first seven layers of our network are initialized with the CNN-F network pre-trained on ImageNet. The batch size is set to 128, which means that 128 images are fed into the network in each iteration. The hyper-parameters \(\rho \), \(\lambda _1\) and \(\lambda _2\) are set to 10, 0.5 and 0.1 unless otherwise stated.

Fig. 2. mAP for different numbers of bits with respect to different criteria.

The experiments are carried out on an Intel Core i5-4660 3.2 GHz desktop computer with 8 GB of memory, using MATLAB 2014a and MatConvNet. They are also run on a server with an NVIDIA GTX 1070 GPU.

4.4 Retrieval Results on CIFAR-10 and NUS-WIDE

Performance Improved Step by Step: Since the loss function is composed of three parts, we validate how the performance is gradually improved as the restrictions are added to the objective function one at a time. First, we measure the performance with only the similarity-preserving criterion. Next, the criterion of maximum variance on all projected dimensions is added. Last, we add the criterion of balanced variance on each projected dimension to obtain the final performance.

When the number of iterations is set to 10, the performance for different numbers of bits with respect to the different criteria on CIFAR-10 is shown in Fig. 2. It can be seen that adding the maximum-variance restriction improves upon the similarity-preserving restriction alone, and adding the balanced-variance restriction improves the performance again. This can be attributed to the binary codes capturing more information and each dimension containing an equal amount of information.

Table 1. mean Average Precision (mAP) on the CIFAR-10 dataset and the NUS-WIDE dataset. The highest values are shown in boldface. The mAP for the NUS-WIDE dataset is calculated based on the top 5,000 returned neighbors.

Performance Compared with Conventional Methods with Hand-crafted Features: To validate the superiority of deep features over hand-crafted features, we compare the proposed method with conventional methods using hand-crafted features, including FastH [6], SDH [10], KSH [9], LFH [15], SPLH [11], ITQ [2] and SH [13]. The results are shown in Table 1. The BMDH method outperforms the compared algorithms by a large margin, whether they are unsupervised or supervised, because deep architectures provide richer semantic features despite the tremendous appearance variations. The results of CNNH, KSH and ITQ are copied from [4, 14] and the results of the other compared methods are from [5]; this practice is reasonable because the experimental settings and evaluation metric are identical.

Table 2. mean Average Precision (mAP) on the CIFAR-10 dataset and the NUS-WIDE dataset. The highest values are shown in boldface. The mAP for the NUS-WIDE dataset is calculated based on the top 5,000 returned neighbors.

Performance Compared with Conventional Methods with Deep Features: To verify that the improvement in performance originates from our method rather than from the deep network alone, we compare the proposed method with several conventional methods using deep features extracted by the CNN-F network. Table 2 shows the results. Our method outperforms the compared methods on both datasets, which demonstrates the progressiveness of the proposed method. In addition, the results validate the effectiveness of the end-to-end framework. The results of the compared methods are from [5, 12], which is reasonable as the experimental settings and evaluation metric are identical.

Table 3. mean Average Precision (mAP) on the CIFAR-10 dataset and the NUS-WIDE dataset. The highest values are shown in boldface. The mAP for the NUS-WIDE dataset is calculated based on the top 50,000 returned neighbors.

Performance Compared with Deep Hashing Methods: We compare the proposed method with some deep hashing methods including DPSH [5], NINH [4], CNNH [14], DSRH [17], DSCH [16] and DRSCH [16]. These methods have not considered maximizing the total variance on all projected dimensions and balancing the variance on each dimension.

When comparing with DSRH, DSCH and DRSCH, we adopt the same experimental setting as [16] for a fair comparison. Specifically, on CIFAR-10, we randomly sample 10,000 images (1,000 images per class) as the query set. The remaining images form the database set, which is also used as the training set. On NUS-WIDE, we randomly sample 2,100 images (100 images per class) from the 21 classes as the query set; similarly, the remaining images serve as both the database and the training images. The top 50,000 returned neighbors are used for evaluation when calculating the mAP values. DPSH# in Table 3 denotes the DPSH method under this new experimental setting. The results of the compared methods are taken directly from [5, 16], which is reasonable as the experimental settings and evaluation metric are identical. From Tables 1 and 3 we can see that the performance of our method is superior to that of the compared deep hashing methods, including the current state-of-the-art method DPSH, which verifies the effectiveness of maximizing the total variance on all dimensions and balancing the variance on each dimension. Additionally, the superiority of our method over CNNH demonstrates the advantage of simultaneous feature learning and hash-code learning.

5 Conclusion

In this paper, we present a novel deep hashing algorithm dubbed BMDH. To the best of our knowledge, it is the first deep hashing method in which maximizing the total variance on all projected dimensions and balancing the variance on each dimension are considered simultaneously. The back-propagation algorithm with stochastic gradient descent (SGD) and the end-to-end design ensure that the optimal parameters are obtained despite the complex objective function. Experimental results on two widely used data sets demonstrate that the proposed method outperforms other state-of-the-art algorithms in image retrieval applications.