Abstract
This work proposes deep network models and learning algorithms for unsupervised and supervised binary hashing. Our novel network design constrains one hidden layer to directly output the binary codes. This addresses a challenging issue in some previous works: optimizing non-smooth objective functions due to binarization. Moreover, we incorporate independence and balance properties in the direct and strict forms in the learning. Furthermore, we include similarity preserving property in our objective function. Our resulting optimization with these binary, independence, and balance constraints is difficult to solve. We propose to attack it with alternating optimization and careful relaxation. Experimental results on three benchmark datasets show that our proposed methods compare favorably with the state of the art.
1 Introduction
We are interested in learning binary hash codes for large scale visual search. Two main difficulties with large scale visual search are efficient storage and fast searching. An attractive approach for handling these difficulties is binary hashing, where each original high dimensional vector \({\mathbf x}\in {\mathbb R}^D\) is mapped to a very compact binary vector \({\mathbf b}\in \{-1,1\}^L\), where \(L \ll D\).
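To make the computational benefit concrete, the following minimal sketch (with hypothetical random codes standing in for learned ones) shows how packed binary codes are compared with XOR and popcount, which is what makes storage compact and search in Hamming space fast:

```python
import numpy as np

# Hypothetical toy example: random codes stand in for learned hash codes.
rng = np.random.default_rng(0)
L = 32                                              # code length
database = rng.choice([-1, 1], size=(100000, L))    # one code b in {-1,1}^L per row
query = rng.choice([-1, 1], size=(L,))

# Pack {-1,1} codes into bits: storage drops from 4*L bytes (float) to L/8 bytes.
db_bits = np.packbits(database == 1, axis=1)        # shape (100000, L//8), uint8
q_bits = np.packbits(query == 1)

# Hamming distance = popcount of the XOR between packed codes.
xor = np.bitwise_xor(db_bits, q_bits)
hamming = np.unpackbits(xor, axis=1).sum(axis=1)
nearest = np.argsort(hamming)[:50]                  # indices of the 50 nearest codes
```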
Hashing methods can be divided into two categories: data-independent and data-dependent. Methods in the data-independent category [1–4] rely on random projections to construct hash functions. Methods in the data-dependent category use the available training data to learn the hash functions in an unsupervised [5–9] or supervised [10–15] manner. Reviews of data-independent and data-dependent hashing methods can be found in recent surveys [16–18].
One difficult problem in hashing is dealing with the binary constraint on the codes. Specifically, the outputs of the hash functions have to be binary. In general, this binary constraint leads to an NP-hard mixed-integer optimization problem. To handle this difficulty, most aforementioned methods relax the constraint during the learning of the hash functions. With this relaxation, continuous codes are learned first and then binarized (e.g., by thresholding). This relaxation greatly simplifies the original binary constrained problem. However, the solution can be suboptimal, i.e., the binary codes obtained by thresholding continuous codes could be inferior to those obtained by including the binary constraint in the learning.
Furthermore, a good hashing method should produce binary codes with the following properties [5]: (i) similarity preserving, i.e., (dis)similar inputs should likely have (dis)similar binary codes; (ii) independence, i.e., different bits in the binary codes are independent of each other; (iii) balance, i.e., each bit has a \(50\,\%\) chance of being 1 or \(-1\). Directly incorporating the independence and balance properties can complicate the learning. Previous work has used relaxations to work around this problem [6, 19, 20], but this may degrade performance.
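With \({\mathbf B}\in \{-1,1\}^{L\times m}\) denoting the codes of m training samples (one code per column), the independence and balance properties can be written in the matrix form used later in this paper:
\[
\frac{1}{m}{\mathbf B}{\mathbf B}^T = {\mathbf I} \quad \text{(independence)}, \qquad {\mathbf B}{\mathbf 1}_{m\times 1} = {\mathbf 0} \quad \text{(balance)}.
\]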
1.1 Related Work
Our work is inspired by a few recent successful hashing methods which define hash functions as a neural network [19, 21, 22]. We propose an improved design to address their limitations. In Semantic Hashing [21], the model is formed by a stack of Restricted Boltzmann Machines, and a pretraining step is required. This model does not consider the independence and balance of the codes. In Binary Autoencoder [22], a linear autoencoder is used as the hash function. As this model only uses one hidden layer, it may not capture the input information well. Extending [22] with multiple, nonlinear layers is not straightforward because of the binary constraint. It also does not consider the independence and balance of the codes. In Deep Hashing [19], a deep neural network is used as the hash function. However, this model does not fully take the similarity preserving property into account. It also applies relaxations when enforcing the independence and balance of the codes, which may degrade the performance.
In order to handle the binary constraint, Semantic Hashing [21] first solves the relaxed problem by discarding the constraint and then thresholds the resulting continuous solution. In Deep Hashing (DH) [19], the output of the last layer, \({\mathbf H}^{(n)}\), is binarized by the sgn function. A term is included in the objective function to reduce this binarization loss: \(\left\Vert sgn({\mathbf H}^{(n)}) - {\mathbf H}^{(n)} \right\Vert ^2\). Solving the objective function of DH [19] is difficult because the sgn function is non-differentiable. The authors of [19] work around this difficulty by assuming that the sgn function is differentiable everywhere. In Binary Autoencoder (BA) [22], the outputs of the hidden layer are passed through a step function to binarize the codes. Incorporating the step function in the learning leads to a non-smooth objective function, and the optimization is NP-complete. To handle this difficulty, BA uses binary SVMs to learn the model parameters in the case of a single hidden layer.
1.2 Contribution
In this work, we first propose a novel deep network model and learning algorithm for unsupervised hashing. In order to achieve binary codes, instead of involving the sgn or step function as in [19, 22], our proposed network design constrains one layer to directly output the binary codes (hence the network is called the Binary Deep Neural Network). Moreover, we propose to directly incorporate the independence and balance properties without relaxing them. Furthermore, we include the similarity preserving property in our objective function. The resulting optimization with these binary and direct constraints is NP-hard. We propose to attack this challenging problem with alternating optimization and careful relaxation. To enhance the discriminative power of the binary codes, we then extend our method to supervised hashing by leveraging the label information such that the binary codes preserve the semantic similarity between samples. Extensive experiments on three benchmark datasets show the improvement of the proposed methods over state-of-the-art hashing methods.
The remainder of this paper is organized as follows. Section 2 and Sect. 3 present and evaluate the proposed unsupervised hashing method, respectively. Section 4 and Sect. 5 present and evaluate the proposed supervised hashing method, respectively. Section 6 concludes the paper.
2 Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)
2.1 Formulation of UH-BDNN
We summarize the notations in Table 1. In our work, the hash functions are defined by a deep neural network. In our proposed design, we use different activation functions in different layers. Specifically, we use the sigmoid function as the activation function for layers \(2,\cdots ,n-2\), and the identity function as the activation function for layer \(n-1\) and layer n. Our idea is to learn the network such that the output values of the penultimate layer (layer \(n-1\)) can be used as the binary codes. We introduce constraints in the learning algorithm such that the output values at layer \(n-1\) have the following desirable properties: (i) belonging to \(\{-1,1\}\); (ii) similarity preserving; (iii) independence; and (iv) balance. Figure 1 illustrates our network for the case \(D=4,L=2\).
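The following minimal numpy sketch (not the authors' implementation; the weights are random placeholders and the layer sizes are the ones used later in Sect. 3.1 for 8-bit codes) illustrates the forward pass implied by this design:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(X, Ws, cs):
    """Forward pass: sigmoid activations for layers 2..n-2, identity for
    layer n-1 (the codes) and layer n (the reconstruction).
    X is D x m (one sample per column); Ws[i] plays the role of W^(i+1),
    mapping layer i+1 to layer i+2, and cs[i] is the corresponding bias."""
    n = len(Ws) + 1                    # number of layers
    Hs = [X]                           # H^(1) = X
    H = X
    for l in range(2, n + 1):          # compute H^(2), ..., H^(n)
        Z = Ws[l - 2] @ H + cs[l - 2]          # bias broadcasts over the m columns
        H = Z if l >= n - 1 else sigmoid(Z)    # identity at layers n-1 and n
        Hs.append(H)
    return Hs                          # Hs[-2] holds the (relaxed) codes H^(n-1)

D, m, L = 800, 1000, 8
sizes = [D, 90, 20, L, D]              # n = 5 layers: input, two hidden, codes, reconstruction
rng = np.random.default_rng(0)
Ws = [0.1 * rng.standard_normal((sizes[i + 1], sizes[i])) for i in range(4)]
cs = [np.zeros((sizes[i + 1], 1)) for i in range(4)]
Hs = forward(rng.standard_normal((D, m)), Ws, cs)
codes = np.sign(Hs[-2])                # one way to binarize layer n-1 outputs at test time
```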
Let us start with the first two properties of the codes, i.e., belonging to \(\{-1,1\}\) and similarity preserving. To achieve binary codes with these two properties, we propose to optimize the following constrained objective function
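(The displayed equations are not reproduced here; the following is a reconstruction from the term-by-term description below, with the exact scaling constants assumed.)
\[
\min _{{\mathbf W},{\mathbf c}}\; J=\frac{1}{2m}\left\Vert {\mathbf X}-\left( {\mathbf W}^{(n-1)}{\mathbf H}^{(n-1)}+{\mathbf c}^{(n-1)}{\mathbf 1}_{1\times m}\right) \right\Vert ^2+\frac{\lambda _1}{2}\sum _{l=1}^{n-1}\left\Vert {\mathbf W}^{(l)}\right\Vert ^2 \qquad (1)
\]
\[
\text {s.t.}\quad {\mathbf H}^{(n-1)}\in \{-1,1\}^{L\times m} \qquad (2)
\]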
The constraint (2) is to ensure the first property. As the activation function for the last layer is the identity function, the term \(\left( {\mathbf W}^{(n-1)}{\mathbf H}^{(n-1)}+{\mathbf c}^{(n-1)}{\mathbf 1}_{1\times m}\right) \) is the output of the last layer. The first term of (1) makes sure that the binary code gives a good reconstruction of \({\mathbf X}\). It is worth noting that the reconstruction criterion has been used as an indirect way of preserving similarity in state-of-the-art unsupervised hashing methods [6, 21, 22], i.e., it encourages (dis)similar inputs to map to (dis)similar binary codes. The second term is a regularization that tends to decrease the magnitude of the weights, which helps to prevent overfitting. Note that in our proposed design, we constrain one layer to directly output the binary codes, and this avoids the difficulties with the sgn/step function such as non-differentiability. On the other hand, our formulation (1) under the binary constraint (2) is very difficult to solve. It is a mixed-integer problem, which is NP-hard. We propose to attack the problem using alternating optimization by introducing an auxiliary variable. Using the auxiliary variable \({\mathbf B}\), we reformulate the objective function (1) under constraint (2) as
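(Again a reconstruction, under the same assumptions on the scaling constants.)
\[
\min _{{\mathbf W},{\mathbf c},{\mathbf B}}\; J=\frac{1}{2m}\left\Vert {\mathbf X}-\left( {\mathbf W}^{(n-1)}{\mathbf B}+{\mathbf c}^{(n-1)}{\mathbf 1}_{1\times m}\right) \right\Vert ^2+\frac{\lambda _1}{2}\sum _{l=1}^{n-1}\left\Vert {\mathbf W}^{(l)}\right\Vert ^2 \qquad (3)
\]
\[
\text {s.t.}\quad {\mathbf B}={\mathbf H}^{(n-1)} \qquad (4) \qquad\quad {\mathbf B}\in \{-1,1\}^{L\times m} \qquad (5)
\]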
The benefit of introducing the auxiliary variable \({\mathbf B}\) is that we can decompose the difficult constrained optimization problem (1) into two sub-optimization problems. We can then solve the optimization iteratively by alternating between \(({\mathbf W},{\mathbf c})\) and \({\mathbf B}\), optimizing one while holding the other fixed. We will discuss the details of the alternating optimization in a moment. Using the idea of the quadratic penalty method [23], we relax the equality constraint (4) by solving the following constrained objective function
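(Reconstructed; the third term is the quadratic penalty discussed below, and the scaling constants are assumed.)
\[
\min _{{\mathbf W},{\mathbf c},{\mathbf B}}\; J=\frac{1}{2m}\left\Vert {\mathbf X}-\left( {\mathbf W}^{(n-1)}{\mathbf B}+{\mathbf c}^{(n-1)}{\mathbf 1}_{1\times m}\right) \right\Vert ^2+\frac{\lambda _1}{2}\sum _{l=1}^{n-1}\left\Vert {\mathbf W}^{(l)}\right\Vert ^2+\frac{\lambda _2}{2m}\left\Vert {\mathbf H}^{(n-1)}-{\mathbf B}\right\Vert ^2 \qquad (6)
\]
\[
\text {s.t.}\quad {\mathbf B}\in \{-1,1\}^{L\times m} \qquad (7)
\]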
The third term in (6) measures the (equality) constraint violation. By setting the penalty parameter \(\lambda _2\) sufficiently large, we penalize the constraint violation severely, thereby forcing the minimizer of the penalty function (6) closer to the feasible region of the original constrained problem (3).
Now let us consider the two remaining properties of the codes, i.e., independence and balance. Unlike previous works which use some relaxation or approximation of the independence and balance properties [6, 19, 20], we propose to encode these properties strictly and directly on the binary outputs of our layer \(n-1\) (see Note 1). Specifically, we encode the independence and balance properties of the codes with the fourth and fifth terms, respectively, in the following constrained objective function
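(Reconstructed; the fourth and fifth terms are the independence and balance terms quoted in the comparison below, and the scaling constants are assumed.)
\[
\min _{{\mathbf W},{\mathbf c},{\mathbf B}}\; J=\frac{1}{2m}\left\Vert {\mathbf X}-\left( {\mathbf W}^{(n-1)}{\mathbf B}+{\mathbf c}^{(n-1)}{\mathbf 1}_{1\times m}\right) \right\Vert ^2+\frac{\lambda _1}{2}\sum _{l=1}^{n-1}\left\Vert {\mathbf W}^{(l)}\right\Vert ^2+\frac{\lambda _2}{2m}\left\Vert {\mathbf H}^{(n-1)}-{\mathbf B}\right\Vert ^2
\]
\[
\qquad\qquad +\frac{\lambda _3}{2}\left\Vert \frac{1}{m}{\mathbf H}^{(n-1)}({\mathbf H}^{(n-1)})^T-{\mathbf I}\right\Vert ^2+\frac{\lambda _4}{2m}\left\Vert {\mathbf H}^{(n-1)}{\mathbf 1}_{m\times 1}\right\Vert ^2 \qquad (8)
\]
\[
\text {s.t.}\quad {\mathbf B}\in \{-1,1\}^{L\times m} \qquad (9)
\]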
(8) under constraint (9) is our final formulation. Before discussing how to solve it, let us present the differences between our work and the recent deep learning based hashing models Deep Hashing [19] and Binary Autoencoder [22].
The first important difference between our model and Deep Hashing [19] / Binary Autoencoder [22] is how the binary codes are achieved. Instead of involving the sgn or step function as in [19, 22], we constrain the network to directly output the binary codes at one layer. The other differences are presented as follows.
Comparison to Deep Hashing (DH) [19]: the deep model of DH is learned by the following formulation:

The DH model does not have a reconstruction layer. The sgn function is applied to the outputs of the top layer of the network to obtain the binary codes. The first term aims to minimize the quantization loss of applying the sgn function to the outputs of the top layer. The balance and independence properties are enforced by the second and third terms [19]. It is worth noting that minimizing DH's objective function is difficult due to the non-differentiability of the sgn function. The authors work around this difficulty by assuming that the sgn function is differentiable everywhere.
Contrary to DH, we propose a different model design. In particular, our model encourages similarity preserving by having the reconstruction layer in the network. For the balance property, DH maximizes \(tr \left( {\mathbf H}^{(n)}({\mathbf H}^{(n)})^T \right) \). According to [20], maximizing this term only approximates the balance property. In our objective function, the balance property is directly enforced on the codes by the term \(\vert \vert {{\mathbf H}^{(n-1)}{\mathbf 1}_{m\times 1}}\vert \vert ^2\). For the independence property, DH uses a relaxed orthogonality constraint \(\vert \vert {{\mathbf W}^{(l)}({\mathbf W}^{(l)})^T - {\mathbf I}}\vert \vert ^2\), i.e., a constraint on the network weights \({\mathbf W}\). On the contrary, we (once again) directly constrain the codes using \(\vert \vert {\frac{1}{m}{\mathbf H}^{(n-1)}({\mathbf H}^{(n-1)})^T-{\mathbf I}}\vert \vert ^2\). Incorporating these strict constraints can lead to better performance.
Comparison to Binary Autoencoder (BA) [22]: the differences between our model and BA are quite clear. BA as described in [22] is a shallow linear autoencoder network with one hidden layer. The BA hash function is a linear transformation of the input followed by a step function to obtain the binary codes. In BA, by treating the encoder layer as binary classifiers, the weights of the linear transformation are learned with binary SVMs. In contrast, our hash function is defined by multiple, hierarchical layers of nonlinear and linear transformations. It is not clear whether the binary SVM approach of BA can be used to learn the weights in our deep architecture with multiple layers. Instead, we use alternating optimization to derive a backpropagation algorithm to learn the weights in all layers. Another difference is that our model ensures the independence and balance of the binary codes while BA does not. Note that the independence and balance properties may not be easily incorporated into their framework, as they would complicate the objective function and the optimization problem may become very difficult to solve.
2.2 Optimization
In order to solve (8) under constraint (9), we propose to use alternating optimization over \(({\mathbf W},{\mathbf c})\) and \({\mathbf B}\).
\(({\mathbf W},{\mathbf c})\) step. When fixing \({\mathbf B}\), the problem becomes an unconstrained optimization. We use the L-BFGS [24] optimizer with backpropagation to solve it. The gradients of the objective function J (8) w.r.t. the different parameters are computed as follows.
At \(l = n-1\), we have
For other layers, let us define
where \(\odot \) denotes Hadamard product; \({\mathbf Z}^{(l)} = {\mathbf W}^{(l-1)}{\mathbf H}^{(l-1)} + {\mathbf c}^{(l-1)}{\mathbf 1}_{1\times m}\), \(l=2,\cdots ,n\).
Then, \(\forall l = n-2,\cdots ,1\), we have
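(The displayed gradient equations are not reproduced here. Assuming \(\varDelta ^{(l)}\) denotes \(\partial J/\partial {\mathbf Z}^{(l)}\), they presumably take the standard backpropagation form for sigmoid layers; a generic reconstruction, with \(\varDelta ^{(n-1)}\) collecting the data-dependent terms of (8), is:)
\[
\varDelta ^{(l)}=\left( ({\mathbf W}^{(l)})^T\varDelta ^{(l+1)}\right) \odot {\mathbf H}^{(l)}\odot \left( {\mathbf 1}-{\mathbf H}^{(l)}\right),\qquad
\frac{\partial J}{\partial {\mathbf W}^{(l)}}=\varDelta ^{(l+1)}({\mathbf H}^{(l)})^T+\lambda _1{\mathbf W}^{(l)},\qquad
\frac{\partial J}{\partial {\mathbf c}^{(l)}}=\varDelta ^{(l+1)}{\mathbf 1}_{m\times 1}.
\]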
\({\mathbf B}\) step. When fixing \(({\mathbf W},{\mathbf c})\), we can rewrite problem (8) as
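(Reconstructed: only the terms of (8) that involve \({\mathbf B}\) remain.)
\[
\min _{{\mathbf B}}\; \frac{1}{2m}\left\Vert {\mathbf X}-\left( {\mathbf W}^{(n-1)}{\mathbf B}+{\mathbf c}^{(n-1)}{\mathbf 1}_{1\times m}\right) \right\Vert ^2+\frac{\lambda _2}{2m}\left\Vert {\mathbf H}^{(n-1)}-{\mathbf B}\right\Vert ^2 \qquad \text {s.t.}\quad {\mathbf B}\in \{-1,1\}^{L\times m}.
\]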
We adapt the recent discrete cyclic coordinate descent method [15] to iteratively solve for \({\mathbf B}\), i.e., row by row. The advantage of this method is that if we fix \(L-1\) rows of \({\mathbf B}\) and only solve for the remaining row, we obtain a closed-form solution for that row.
Let \({\mathbf V}= {\mathbf X}-{\mathbf c}^{(n-1)}{\mathbf 1}_{1\times m}\) and \({\mathbf Q}= ({\mathbf W}^{(n-1)})^T{\mathbf V}+\lambda _2{\mathbf H}^{(n-1)}\). For \(k=1,\cdots ,L\), let \({\mathbf w}_k\) be the \(k^{th}\) column of \({\mathbf W}^{(n-1)}\); \({\mathbf W}_1\) be the matrix \({\mathbf W}^{(n-1)}\) excluding \({\mathbf w}_k\); \({\mathbf q}_k\) be the \(k^{th}\) column of \({\mathbf Q}^T\); \({\mathbf b}_k^T\) be the \(k^{th}\) row of \({\mathbf B}\); and \({\mathbf B}_1\) be the matrix \({\mathbf B}\) excluding \({\mathbf b}_k^T\). We have the closed-form solution for \({\mathbf b}_k^T\) as
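(The displayed closed form is not reproduced here; carrying out the coordinate-wise minimization with the notation above gives the reconstruction)
\[
{\mathbf b}_k^T= sgn\left( {\mathbf q}_k^T-{\mathbf w}_k^T{\mathbf W}_1{\mathbf B}_1\right) .
\]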
The proposed UH-BDNN method is summarized in Algorithm 1. In Algorithm 1, \({\mathbf B}_{(t)}\) and \(({\mathbf W},{\mathbf c})_{(t)}\) are the values of \({\mathbf B}\) and \(\{{\mathbf W}^{(l)},{\mathbf c}^{(l)}\}_{l=1}^{n-1}\) at iteration t.

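As a concrete illustration of the \({\mathbf B}\) step, the sketch below implements one sweep of the row-wise updates in numpy, using the closed form reconstructed above (the tie-breaking for zero entries is an assumption):

```python
import numpy as np

def update_B(W, c, H, X, lam2, B):
    """One sweep of discrete cyclic coordinate descent over the rows of B.
    W: D x L (W^(n-1)), c: D x 1 (c^(n-1)), H: L x m (H^(n-1)), X: D x m, B: L x m."""
    V = X - c                            # c broadcasts over the m columns
    Q = W.T @ V + lam2 * H               # as defined in the text
    for k in range(B.shape[0]):
        w_k = W[:, k]                    # k-th column of W^(n-1)
        W1 = np.delete(W, k, axis=1)     # W^(n-1) without column k
        B1 = np.delete(B, k, axis=0)     # B without row k
        q_k = Q[k, :]                    # k-th row of Q
        b_k = np.sign(q_k - w_k @ W1 @ B1)
        b_k[b_k == 0] = 1                # assumed tie-breaking to stay in {-1,1}
        B[k, :] = b_k
    return B
```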
3 Evaluation of Unsupervised Hashing with Binary Deep Neural Network (UH-BDNN)
This section evaluates the proposed UH-BDNN and compares it to the following state-of-the-art unsupervised hashing methods: Spectral Hashing (SH) [5], Iterative Quantization (ITQ) [6], Binary Autoencoder (BA) [22], Spherical Hashing (SPH) [8], K-means Hashing (KMH) [7]. For all compared methods, we use the implementations and the suggested parameters provided by the authors.
3.1 Dataset, Evaluation Protocol, and Implementation Note
Dataset. The CIFAR10 [25] dataset consists of 60,000 images of 10 classes. The training set (also used as the database for retrieval) contains 50,000 images. The query set contains 10,000 images. Each image is represented by an 800-dimensional feature vector obtained by applying PCA to the 4096-dimensional CNN features produced by AlexNet [26].
The MNIST [27] dataset consists of 70,000 handwritten digit images of 10 classes. The training set (also used as the database for retrieval) contains 60,000 images. The query set contains 10,000 images. Each image is represented by a 784-dimensional gray-scale feature vector formed by its pixel intensities.
The SIFT1M [28] dataset contains 128-dimensional SIFT vectors [29]. There are 1M vectors used as the database for retrieval, 100K vectors for training (separate from the retrieval database), and 10K query vectors.
Evaluation protocol. We follow the standard setting in unsupervised hashing [6–8, 22] and use Euclidean nearest neighbors as the ground truths for queries. The number of ground truths is set as in [22]: for the CIFAR10 and MNIST datasets, for each query we use its 50 Euclidean nearest neighbors as ground truths; for the large scale SIFT1M dataset, for each query we use its 10,000 Euclidean nearest neighbors as ground truths. We use the following evaluation metrics, which have been used in the state of the art [6, 19, 22], to measure the performance of the methods: (1) mean Average Precision (mAP); (2) precision of Hamming radius 2 (precision@2), which measures the precision over retrieved images having Hamming distance to the query \(\le 2\) (if no image satisfies this, we report zero precision). Note that as computing mAP is slow on the large SIFT1M dataset, we consider the top 10,000 returned neighbors when computing mAP.
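To make the precision@2 metric concrete, a minimal sketch for a single query (the array names are hypothetical):

```python
import numpy as np

def precision_at_radius2(hamming_dist, is_ground_truth):
    """precision@2 for one query, following the definition above.
    hamming_dist: (N,) Hamming distances from the query to all database codes.
    is_ground_truth: (N,) boolean mask of the query's Euclidean nearest neighbors."""
    retrieved = hamming_dist <= 2
    if not retrieved.any():
        return 0.0                       # no image within radius 2: report zero precision
    return float(is_ground_truth[retrieved].mean())
```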
Implementation note. In our deep model, we use \(n=5\) layers. The parameters \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) and \(\lambda _4\) are empirically set by cross validation to \(10^{-5}\), \(5\times 10^{-2}\), \(10^{-2}\) and \(10^{-6}\), respectively. The maximum number of iterations T is empirically set to 10. The numbers of units in hidden layers 2, 3 and 4 are empirically set to \([90 \rightarrow 20 \rightarrow 8]\), \([90 \rightarrow 30 \rightarrow 16]\), \([100 \rightarrow 40 \rightarrow 24]\) and \([120 \rightarrow 50 \rightarrow 32]\) for 8, 16, 24 and 32-bit codes, respectively.
3.2 Retrieval Results
Figure 2 and Table 2 show the comparative mAP and precision of Hamming radius 2 (precision@2), respectively. The following observations are consistent across all three datasets. In terms of mAP, the proposed UH-BDNN is comparable to or outperforms the other methods at all code lengths. The improvement is clearer at high code lengths, i.e., \(L=24,32\). The mAP of UH-BDNN consistently outperforms that of Binary Autoencoder (BA) [22], which is the current state-of-the-art unsupervised hashing method. In terms of precision@2, UH-BDNN is comparable to the other methods at low L, i.e., \(L = 8, 16\). At \(L = 24, 32\), UH-BDNN significantly outperforms the other methods.
Comparison with Deep Hashing (DH) [19]: as the implementation of DH is not available, we set up the experiments on CIFAR10 and MNIST similarly to [19] to make a fair comparison. For each dataset, we randomly sample 1,000 images, 100 per class, as the query set; the remaining images are used as the training/database set. Following [19], for CIFAR10, each image is represented by a 512-D GIST descriptor [30]. The ground truths of queries are based on their class labels. Similar to [19], we report comparative results in terms of mAP and the precision of Hamming radius \(r=2\). The comparative results are presented in Table 3, which clearly shows that the proposed UH-BDNN outperforms DH [19] at all code lengths, in both mAP and precision of Hamming radius.
4 Supervised Hashing with Binary Deep Neural Network (SH-BDNN)
In order to enhance the discriminative power of the binary codes, we extend UH-BDNN to supervised hashing by leveraging the label information. Several approaches have been proposed to leverage label information, leading to different criteria on the binary codes. In [10, 31], binary codes are learned such that they minimize the Hamming distance between within-class samples, while maximizing the Hamming distance between samples from different classes. In [15], the binary codes are learned such that they are optimal for linear classification.
In this work, in order to leverage the label information, we follow the approach proposed in Kernel-based Supervised Hashing (KSH) [11]. The benefit of this approach is that it directly encourages the Hamming distance between binary codes of within-class samples to be 0, and the Hamming distance between binary codes of samples from different classes to be L. In other words, it tries to perfectly preserve the semantic similarity. To achieve this goal, it enforces that the Hamming distance between learned binary codes highly correlates with the pre-computed pairwise label matrix.
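Concretely, for codes \({\mathbf b}_i,{\mathbf b}_j\in \{-1,1\}^L\), the Hamming distance and the code inner product are related by
\[
d_{H}({\mathbf b}_i,{\mathbf b}_j)=\frac{1}{2}\left( L-{\mathbf b}_i^T{\mathbf b}_j\right) ,
\]
so pushing \(\frac{1}{L}{\mathbf b}_i^T{\mathbf b}_j\) towards 1 for within-class pairs and towards \(-1\) for between-class pairs corresponds to Hamming distances of 0 and L, respectively.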
In general, the network structure of SH-BDNN is similar to that of UH-BDNN, except that the reconstruction layer (the last layer of UH-BDNN) is removed. Layer \(n-1\) of UH-BDNN becomes the last layer of SH-BDNN. All desirable properties, i.e., semantic similarity preserving, independence, and balance, are enforced on the outputs of the last layer of SH-BDNN.
4.1 Formulation of SH-BDNN
We define the pairwise label matrix \({\mathbf S}\) as
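(The displayed definition is not reproduced here; following the KSH convention [11] that the surrounding text relies on, it is)
\[
{\mathbf S}_{ij}=\begin{cases} 1 & \text {if }{\mathbf x}_i\text { and }{\mathbf x}_j\text { are in the same class}\\ -1 & \text {otherwise.}\end{cases}
\]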
To achieve the semantic similarity preserving property, we learn the binary codes such that the Hamming distance between learned binary codes highly correlates with the matrix \({\mathbf S}\), i.e., we want to minimize the quantity \(\vert \vert {\frac{1}{L} ({\mathbf H}^{(n)})^T{\mathbf H}^{(n)} - {\mathbf S}}\vert \vert ^2\). In addition, to achieve the independence and balance properties of codes, we want to minimize the quantities \(\vert \vert {\frac{1}{m}{\mathbf H}^{(n)}({\mathbf H}^{(n)})^T-{\mathbf I}}\vert \vert ^2\) and \(\vert \vert {{\mathbf H}^{(n)}{\mathbf 1}_{m\times 1}}\vert \vert ^2\).
Following the same reformulation and relaxation as in UH-BDNN (Sect. 2.1), we solve the following constrained optimization, which ensures the binary constraint and the semantic similarity preserving, independence, and balance properties of the codes
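(Reconstructed from the quantities described above; the scaling constants are assumed.)
\[
\min _{{\mathbf W},{\mathbf c},{\mathbf B}}\; J=\frac{1}{2m}\left\Vert \frac{1}{L}({\mathbf H}^{(n)})^T{\mathbf H}^{(n)}-{\mathbf S}\right\Vert ^2+\frac{\lambda _1}{2}\sum _{l=1}^{n-1}\left\Vert {\mathbf W}^{(l)}\right\Vert ^2+\frac{\lambda _2}{2m}\left\Vert {\mathbf H}^{(n)}-{\mathbf B}\right\Vert ^2
\]
\[
\qquad\qquad +\frac{\lambda _3}{2}\left\Vert \frac{1}{m}{\mathbf H}^{(n)}({\mathbf H}^{(n)})^T-{\mathbf I}\right\Vert ^2+\frac{\lambda _4}{2m}\left\Vert {\mathbf H}^{(n)}{\mathbf 1}_{m\times 1}\right\Vert ^2 \qquad (20)
\]
\[
\text {s.t.}\quad {\mathbf B}\in \{-1,1\}^{L\times m} \qquad (21)
\]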
(20) under constraint (21) is our formulation for supervised hashing. The main difference in formulation between UH-BDNN (8) and SH-BDNN (20) is that the reconstruction term preserving the neighbor similarity in UH-BDNN (8) is replaced by the term preserving the label similarity in SH-BDNN (20).
4.2 Optimization
In order to solve (20) under constraint (21), we use alternating optimization over \(({\mathbf W},{\mathbf c})\) and \({\mathbf B}\).
\(({\mathbf W},{\mathbf c})\) step. When fixing \({\mathbf B}\), (20) becomes an unconstrained optimization. We use the L-BFGS [24] optimizer with backpropagation to solve it. The gradients of the objective function J (20) w.r.t. the different parameters are computed as follows.
Let us define
where \({\mathbf V}= \frac{1}{L}({\mathbf H}^{(n)})^T{\mathbf H}^{(n)} - {\mathbf S}\).
where \(\odot \) denotes Hadamard product; \({\mathbf Z}^{(l)} = {\mathbf W}^{(l-1)}{\mathbf H}^{(l-1)} + {\mathbf c}^{(l-1)} {\mathbf 1}_{1\times m}\), \(l=2,\cdots ,n\).
Then \(\forall l = n-1,\cdots ,1\), we have
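(The displayed equations are not reproduced here. For reference, the gradient of the similarity-preserving term alone, which is the main difference from the UH-BDNN derivation in Sect. 2.2, can be derived as)
\[
\frac{\partial }{\partial {\mathbf H}^{(n)}}\left( \frac{1}{2m}\left\Vert \frac{1}{L}({\mathbf H}^{(n)})^T{\mathbf H}^{(n)}-{\mathbf S}\right\Vert ^2\right) =\frac{2}{mL}{\mathbf H}^{(n)}{\mathbf V},
\]
using the symmetry of \({\mathbf V}\); the remaining terms of (20) contribute gradients analogous to those in Sect. 2.2.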
\({\mathbf B}\) step. When fixing \(({\mathbf W},{\mathbf c})\), we can rewrite problem (20) as
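(Reconstructed: only the \(\lambda _2\) penalty term of (20) involves \({\mathbf B}\).)
\[
\min _{{\mathbf B}}\; \left\Vert {\mathbf H}^{(n)}-{\mathbf B}\right\Vert ^2 \qquad (26) \qquad\quad \text {s.t.}\quad {\mathbf B}\in \{-1,1\}^{L\times m} \qquad (27)
\]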
It is easy to see that the optimal solution for (26) under constraint (27) is \({\mathbf B}= sgn({\mathbf H}^{(n)})\).
The proposed SH-BDNN method is summarized in Algorithm 2. In Algorithm 2, \({\mathbf B}_{(t)}\) and \(({\mathbf W},{\mathbf c})_{(t)}\) are the values of \({\mathbf B}\) and \(\{{\mathbf W}^{(l)},{\mathbf c}^{(l)}\}_{l=1}^{n-1}\) at iteration t.

5 Evaluation of Supervised Hashing with Binary Deep Neural Network (SH-BDNN)
This section evaluates the proposed SH-BDNN and compares it to the following state-of-the-art supervised hashing methods: Supervised Discrete Hashing (SDH) [15], ITQ-CCA [6], Kernel-based Supervised Hashing (KSH) [11], and Binary Reconstructive Embedding (BRE) [14]. For all compared methods, we use the implementations and the suggested parameters provided by the authors.
5.1 Dataset, Evaluation Protocol, and Implementation Note
Dataset. We evaluate and compare the methods on the CIFAR10 and MNIST datasets. The descriptions of these datasets are presented in Sect. 3.1.
Evaluation protocol. Following the literature [6, 11, 15], we report the retrieval results with two metrics: (1) mean Average Precision (mAP) and (2) precision of Hamming radius 2 (precision@2).
Implementation note. The network configuration is the same as that of UH-BDNN except that the final layer is removed. The values of the parameters \(\lambda _1\), \(\lambda _2\), \(\lambda _3\) and \(\lambda _4\) are empirically set by cross validation to \(10^{-3}\), 5, 1 and \(10^{-4}\), respectively. The maximum number of iterations T is empirically set to 5.
Following the settings in ITQ-CCA [6] and SDH [15], all training samples are used in the learning for these two methods. For SH-BDNN, KSH [11] and BRE [14], where label information is leveraged through the pairwise label matrix, we randomly select 3,000 training samples from each class and use them for learning. The ground truths of queries are defined by the class labels from the datasets.
5.2 Retrieval Results
On the CIFAR10 dataset, Fig. 3(a) and Table 4 clearly show that the proposed SH-BDNN outperforms all compared methods by a fair margin at all code lengths, in both mAP and precision@2.
On the MNIST dataset, Fig. 3(b) and Table 4 show that the proposed SH-BDNN significantly outperforms the current state-of-the-art SDH at low code length, i.e., \(L=8\). When L increases, SH-BDNN and SDH [15] achieve similar performance. In comparison to the remaining methods, i.e., KSH [11], ITQ-CCA [6] and BRE [14], SH-BDNN outperforms them by a large margin in both mAP and precision@2.
Comparison with CNN-based hashing methods [32, 33]: we compare our proposed SH-BDNN to the recent CNN-based supervised hashing methods Deep Semantic Ranking Hashing (DSRH) [32] and Deep Regularized Similarity Comparison Hashing (DRSCH) [33]. Note that the focus of [32, 33] is different from ours: in [32, 33], the authors focus on a framework in which the image features and hash codes are jointly learned by combining CNN layers (image feature extraction) and a binary mapping layer into a single model. In contrast, our work focuses only on the binary mapping layer, given some image features. In [32, 33], the binary mapping layer only applies a simple operation, i.e., an approximation of the sgn function (logistic [32], tanh [33]), to the CNN features to obtain approximate binary codes. Our SH-BDNN advances over [32, 33] in how the image features are mapped to the binary codes (which is our main focus). Given the image features (i.e., pre-trained CNN features), we apply multiple transformations to these features and constrain one layer to directly output the binary codes, without involving the sgn function. Furthermore, our learned codes ensure good properties, i.e., independence and balance, while DRSCH [33] does not consider such properties and DSRH [32] only considers the balance of the codes.
We strictly follow the comparison setting in [32, 33]. In [32, 33], when comparing their CNN-based hashing to non-CNN-based hashing methods, the authors use pre-trained CNN features (e.g., AlexNet [26], DeCAF [34]) as input for the other methods. Following that setting, we use AlexNet features [26] as input for SH-BDNN. We set up the experiments on CIFAR10 similarly to [33]: the query set contains 10K images (1K images per class) randomly sampled from the dataset; the remaining 50K images are used as the training set; in the testing step, each query image is searched within the query set itself by applying the leave-one-out procedure.
The comparative results between the proposed SH-BDNN and DSRH [32] / DRSCH [33], presented in Table 5, clearly show that at the same code length, the proposed SH-BDNN outperforms [32, 33] in both mAP and precision@2.
6 Conclusion
We have proposed UH-BDNN and SH-BDNN for unsupervised and supervised hashing. Our network designs constrain one layer to directly produce the binary codes. Our models ensure good properties of the codes: similarity preserving, independence and balance. Extensive experimental results on three benchmark datasets show that the proposed methods compare favorably with the state of the art.
Notes
1. Alternatively, we can constrain the independence and balance on \({\mathbf B}\). This, however, makes the optimization very difficult.
References
Gionis, A., Indyk, P., Motwani, R.: Similarity search in high dimensions via hashing. In: VLDB (1999)
Kulis, B., Grauman, K.: Kernelized locality-sensitive hashing for scalable image search. In: ICCV (2009)
Raginsky, M., Lazebnik, S.: Locality-sensitive binary codes from shift-invariant kernels. In: NIPS (2009)
Kulis, B., Jain, P., Grauman, K.: Fast similarity search for learned metrics. PAMI 31(2), 2143–2157 (2009)
Weiss, Y., Torralba, A., Fergus, R.: Spectral hashing. In: NIPS (2008)
Gong, Y., Lazebnik, S.: Iterative quantization: a procrustean approach to learning binary codes. In: CVPR (2011)
He, K., Wen, F., Sun, J.: K-means hashing: an affinity-preserving quantization method for learning binary compact codes. In: CVPR (2013)
Heo, J.P., Lee, Y., He, J., Chang, S.F., Yoon, S.E.: Spherical hashing. In: CVPR (2012)
Kong, W., Li, W.J.: Isotropic hashing. In: NIPS (2012)
Strecha, C., Bronstein, A.M., Bronstein, M.M., Fua, P.: LDAHash: improved matching with smaller descriptors. PAMI 34(1), 66–78 (2012)
Liu, W., Wang, J., Ji, R., Jiang, Y.G., Chang, S.F.: Supervised hashing with kernels. In: CVPR (2012)
Norouzi, M., Fleet, D.J., Salakhutdinov, R.: Hamming distance metric learning. In: NIPS (2012)
Lin, G., Shen, C., Shi, Q., van den Hengel, A., Suter, D.: Fast supervised hashing with decision trees for high-dimensional data. In: CVPR (2014)
Kulis, B., Darrell, T.: Learning to hash with binary reconstructive embeddings. In: NIPS (2009)
Shen, F., Shen, C., Liu, W., Tao Shen, H.: Supervised discrete hashing. In: CVPR (2015)
Wang, J., Liu, W., Kumar, S., Chang, S.: Learning to hash for indexing big data - a survey. CoRR (2015)
Wang, J., Shen, H.T., Song, J., Ji, J.: Hashing for similarity search: a survey. CoRR (2014)
Grauman, K., Fergus, R.: Learning binary hash codes for large-scale image search. In: Cipolla, R., Battiato, S., Farinella, G.M. (eds.) Machine Learning for Computer Vision. SCI, vol. 411, pp. 55–93. Springer, Heidelberg (2013)
Erin Liong, V., Lu, J., Wang, G., Moulin, P., Zhou, J.: Deep hashing for compact binary codes learning. In: CVPR (2015)
Wang, J., Kumar, S., Chang, S.: Semi-supervised hashing for large-scale search. PAMI 34(12), 2393–2406 (2012)
Salakhutdinov, R., Hinton, G.E.: Semantic hashing. Int. J. Approximate Reasoning 50(7), 969–978 (2009)
Carreira-Perpinan, M.A., Raziperchikolaei, R.: Hashing with binary autoencoders. In: CVPR (2015)
Nocedal, J., Wright, S.J.: Numerical Optimization, 2nd edn. World Scientific, New York (2006). Chap. 17
Liu, D.C., Nocedal, J.: On the limited memory BFGS method for large scale optimization. Math. Program. 45, 503–528 (1989)
Krizhevsky, A.: Learning multiple layers of features from tiny images. Technical report, University of Toronto (2009)
Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding (2014). arXiv preprint: arXiv:1408.5093
Lecun, Y., Cortes, C.: The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/
Jégou, H., Douze, M., Schmid, C.: Product quantization for nearest neighbor search. PAMI 33(1), 117–128 (2011)
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. IJCV 60(2), 91–110 (2004)
Oliva, A., Torralba, A.: Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42(3), 145–175 (2001)
Nguyen, V.A., Lu, J., Do, M.N.: Supervised discriminative hashing for compact binary codes. In: ACM MM (2014)
Zhao, F., Huang, Y., Wang, L., Tan, T.: Deep semantic ranking based hashing for multi-label image retrieval. In: CVPR (2015)
Zhang, R., Lin, L., Zhang, R., Zuo, W., Zhang, L.: Bit-scalable deep hashing with regularized similarity learning for image retrieval and person re-identification. IEEE Trans. Image Process. 24(12), 4766–4779 (2015)
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: a deep convolutional activation feature for generic visual recognition. In: ICML (2014)