1 Introduction

Recent years have witnessed exponential growth of multimedia data on the Web. We are interested in compactly encoding the local descriptors (e.g. SIFT [8]) of images/videos into a super vector representation, and thereby addressing the challenge of efficient indexing and retrieval of similar images in large image databases. Given a query image, the goal is to retrieve candidate images depicting the same object (semantic concept) or scene from the database. The representation must be discriminative and sufficiently invariant to transformations (geometric changes, viewpoint, illumination, and occlusion). Three important constraints have to be considered jointly: accuracy (quality), efficiency (speed), and memory usage (footprint) of the representation [5].

Fig. 1. (a) Two samples n1 and n2 (assigned to centroids \(\mu _1\) and \(\mu _2\)) with dissimilar descriptor distributions but similar aggregated vectors (LDVs) by VLAD. (b) Discrimination is enhanced when distributional shape is encoded.

Recent works on the Fisher Vector (FV) [11] and the Vector of Locally Aggregated Descriptors (VLAD) [4, 5] are considered significant contributions towards image retrieval and classification. VLAD is known to outperform Bag-of-Words (BoW) [12] and the more sophisticated FV in terms of computational cost and accuracy [5, 7]. One of VLAD’s issues is that it ignores high-order information about the distribution of descriptors [4, 5]. As illustrated in Fig. 1a, two different descriptor distributions may yield similar aggregated vectors under the original VLAD, even though the distributions are clearly dissimilar when observed through a fourth-order statistic.

Not all descriptors contribute equally to the residual: an outlier can far outweigh the contribution of many inliers close to the centroid [3] (Fig. 1a). Lower-order statistics (e.g. residuals in VLAD) are therefore not sufficient to capture the nature of the descriptor distribution. High-order statistics (\(2^{nd}\) and \(3^{rd}\) order) have been used with VLAD to address this problem in object categorization and action recognition [10]. We introduce HO-VLAD, a novel extension of VLAD which employs high-order information for scalable image retrieval. HO-VLAD captures the nature of typical non-Gaussian descriptor distributions and also takes into account the unequal contributions of descriptors and the effect of outliers.

Our contributions are two-fold:

  • We present a novel high-order VLAD (HO-VLAD) that leverages fourth-order statistics for increased discriminative power.

  • We propose a light-weight framework for scalable image retrieval. The experiments and results demonstrate the proposed method’s effectiveness.

The rest of the paper is organized as follows. Section 2 briefly reviews related work. Section 3 introduces the HO-VLAD encoding and the proposed framework. Section 4 reports the datasets, evaluation protocol, experiments and results. Lastly, Sect. 5 presents the conclusion and future work.

2 Related Works

This section reviews the existing state-of-the-art feature encoding methods, namely BoW, FV, and VLAD.

Bag-of-Words (BoW) [12] is the dominant model for image/video representation. It requires a pre-trained codebook \(C = \{\mu _1,...,\mu _k\}\) of k visual words. Typically, a k-means quantizer [12] maps each high-dimensional local feature descriptor x (e.g. SIFT [8]) to its nearest centroid:

$$\begin{aligned} NN(x): x \mapsto q(x) = \arg \min _{\mu _i \in C} \Vert x - \mu _i \Vert \end{aligned}$$
(1)

Here, NN(x) denotes the nearest-neighbor mapping function. The BoW representation is the frequency histogram (i.e. count) of visual words, which is a zeroth-order statistic of the features.
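The quantization of Eq. (1) and the resulting histogram can be sketched in a few lines. This is a minimal illustration that assumes the codebook has already been trained; the function names `nearest_centroid` and `bow_histogram` and the toy 2-D data are hypothetical and only serve the example.

```python
import math

def nearest_centroid(x, codebook):
    """Index of the codebook centroid closest to x (Euclidean distance), Eq. (1)."""
    return min(range(len(codebook)), key=lambda i: math.dist(x, codebook[i]))

def bow_histogram(descriptors, codebook):
    """Count how many descriptors are assigned to each visual word."""
    hist = [0] * len(codebook)
    for x in descriptors:
        hist[nearest_centroid(x, codebook)] += 1
    return hist

# Toy data (assumed): two visual words and three 2-D descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.5), (9.0, 11.0), (0.2, -0.3)]
print(bow_histogram(descriptors, codebook))  # [2, 1]
```

The histogram only records how many descriptors fall on each word, which is why it carries no information about where the descriptors lie relative to the centroid.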

Fisher Vector (FV) [11] extends the BoW by encoding statistics up to second order (i.e. means and variances) of the local descriptor distribution. Given a set of n local descriptors \( \mathcal {X} = \{x_1, x_2,...x_n\} \), \(x_j\) \(\in \mathbbm {R^{\textit{d}}} \), the distribution of descriptors is modeled as a Gaussian Mixture Model (GMM), fitted with Expectation-Maximization (EM), with parameters \( \theta = \{(\mu _t, \Sigma _t, \pi _t), t=1,2, ... k\} \), where \( (\mu _t, \Sigma _t, \pi _t) \) are the mean, covariance, and prior of the t-th component. The GMM ‘soft’ assigns each descriptor \(x_i\) to a mode k in the mixture with a posterior probability:

$$\begin{aligned} q_{ik} = \frac{\exp [-\frac{1}{2}(x_i - \mu _k)^T \Sigma _k^{-1}(x_i - \mu _k)]}{\sum _{t=1}^{k} \exp [-\frac{1}{2}(x_i - \mu _t)^T \Sigma _t^{-1}(x_i - \mu _t)]} \end{aligned}$$
(2)

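For each mode k and each dimension j = 1, ..., d, the mean and covariance deviation vectors are computed, where \(x_{ji}\) is the j-th component of descriptor \(x_i\), \(\mu _{jk}\) and \(\sigma _{jk}^2\) are the j-th entries of the mean and of the (diagonal) covariance of mode k, and n is the number of descriptors:

$$\begin{aligned} u_{jk} = \frac{1}{n \sqrt{\pi _{k}}} \sum _{i=1}^n q_{ik} \frac{x_{ji} - \mu _{jk}}{\sigma _{jk}}; \qquad v_{jk} = \frac{1}{n \sqrt{2\pi _{k}}} \sum _{i=1}^n q_{ik} \bigg [{\Big (\frac{x_{ji} - \mu _{jk}}{\sigma _{jk}}\Big )}^2 - 1\bigg ] \end{aligned}$$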
(3)

Finally, the u and v vectors are stacked together to construct the high-dimensional (2dk) FV representation.

$$\begin{aligned} v=[u_1^T, u_2^T, ...u_k^T, v_1^T, v_2^T, ... v_k^T]^T, \qquad u_{i}, v_{i} \in \mathbbm {R^{{d}}} \end{aligned}$$
(4)
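Equations (2) to (4) can be put together in a short sketch. It assumes an already fitted GMM with diagonal covariances, passed in as plain lists; the function name `fisher_vector` and its parameters are hypothetical, and the soft assignment follows Eq. (2) as printed (without the prior or the normalization constant of each Gaussian).

```python
import math

def fisher_vector(descriptors, means, variances, priors):
    """Unnormalized Fisher Vector of length 2*d*K, following Eqs. (2)-(4).

    means[k], variances[k]: mean and diagonal covariance of mode k (assumed
    diagonal); priors[k]: mixture weight of mode k.
    """
    n, K, d = len(descriptors), len(means), len(means[0])
    u = [[0.0] * d for _ in range(K)]
    v = [[0.0] * d for _ in range(K)]
    for x in descriptors:
        # Eq. (2): exponent of each mode, shifted by the maximum for stability.
        exps = [-0.5 * sum((x[j] - means[k][j]) ** 2 / variances[k][j]
                           for j in range(d)) for k in range(K)]
        top = max(exps)
        weights = [math.exp(e - top) for e in exps]
        total = sum(weights)
        for k in range(K):
            q = weights[k] / total
            for j in range(d):
                z = (x[j] - means[k][j]) / math.sqrt(variances[k][j])
                u[k][j] += q * z                # mean deviation, Eq. (3)
                v[k][j] += q * (z * z - 1.0)    # covariance deviation, Eq. (3)
    for k in range(K):
        cu = 1.0 / (n * math.sqrt(priors[k]))
        cv = 1.0 / (n * math.sqrt(2.0 * priors[k]))
        u[k] = [cu * a for a in u[k]]
        v[k] = [cv * a for a in v[k]]
    # Eq. (4): stack all u_k, then all v_k.
    return [a for sub in u for a in sub] + [a for sub in v for a in sub]

# Toy usage (assumed values): one descriptor, two modes in 2-D.
fv = fisher_vector([(1.0, 0.0)],
                   means=[(0.0, 0.0), (2.0, 0.0)],
                   variances=[(1.0, 1.0), (1.0, 1.0)],
                   priors=[0.5, 0.5])
print(len(fv))  # 8
```

Because every descriptor contributes to every mode, the cost per descriptor grows with the number of modes, which is one reason hard-assignment schemes such as VLAD are cheaper to compute.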

Vector of Locally Aggregated Descriptors (VLAD) [4, 5] aggregates a set \( \mathcal {X} = \{x_1, x_2,...x_n\} \), \(x_j\) \(\in \mathbbm {R^{\textit{d}}} \), of n local d-dimensional descriptors into a fixed-size, compact vector representation v. A codebook \( \mathcal {C}= \{\mu _1, \mu _2, ..., \mu _k\} \), \(\mu _i\) \( \in \mathbbm {R^{\textit{d}}} \), is obtained by applying k-means to the local descriptors of training samples. Each descriptor \(x_j\) \( \in \mathbbm {R^{\textit{d}}} \) is mapped to its nearest centroid in the codebook:

$$\begin{aligned} NN(x_j): x_j \mapsto q(x_j) = \arg \min _{\mu _i \in \mathcal {C}} \Vert {x_j - \mu _i}\Vert ^2 \end{aligned}$$
(5)

Typically the squared Euclidean (\(L_{2}\)) distance \(\Vert {.}\Vert ^2\) is used in the minimization. VLAD encodes a first-order statistic, the residual, i.e. the vector difference \( (x_j - \mu _i) \) between a descriptor \(x_j\) and its assigned centroid \( \mu _i \). The residuals are aggregated into a d-dimensional sub-vector \(v_i\), called the Local Difference Vector (LDV):

$$\begin{aligned} v_i = \sum _{x_j:NN (x_j) = \mu _i} ({x_j - \mu _i}) \end{aligned}$$
(6)

The final VLAD encoding v for the query set \( \mathcal {X} \) is obtained by concatenating all sub-vectors \(v_i\), \(i = 1, ..., k\) (i.e. k LDVs) forming a \( D (= k \times d) \) dimensional image signature (unnormalized VLAD).

$$\begin{aligned} v=[v_1^T, v_2^T, ..., v_k^T]^T, \qquad v_{i} \in \mathbbm {R^{{d}}} \end{aligned}$$
(7)

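As final steps, the VLAD vector v is first power-normalized, i.e. each scalar component z of v is replaced by \(\mathrm {sign}(z)\,|z|^{\alpha }\) with \(0 \le \alpha \le 1\), and then L2-normalized.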
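The VLAD pipeline of Eqs. (5) to (7), followed by the two normalization steps, can be sketched as follows. The sketch assumes a pre-trained codebook and the usual signed power law (each component z becomes sign(z)|z|^alpha) for power normalization; the function name `vlad` and the toy data are hypothetical.

```python
import math

def vlad(descriptors, codebook, alpha=0.5):
    """VLAD signature of length k*d: residual sums, power norm, then L2 norm."""
    k, d = len(codebook), len(codebook[0])
    ldv = [[0.0] * d for _ in range(k)]          # one LDV per centroid
    for x in descriptors:
        # Eq. (5): hard assignment to the nearest centroid.
        i = min(range(k), key=lambda c: math.dist(x, codebook[c]))
        # Eq. (6): accumulate the residual x - mu_i.
        for t in range(d):
            ldv[i][t] += x[t] - codebook[i][t]
    v = [z for sub in ldv for z in sub]          # Eq. (7): concatenation
    v = [math.copysign(abs(z) ** alpha, z) for z in v]   # power normalization
    norm = math.sqrt(sum(z * z for z in v))
    return [z / norm for z in v] if norm > 0 else v      # L2 normalization

# Toy data (assumed): two centroids and three 2-D descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (3.0, 0.0), (10.0, 12.0)]
signature = vlad(descriptors, codebook)
print(len(signature))  # 4
```

In this toy case the two descriptors near the first centroid sum to the residual (4, 0); a single descriptor at (4, 0) would give exactly the same LDV, which is the ambiguity that higher-order statistics are meant to resolve.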

3 Encoding High-Order Statistics in VLAD

We introduce an effective method to augment the original VLAD with high-order statistical information. As shown in Fig. 1, VLAD’s discriminative power suffers because (a) it ignores high-order statistics, and (b) it is sensitive to outliers. VLAD residuals describe the distribution, but they are not sufficient to capture its nature, because low-level descriptors (e.g. SIFT [8]) are typically not Gaussian in practice [10]. Consequently, we propose a high-order VLAD (HO-VLAD) which encodes a fourth-order statistic, i.e. the kurtosis of the distribution, to exploit complementary information.

Kurtosis represents the ‘peakedness’ or convexity of a probability distribution [2]. It is a measure of the outliers of a distribution. High kurtosis (\(\textit{leptokurtic},\,{>}3\)) indicates that the data are heavy-tailed and that there is a profusion of outliers. Low kurtosis (\(\textit{platykurtic},\,{<}3\)) indicates a lack of outliers. We design the fourth-order super vector as:

$$\begin{aligned} v_{i}^k = \frac{\frac{1}{N}\sum \limits _{j=1}^N ({x_j - \mu _i})^4}{\Big (\frac{1}{N}\sum \limits _{j=1}^N ({x_j - \mu _i})^2\Big )^2} \end{aligned}$$
(8)

where \(v_{i}^k\) denotes the kurtosis of the i-th cluster with centroid \(\mu _i\). The residual vector v and the kurtosis vector \(v^k\) are intra-normalized [1] separately and then concatenated to produce a (kd + k)-dimensional vector, i.e. the final HO-VLAD representation (d = 128 for SIFT). We adopt intra-normalization because of its good performance [1, 10]. Compared with the original VLAD, our method accommodates higher-order statistics while still avoiding soft weight computation.
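A rough sketch of how a (kd + k)-dimensional vector of this kind could be assembled is given below. Several details are assumptions of the sketch rather than statements of the method: kurtosis is taken as the fourth moment divided by the squared second moment of the scalar deviation \(\Vert x_j - \mu _i\Vert \), empty or degenerate clusters receive a kurtosis of 0, each LDV is L2-normalized on its own (intra-normalization), and the k kurtosis values are L2-normalized as one block. The names `ho_vlad` and `l2_normalize` are hypothetical.

```python
import math

def l2_normalize(v):
    """Return v scaled to unit L2 norm (unchanged if it is all zeros)."""
    norm = math.sqrt(sum(a * a for a in v))
    return [a / norm for a in v] if norm > 0 else list(v)

def ho_vlad(descriptors, codebook):
    """Concatenate intra-normalized LDVs (k*d values) with k kurtosis values."""
    k, d = len(codebook), len(codebook[0])
    ldv = [[0.0] * d for _ in range(k)]
    m2 = [0.0] * k       # per-cluster sum of squared deviations
    m4 = [0.0] * k       # per-cluster sum of fourth-power deviations
    count = [0] * k
    for x in descriptors:
        i = min(range(k), key=lambda c: math.dist(x, codebook[c]))
        r2 = 0.0
        for t in range(d):
            diff = x[t] - codebook[i][t]
            ldv[i][t] += diff
            r2 += diff * diff
        m2[i] += r2
        m4[i] += r2 * r2
        count[i] += 1
    kurt = []
    for i in range(k):
        if count[i] == 0 or m2[i] == 0.0:
            kurt.append(0.0)             # assumption: no evidence, no kurtosis
        else:
            n = count[i]
            kurt.append((m4[i] / n) / (m2[i] / n) ** 2)
    residual_part = [a for sub in ldv for a in l2_normalize(sub)]
    return residual_part + l2_normalize(kurt)

# Toy data (assumed): two centroids and three 2-D descriptors.
codebook = [(0.0, 0.0), (10.0, 10.0)]
descriptors = [(1.0, 0.0), (3.0, 0.0), (10.0, 12.0)]
h = ho_vlad(descriptors, codebook)
print(len(h))  # 6, i.e. k*d + k
```

With these toy values the first cluster has raw kurtosis 1.64 and the second 1.0, so the two extra components differ even where the normalized LDVs alone would not separate the clusters' shapes.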

Fig. 2. The scalable image retrieval framework/pipeline for HO-VLAD computation and retrieval performance evaluation.

HO-VLAD Algorithm. In keeping with the spirit of original VLAD, we incorporate higher-order information and formulate the HO-VLAD computation algorithm as stated below (Algorithm 1).

Retrieval Framework. The light-weight retrieval framework (Fig. 2) consists of four main components: (a) local feature (SIFT [8]) extraction; (b) codebook generation (on the independent Flickr60K dataset, footnote 1); (c) the proposed HO-VLAD encoding, normalization and dimensionality reduction (PCA); and (d) indexing and nearest neighbor search with k-d trees to produce ranked retrieval results.

Algorithm 1. HO-VLAD computation.

4 Experiments and Evaluations

We verify the effectiveness of our method based on experiments on benchmark datasets, and compare with state-of-the-art feature encoding methods.

4.1 Benchmark Datasets and Descriptors

For local feature extraction, we employ an experimental setup similar to [5] and the feature extraction library (footnote 2). More specifically, the regions of interest are extracted with Hessian affine-invariant region detectors and described by the SIFT descriptor [8]. An independent dataset, Flickr60K (67,714 images, 140 M descriptors), was employed to train the codebook off-line for both the Holidays and UKB datasets. The evaluation is performed on two standard and publicly available image retrieval benchmarks:

INRIA Holidays (see footnote 1) [6] (1,491 images, 4.456 M descriptors, 500 queries) consists of personal holiday photos in 500 groups, each representing a distinct scene. Mean Average Precision (mAP) is employed to evaluate retrieval accuracy, with the query removed from the ranked list (leave-one-out fashion).

University of Kentucky Benchmark (UKB) (footnote 3) [9] is a collection of 10,200 images corresponding to 2,550 distinct classes and scenes of diverse categories. Every class is composed of 4 images: a query image and three ground-truth images. We use the N-S score (ranging from 0 to 4) to measure retrieval accuracy.
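Both evaluation measures are simple to compute from ranked result lists, as the sketch below illustrates. It makes two assumptions that may differ from the official evaluation scripts of the benchmarks: average precision is the plain non-interpolated variant, and the N-S score counts the query itself when it appears among the top four results. All function names and the toy rankings are hypothetical.

```python
def average_precision(ranked_ids, relevant_ids):
    """Non-interpolated AP: mean of precision values at each relevant hit."""
    relevant = set(relevant_ids)
    hits, precision_sum = 0, 0.0
    for rank, image_id in enumerate(ranked_ids, start=1):
        if image_id in relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant) if relevant else 0.0

def mean_average_precision(results, ground_truth):
    """mAP over all queries, removing the query from its own ranked list."""
    aps = []
    for query, ranking in results.items():
        ranking = [i for i in ranking if i != query]   # leave-one-out
        aps.append(average_precision(ranking, ground_truth[query]))
    return sum(aps) / len(aps)

def ns_score(rankings, class_of):
    """Average number of same-class images among the top 4 results (0 to 4)."""
    total = 0
    for query, ranking in rankings.items():
        total += sum(1 for i in ranking[:4] if class_of[i] == class_of[query])
    return total / len(rankings)

# Toy Holidays-style example (assumed data).
results = {"q1": ["q1", "a", "x", "b"], "q2": ["q2", "y", "c"]}
ground_truth = {"q1": ["a", "b"], "q2": ["c"]}
print(mean_average_precision(results, ground_truth))

# Toy UKB-style example (assumed data): two classes of four images each.
class_of = {0: "A", 1: "A", 2: "A", 3: "A", 4: "B", 5: "B", 6: "B", 7: "B"}
rankings = {0: [0, 1, 4, 2, 3], 4: [4, 5, 6, 7, 0]}
print(ns_score(rankings, class_of))  # 3.5
```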

4.2 Performance Evaluations and Analysis

Comparison with State-of-the-Art: Table 1 compares the results of our approach with results from the literature, in particular the retrieval accuracies of BoW, FV, and VLAD, on the INRIA Holidays and UKB datasets. As can be seen from this table, an mAP of 0.611 is achieved on Holidays and an N-S score of 3.32 is obtained on UKB with regular SIFT. The improvement provided by HO-VLAD over the original VLAD is +4.6% on Holidays and +0.14 N-S score on UKB. For the sake of consistency, k = 64 is used in all experiments. Using the same SIFT descriptors as in [5] ensures a fair comparison. In our experiments, we found that \(\alpha =0.5\) remains a good choice for the normalization parameter and gives the best results. The scheme avoids multiple vocabularies and soft assignments.

Table 1. Comparison of proposed image representation HO-VLAD with state-of-the-art (mAP performance and N-S score).

Memory Footprint: Each element is stored as a single-precision floating point number, which requires 4 bytes of memory. For a VLAD vector describing an image I, the total memory usage with k = 64 and 128-D SIFT descriptors is 64 * 128 * 4 = 32,768 bytes, i.e. 32 KB. The HO-VLAD vector for the same parameters requires ((64 * 128) + 64) * 4 = 33,024 bytes, i.e. about 32.25 KB, to describe the same image. We believe that with this limited additional cost (256 bytes per image), a significant gain in precision is obtained.
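The footprint arithmetic can be checked with a few lines of code. The helper `footprint_bytes` and its parameter names are hypothetical; it simply assumes 4 bytes per stored value and one extra value per cluster for HO-VLAD, as in the text.

```python
def footprint_bytes(k, d, extra_per_cluster=0, bytes_per_value=4):
    """Bytes needed to store k*d residual values plus optional per-cluster extras."""
    return (k * d + k * extra_per_cluster) * bytes_per_value

vlad_bytes = footprint_bytes(64, 128)                          # 32768
ho_vlad_bytes = footprint_bytes(64, 128, extra_per_cluster=1)  # 33024
overhead = ho_vlad_bytes - vlad_bytes                          # 256

print(vlad_bytes / 1024, ho_vlad_bytes / 1024)   # 32.0 32.25
print(f"relative overhead: {overhead / vlad_bytes:.2%}")
```

The relative overhead is k / (k * d) = 1 / d, so for 128-D SIFT it stays below 1% whatever the vocabulary size.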

Table 2. Comparative performance of FV, VLAD, and HO-VLAD on Holidays benchmark. Performance is given for full-dimensional D and PCA-reduced D\(^\prime \) descriptor.

Dimension Reduction: Table 2 shows that the relative improvement of HO-VLAD is reduced when dimension reduction with PCA is applied. The gain shrinks significantly, and one can observe that dimension reduction narrows the gap between the different methods. For a finer vocabulary (k = 256), HO-VLAD attains 71.2% (Fig. 3a). It can also be seen that even at low dimensions (D\(^\prime \) = 16), FV with k = 256 maintains a competitive accuracy (50.6%) (Fig. 3b).

Fig. 3. Impact of (a) vocabulary size k (left) and (b) dimensionality reduction (right) on accuracy (mAP on Holidays). Parameters: \(\alpha = 0.5\).

Figure 3a shows the mAP score on Holidays as a function of k for the full-sized VLAD descriptor. It is evident that retrieval performance improves with the number of centroids. For k = 2048, we achieve an mAP of 74.3%. However, the cost of centroid assignment increases for larger values of k, hence we limit our analysis to values of k up to 2048.

5 Conclusion

In this paper, we presented HO-VLAD, a novel extension of the popular VLAD for scalable image retrieval. The proposed method encodes high-order statistics for a more discriminative encoding and considers the effect of outliers. Tests on the image retrieval framework with a small codebook show promising results on the benchmark INRIA Holidays and UKB datasets. Future work will examine comparisons with more recent encoding techniques, retrieval performance in the presence of distractor images, and the generation of compact binary codes.