Encoding High-Order Statistics in VLAD for Scalable Image Retrieval

Bhowmick, Alexy; Saharia, Sarat; Hazarika, Shyamanta M.

doi:10.1007/978-3-030-34869-4_61

Encoding High-Order Statistics in VLAD for Scalable Image Retrieval

Alexy Bhowmick^14,15,
Sarat Saharia¹⁵ &
Shyamanta M. Hazarika¹⁶

Conference paper
First Online: 25 November 2019

1388 Accesses
1 Citations

Part of the book series: Lecture Notes in Computer Science ((LNIP,volume 11941))

Abstract

We revisit the implicit design choices in the popular vector of locally aggregated descriptors (VLAD), which aggregates the residuals of local image descriptors. Since original VLAD ignores high-order statistics the resultant vector is not discriminative enough. We address this issue by exploiting high-order statistics for gaining complementary information. Our contributions are two-fold: First, we present a novel high-order VLAD (HO-VLAD) with increased discriminative power. Next, we propose a light-weight retrieval framework to demonstrate HO-VLAD’s effectiveness for scalable image retrieval. Systematic experiments on two challenging public databases (INRIA Holidays, UKBench) exhibit a consistent improvement of performance with limited computational costs.

You have full access to this open access chapter, Download conference paper PDF

1 Introduction

Recent years have witnessed an exponential growth of multimedia data on the Web. We are interested in the compact encoding of local descriptors (e.g. SIFT [8]) of images/videos to design a super vector representation, and thereby address the challenge of efficient indexing and retrieval of similar images in large image databases. Given a query image, the goal is to retrieve candidate images depicting the same object (semantic concept) or scene from the database. The representation must be discriminative, sufficiently invariant to transformations (geometric, viewpoints, illuminations, and occlusion). Three important constraints have to be considered jointly: accuracy (quality), efficiency (speed), and memory usage (footprint) of the representation [5].

Recent works on Fisher Vector (FV) [11] and Vector of Locally Aggregated Descriptors (VLAD) [4, 5] are considered significant contributions towards image retrieval and classification. VLAD is known to outperform Bag-of-Words (BoW) [12] and the more sophisticated FV in terms of computational cost and accuracy [5, 7]. One of VLAD’s issues is that it ignores high-order information of the distribution of descriptors [4, 5]. As illustrated in Fig. 1a, two different descriptor distributions may have similar aggregated vectors obtained by original VLAD; but the distribution of the sets of descriptors are dissimilar when observed by fourth-order statistic.

All descriptors do not contribute equally to the residual and an outlier can far outweigh the contribution of many inliers close to the centroid [3] (Fig. 1a). Lower-order statistics (e.g. residuals in VLAD) are not sufficient to capture the nature of the descriptor distribution. High-order statistics ($2^{nd}$ and $3^{rd}$ order) have been used with VLAD to solve this problem in object categorization and action recognition [10]. We introduce HO-VLAD, a novel extension of VLAD which employs high-order information for scalable image retrieval. HO-VLAD captures the nature of typical non-Gaussian descriptor distributions and also takes into account the unequal contribution and effect of outliers.

Our contributions are two-fold:

We present a novel high-order VLAD (HO-VLAD) that leverages on fourth-order statistics for increased discriminative power.
We propose a light-weight framework for scalable image retrieval. The experiments and results demonstrate the proposed method’s effectiveness.

The rest of the paper is as follows: Sect. 2 reviews the related works in brief. Section 3 introduces HO-VLAD encoding and the proposed framework. Section 4 reports the datasets, evaluation protocol, experiments and results. Lastly, Sect. 5 presents the conclusion and future works.

2 Related Works

This section reviews the existing state-of-the-art feature encoding methods namely, BoW, FV, and VLAD.

Bag-of-Words (BoW) [12] is the dominant model for image/video representation. It requires a pre-trained codebook $C = \{\mu _1,...,\mu _k\}$ of k visual words. Typically a k-means quantizer [12] maps high-dimensional local feature descriptors x, (e.g. SIFT [8]), to the nearest centroid:

$$\begin{aligned} \begin{aligned} NN(x): x \mapsto q(x) = \arg \min _{\mu _i \in C} ||x - \mu _i|| \\ \end{aligned} \end{aligned}$$

(1)

Here, NN(x) denotes the nearest -neighborhood mapping function. The BoW is the frequency histogram (i.e. count) of visual words, which is a zeroth-order statistic of the features.

Fisher Vector (FV) [11] extends the BoW by encoding second-order statistics (i.e. mean, variances) of the local descriptor distribution. Given a set of n local descriptors $ \mathcal {X} = \{x_1, x_2,...x_n\} $, $x_j$ $\in \mathbbm {R^{\textit{d}}} $, the distribution is modeled as a Gaussian Mixture Model (GMM) (using Expectation-Maximization (EM)) $ \theta = \{(\mu _k, \sum _k, \pi _k), i=1,2, ... k\} $ fitting the distribution of descriptors, where $ (\mu _k, \sum _k, \pi _k) $ are the mean, covariance, and prior of the k-th component. The GMM ‘soft’ assigns each descriptor $x_i$ to a mode k in the mixture with a posterior probability:

$$\begin{aligned} q_{ik} = \frac{\exp [\frac{-1}{2}(x_i - \mu _k)^T \sum _k^{-1}(x_i - \mu _k)]}{\sum _{t=1}^{k} \exp [\frac{-1}{2}(x_i - \mu _t)^T \sum _k^{-1}(x_i - \mu _t)]} \end{aligned}$$

(2)

For each mode k, the mean and covariance deviation vectors are computed:

$$\begin{aligned} u_{jk} = \frac{1}{N \sqrt{\pi _{k}}} \sum _{i=1}^N q_{ik} \frac{x_{ji} - \mu _{jk}}{\sigma _{jk}}; v_{jk} = \frac{1}{N \sqrt{2\pi _{k}}} \sum _{i=1}^N q_{ik} \bigg [{\Big (\frac{x_{ji} - \mu _{jk}}{\sigma _{jk}}\Big )}^2 - 1\bigg ] \end{aligned}$$

(3)

Finally, the u and v vectors are stacked together to construct the high-dimensional (2dk) FV representation.

$$\begin{aligned} v=[u_1^T, u_2^T, ...u_k^T, v_1^T, v_2^T, ... v_k^T]^T, \qquad u_{i}, v_{i} \in \mathbbm {R^{{d}}} \end{aligned}$$

(4)

Vector of Locally Aggregated Descriptors (VLAD) [4, 5], aggregates a set $ \mathcal {X} = \{x_1, x_2,...x_n\} $, $x_j$ $\in \mathbbm {R^{\textit{d}}} $, of n local d-dimensional descriptors into a fixed-size, compact vector representation v. A codebook $ \mathcal {C}= \{\mu _1, \mu _2, ..., \mu _k\} $, $\mu _i$ $ \in \mathbbm {R^{\textit{d}}} $ is obtained by applying k-means on the set of local descriptors of training samples. Each descriptor $x_j$ $ \in \mathbbm {R^{\textit{d}}} $ is mapped to its nearest centroid in the codebook:

$$\begin{aligned} NN(x): x \mapsto q(x) = \arg \min _{\mu _i \in \mathcal {C}} \Vert {x_j - \mu _i}\Vert ^2 \end{aligned}$$

(5)

Typically $\Vert {.}\Vert ^2$ ($L_{2}$ norm) is used to solve the minimization problem. VLAD encodes the first-order statistic, i.e. residual - the vector difference $ (x_j - \mu _i) $, between a descriptor $x_j$ and a centroid $ \mu _i $. The residuals are aggregated in a d-dimensional sub-vector $v_i$, called Local Difference Vector (LDV).

$$\begin{aligned} v_i = \sum _{x_j:NN (x_j) = \mu _i} ({x_j - \mu _i}) \end{aligned}$$

(6)

The final VLAD encoding v for the query set $ \mathcal {X} $ is obtained by concatenating all sub-vectors $v_i$, $i = 1, ..., k$ (i.e. k LDVs) forming a $ D (= k \times d) $ dimensional image signature (unnormalized VLAD).

$$\begin{aligned} v=[v_1^T, v_2^T, ..., v_k^T]^T, \qquad v_{i} \in \mathbbm {R^{{d}}} \end{aligned}$$

(7)

As final steps, VLAD vector v is first Power-normalized, then L2-normalized.

3 Encoding High-Order Statistics in VLAD

We introduce an effective method to augment original VLAD with high-order statistical information. As shown in Fig. 1, VLAD’s discriminative power suffers because (a) it ignores high-order statistics, and (b) due to the effect of outliers. VLAD residuals describe the distribution but they are not sufficient to capture the nature of the distribution. This is because low-level descriptors (e.g. SIFT [8]) are not typically Gaussian in real life [10]. Consequently, we propose a high-order VLAD (HO-VLAD) which encodes fourth-order statistics i.e kurtosis of a distribution, to exploit complementary information.

Kurtosis represents the ‘peakedness’ or convexity of a probability distribution [2]. It is a measure of the outliers of a distribution. High kurtosis ($\textit{leptokurtic},\,{>}3$) indicates the data is heavy-tailed and there is profusion of outliers. Low kurtosis ($\textit{platykurtic},\,{<}3$) indicates lack of outliers. We design the fourth-order super vector as:

$$\begin{aligned} v_{i}^k = \frac{{\frac{1}{N}\sum \limits _{j=1}^N ({x_j - \mu _i})}^4}{{\frac{1}{N}\sum \limits _{j=1}^N ({x_j - \mu _i})}^4} \end{aligned}$$

(8)

where, $v_{i}^k$ indicates the kurtosis of the i-th cluster with centroid $\mu _i$. After intra-normalization [1] separately, the residual v and kurtosis $v^k$ vectors are concatenated to produce a (kd + k)-dimensional vector, i.e. the final HO-VLAD representation (d = 128 for SIFT). We consider intra-normalization due to its good performance [1, 10]. Our method avoids soft weight computation and accommodates higher-order statistics in comparison with original VLAD.

HO-VLAD Algorithm. In keeping with the spirit of original VLAD, we incorporate higher-order information and formulate the HO-VLAD computation algorithm as stated below (Algorithm 1).

Retrieval Framework. The light-weight retrieval framework (Fig. 2) consists of four main components: (a) Local feature (SIFT [8]) extraction (b) Codebook generation (independent Flickr60K dataset^{Footnote 1}) (c) Proposed HO-VLAD encoding, normalization and dimensionality reduction (PCA) (d) Indexing and nearest neighbor search with k-d trees to produce ranked retrieval results.

4 Experiments and Evaluations

We verify the effectiveness of our method based on experiments on benchmark datasets, and compare with state-of-the-art feature encoding methods.

4.1 Benchmark Datasets and Descriptors

For local feature extraction, we have employed the experimental setup similar to [5] and the feature extraction library^{Footnote 2}. More specifically, the regions of interest are extracted utilizing Hessian affine-invariant region detectors and described by the SIFT descriptor [8]. An independent dataset Flickr60k (67714 images, 140 M descriptors) was employed to train the codebook off-line for both Holidays and the UKB dataset. The evaluation is performed on two standard and publicly available image retrieval benchmarks:

INRIA Holidays: (See footnote 1) (1491 images, 4.456 M descriptors, 500 queries) [6] consists of personal holiday photos of 500 groups each representing a distinct scene. Mean Average Precision (mAP) is employed to evaluate retrieval accuracy, with the query removed (leave-one-out fashion) from the ranking list.

University of Kentucky Benchmark (UKB):^{Footnote 3} [9] is a collection of 10,200 images corresponding to 2,550 distinct classes and scenes of diverse categories. Every class is composed of 4 images; a query image and three groundtruth images. We use N-S score (ranging from 0–4) for retrieval accuracy.

4.2 Performance Evaluations and Analysis

Comparison with State-of-the-Art: Table 1 compares the results of our approach with the results from literature, in particular retrieval accuracies of BoW, FV, and VLAD, on the INRIA Holidays and UKB datasets. As can be seen from this table, an mAP of 0.611 is achieved on Holidays and a N-S score of 3.32 is obtained on UKBench with regular SIFT. The improvement provided by HO-VLAD over original VLAD is +4.6% on Holidays and +0.14 N-S score on UKB. For the sake of consistency, k = 64 is used in all experiments. The same SIFT descriptors as in [5] ensure a fair comparison. In our experiments, we found $\alpha =0.5$ remains a good choice for normalization parameter and gives optimal results. The scheme avoids multiple vocabularies and soft assignments.

Table 1. Comparison of proposed image representation HO-VLAD with state-of-the-art (mAP performance and N-S score).

Full size table

Memory Footprint: Using floating point numbers for each element, each number requires 4 bytes of memory. For a VLAD vector describing an image I, the total memory usage with k = 64 and 128-D SIFT descriptors is 64 * 128 * 4 = 32,768 Bytes or 32 KB. HO-VLAD vector for the same parameters requires ((64 * 128) + 64) * 4 = 33,024 Bytes or 33 KB to describe the same image. We believe with this limited computation cost, a significant precision is obtained.

Table 2. Comparative performance of FV, VLAD, and HO-VLAD on Holidays benchmark. Performance is given for full-dimensional D and PCA-reduced D$^\prime $ descriptor.

Full size table

Dimension Reduction: Table 2 shows the relative improvement of HO-VLAD is comparatively reduced when dimension reduction with PCA is applied. The gain shrinks significantly and one can observe that the dimension reduction reduces the gap between the different methods. For a finer vocabulary (k = 256), HO-VLAD attains 71.2% (Fig. 3a). It can be seen that even at low dimensions (D$^\prime $ = 16), FV with k = 256 maintains a competitive accuracy (50.6%)(Fig. 3b).

Figure 3a shows the mAP score on Holidays as a function of k for full-sized VLAD descriptor. It is evident that with more number of centroids, the retrieval performance improves. For k = 2048, we achieve a mAP of 74.3%. However, for larger values of k, the cost of centroid assignment increases, hence we limit our analysis to k = 2048.

5 Conclusion

In this paper, we present HO-VLAD, a novel extension of the popular VLAD for scalable image retrieval. The proposed method encodes high-order statistics for greater discriminative encoding and considers effect of outliers. The tests on the image retrieval framework with a small-size codebook shows promising results on the benchmark INRIA Holidays and UKB datasets. Future work will examine comparisons with more recent encoding techniques, retrieval performance in presence of distractor images, and generating compact binary codes.

Notes

References

Arandjelovic, R., Zisserman, A.: All about VLAD. In: 2013 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1578–1585 (2013)
Google Scholar
Balanda, K.P., MacGillivray, H.L.: Kurtosis: a critical review. Am. Stat. 42(2), 111–119 (1988)
Google Scholar
Husain, S.S., Bober, M.: Improving large-scale image retrieval through robust aggregation of local descriptors. IEEE Trans. Pattern Anal. Mach. Intell. 39(9), 1783–1796 (2016)
Article Google Scholar
Jegou, H., Douze, M., Schmid, C., Prez, P.: Aggregating local descriptors into a compact image representation. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 3304–3311, June 2010
Google Scholar
Jegou, H., Perronnin, F., Douze, M., Snchez, J., Prez, P., Schmid, C.: Aggregating local image descriptors into compact codes. IEEE Trans. Pattern Anal. Mach. Intell. 34(9), 1704–1716 (2012)
Article Google Scholar
Jegou, H., Douze, M., Schmid, C.: Improving bag-of-features for large scale image search. Int. J. Comput. Vis. 87(3), 316–336 (2010)
Article Google Scholar
Liu, Z., Houqiang, L., Wengang, Z., Ting, R., Qi, T.: Making residual vector distribution uniform for distinctive image representation. IEEE Trans. Circ. Syst. Video Technol. 26(2), 375–384 (2016)
Article Google Scholar
Lowe, D.G.: Distinctive image features from scale-invariant keypoints. Int. J. Comput. Vis. 60(2), 91–110 (2004)
Article Google Scholar
Nister, D., Stewenius, H.: Scalable recognition with a vocabulary tree. In: 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2006), vol. 2, pp. 2161–2168 (2006)
Google Scholar
Peng, X., Wang, L., Qiao, Y., Peng, Q.: Boosting VLAD with supervised dictionary learning and high-order statistics. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds.) ECCV 2014. LNCS, vol. 8691, pp. 660–674. Springer, Cham (2014). https://doi.org/10.1007/978-3-319-10578-9_43
Chapter Google Scholar
Perronnin, F., Dance, C.: Fisher kernels on visual vocabularies for image categorization. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8, June 2007
Google Scholar
Sivic, J., Zisserman, A.: Video Google: a text retrieval approach to object matching in videos. In: Proceedings Ninth IEEE International Conference on Computer Vision, vol. 2, pp. 1470–1477, October 2003
Google Scholar

Download references

Author information

Authors and Affiliations

School of Technology, Assam Don Bosco University, Guwahati, India
Alexy Bhowmick
Department of Computer Science and Engineering, Tezpur University, Tezpur, India
Alexy Bhowmick & Sarat Saharia
Department of Mechanical Engineering, IIT Guwahati, Guwahati, India
Shyamanta M. Hazarika

Authors

Alexy Bhowmick
View author publications
You can also search for this author in PubMed Google Scholar
Sarat Saharia
View author publications
You can also search for this author in PubMed Google Scholar
Shyamanta M. Hazarika
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Alexy Bhowmick .

Editor information

Editors and Affiliations

Tezpur University, Tezpur, India
Bhabesh Deka
Indian Statistical Institute, Kolkata, India
Pradipta Maji
Indian Statistical Institute, Kolkata, India
Sushmita Mitra
Tezpur University, Tezpur, India
Dhruba Kumar Bhattacharyya
Indian Institute of Technology Guwahati, Guwahati, India
Prabin Kumar Bora
Indian Statistical Institute, Kolkata, India
Sankar Kumar Pal

Rights and permissions

Reprints and permissions

Copyright information

About this paper

Cite this paper

Bhowmick, A., Saharia, S., Hazarika, S.M. (2019). Encoding High-Order Statistics in VLAD for Scalable Image Retrieval. In: Deka, B., Maji, P., Mitra, S., Bhattacharyya, D., Bora, P., Pal, S. (eds) Pattern Recognition and Machine Intelligence. PReMI 2019. Lecture Notes in Computer Science(), vol 11941. Springer, Cham. https://doi.org/10.1007/978-3-030-34869-4_61

Download citation

DOI: https://doi.org/10.1007/978-3-030-34869-4_61
Published: 25 November 2019
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-34868-7
Online ISBN: 978-3-030-34869-4
eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics

Societies and partnerships

The International Association for Pattern Recognition (opens in a new tab)