
A Comprehensive Study on VLAD

Neural Processing Letters

Abstract

Recently, the vector of locally aggregated descriptors (VLAD) has proven highly effective in diverse computer vision tasks, including image retrieval, scene classification, and action recognition. Its success stems from its powerful representation ability and computational efficiency. However, its theoretical foundation remains unclear, as does its connection to basic yet important algorithms, e.g., the bag-of-words model and match kernels. It is also unclear how its performance is affected by parameter configurations, e.g., normalization and pooling, which are widely used in state-of-the-art algorithms based on local features. In this paper, aiming to realize the full capacity of VLAD, we conduct a comprehensive and in-depth study from the perspectives of both theoretical analysis and experimental practice. As a theoretical contribution, we provide a new formulation of VLAD via match kernels, which connects VLAD with existing important encoding methods based on local features. As a contribution to the practical use of VLAD, we comprehensively investigate the roles and effects of two widely used operations in local feature encoding: normalization and pooling. To the best of our knowledge, our work provides the first comprehensive study on VLAD, which will not only enable a full understanding of it but also provide important guidance for state-of-the-art algorithms based on local features. We have conducted extensive experiments on three benchmark datasets, Scene-15, Caltech 101, and PPMI, for both image classification and action recognition.
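
For readers unfamiliar with the encoding discussed in the abstract, the following is a minimal sketch of the commonly used VLAD pipeline: each local descriptor is assigned to its nearest codebook center, the residuals are sum-pooled per center, and the concatenated vector is normalized. It is not the match-kernel formulation proposed in the paper. The function name `vlad_encode`, the `power` parameter, and the choice of power normalization followed by L2 normalization are illustrative assumptions, one of several configurations used in the literature.

```python
import numpy as np

def vlad_encode(descriptors, centers, power=0.5):
    """Sketch of standard VLAD encoding (hypothetical helper, not the paper's code).

    descriptors: (n, d) array of local features from one image.
    centers:     (k, d) array of codebook centers (e.g. from k-means).
    power:       exponent for signed power normalization (assumed default 0.5).
    """
    descriptors = np.asarray(descriptors, dtype=float)
    centers = np.asarray(centers, dtype=float)
    k, d = centers.shape

    # Squared Euclidean distance from every descriptor to every center: (n, k).
    diffs = descriptors[:, None, :] - centers[None, :, :]
    nearest = (diffs ** 2).sum(axis=2).argmin(axis=1)

    # Sum pooling of residuals (descriptor minus its assigned center).
    vlad = np.zeros((k, d))
    for i in range(k):
        members = descriptors[nearest == i]
        if len(members) > 0:
            vlad[i] = (members - centers[i]).sum(axis=0)

    # Signed power normalization, then global L2 normalization.
    v = vlad.ravel()
    v = np.sign(v) * np.abs(v) ** power
    norm = np.linalg.norm(v)
    return v / norm if norm > 0 else v
```

The output has dimension k × d, so a codebook of 2 centers over 2-dimensional descriptors yields a 4-dimensional vector. Swapping the sum in the pooling step for another aggregation, or changing the normalization, gives the kinds of variants whose effects the paper investigates.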



Author information

Correspondence to Lei Zhang.

Additional information

This work was supported by the National Natural Science Foundation of China (61976060), Project of Educational Commission of Guangdong province of China (2018KCXTD019) and Natural Science Foundation of Guangdong Province of China (2021A1515011846).

Cite this article

Li, X., Zhang, L., Jian, Z. et al. A Comprehensive Study on VLAD. Neural Process Lett 53, 2129–2145 (2021). https://doi.org/10.1007/s11063-021-10502-0

  • DOI: https://doi.org/10.1007/s11063-021-10502-0