Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics

Peng, Xiaojiang; Wang, Limin; Qiao, Yu; Peng, Qiang

doi:10.1007/978-3-319-10578-9_43

Part of the book series: Lecture Notes in Computer Science (LNIP, volume 8691)

Abstract

Recent studies show that aggregating local descriptors into a super vector yields an effective representation for retrieval and classification tasks. A popular method along this line is the vector of locally aggregated descriptors (VLAD), which aggregates the residuals between descriptors and visual words. However, the original VLAD ignores high-order statistics of local descriptors, and its dictionary may not be optimal for classification tasks. In this paper, we address these problems by utilizing high-order statistics of local descriptors and performing supervised dictionary learning. The main contributions are twofold. Firstly, we propose a high-order VLAD (H-VLAD) for visual recognition, which leverages two kinds of high-order statistics in the VLAD-like framework, namely diagonal covariance and skewness. These high-order statistics provide complementary information for VLAD and allow for efficient computation. Secondly, to further boost the performance of H-VLAD, we design a supervised dictionary learning algorithm to discriminatively refine the dictionary, which can also be extended to other super-vector-based encoding methods. We examine the effectiveness of our methods in image-based object categorization and video-based action recognition. Extensive experiments on the PASCAL VOC 2007, HMDB51, and UCF101 datasets show that our method achieves state-of-the-art performance on both tasks.
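
As a rough illustration of the quantities the abstract names, the sketch below computes a plain VLAD vector (per-word sums of residuals) together with per-word diagonal variance and skewness of the assigned descriptors. It is not the authors' H-VLAD formulation: the hard nearest-word assignment, the function names (`assign`, `vlad`, `high_order_stats`), the `eps` stabilizer, and the absence of any normalization or dictionary learning are all assumptions made for illustration.

```python
import numpy as np


def assign(X, centers):
    """Index of the nearest visual word for each descriptor (hard assignment, assumed)."""
    # Squared Euclidean distances between n descriptors and k words, shape (n, k).
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d2.argmin(axis=1)


def vlad(X, centers):
    """Plain VLAD: for each visual word, sum the residuals of its assigned descriptors."""
    k, d = centers.shape
    labels = assign(X, centers)
    out = np.zeros((k, d))
    for j in range(k):
        members = X[labels == j]
        if len(members):
            out[j] = (members - centers[j]).sum(axis=0)
    return out.ravel()


def high_order_stats(X, centers, eps=1e-12):
    """Per-word diagonal variance and skewness of the assigned descriptors.

    Illustrative only: how the paper combines these statistics with the
    dictionary is not stated in the abstract.
    """
    k, d = centers.shape
    labels = assign(X, centers)
    var = np.zeros((k, d))
    skew = np.zeros((k, d))
    for j in range(k):
        members = X[labels == j]
        if len(members):
            centered = members - members.mean(axis=0)
            var[j] = (centered ** 2).mean(axis=0)
            # Third central moment divided by variance^(3/2); eps avoids 0/0.
            skew[j] = (centered ** 3).mean(axis=0) / (var[j] + eps) ** 1.5
    return var.ravel(), skew.ravel()
```

Concatenating the three outputs gives a vector of length 3·k·d for k visual words and d-dimensional descriptors, which is one simple way to see why such statistics add information beyond the first-order residuals while remaining cheap to compute.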

Author information

Authors and Affiliations

Southwest Jiaotong University, Chengdu, China
Xiaojiang Peng & Qiang Peng
Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong, China
Limin Wang
Shenzhen Key Lab of CVPR, Shenzhen Institutes of Advanced Technology, CAS, Shenzhen, China
Xiaojiang Peng, Limin Wang & Yu Qiao
Hengyang Normal University, Hengyang, China
Xiaojiang Peng

Editor information

Editors and Affiliations

Department of Computer Science, University of Toronto, 6 King’s College Road, M5H 3S5, Toronto, ON, Canada
David Fleet
Faculty of Electrical Engineering, Department of Cybernetics, Czech Technical University in Prague, Technicka 2, 166 27, Prague 6, Czech Republic
Tomas Pajdla
Max-Planck-Institut für Informatik, Campus E1 4, 66123, Saarbrücken, Germany
Bernt Schiele
ESAT - PSI, iMinds, KU Leuven, Kasteelpark Arenberg 10, Bus 2441, 3001, Leuven, Belgium
Tinne Tuytelaars

About this paper

Cite this paper

Peng, X., Wang, L., Qiao, Y., Peng, Q. (2014). Boosting VLAD with Supervised Dictionary Learning and High-Order Statistics. In: Fleet, D., Pajdla, T., Schiele, B., Tuytelaars, T. (eds) Computer Vision – ECCV 2014. ECCV 2014. Lecture Notes in Computer Science, vol 8691. Springer, Cham. https://doi.org/10.1007/978-3-319-10578-9_43

DOI: https://doi.org/10.1007/978-3-319-10578-9_43
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-10577-2
Online ISBN: 978-3-319-10578-9
eBook Packages: Computer Science, Computer Science (R0)