Abstract
Learning middle-level image representations is important in computer vision, especially for scene classification. Currently available middle-level representations are not sparse enough to keep training and testing times compatible with the growing number of classes that users want to recognize. In this work, we propose a middle-level image representation based on patterns that are extremely shared among different classes, which reduces both training and testing time. The proposed learning algorithm first finds class-specific patterns and then uses lasso regularization to select the most discriminative patterns shared among different classes. Experimental results on widely used scene classification benchmarks (15 Scenes, MIT-indoor 67, SUN 397) show that only a small number of patterns is needed to achieve remarkable performance with reduced computation time.
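The selection step described above (keep the patterns that lasso leaves with non-zero weight, and prefer those used by several classes) can be illustrated with a minimal sketch. This is not the authors' algorithm: the one-vs-rest squared loss, the coordinate-descent solver, the function names, and the toy pattern responses are all assumptions chosen for illustration.

```python
def soft_threshold(value, threshold):
    """Proximal operator of the L1 norm: shrink value toward zero."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_iters=100):
    """Minimise (1/2n)*||y - Xw||^2 + lam*||w||_1 by cyclic coordinate descent."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    residual = list(y)  # equals y - Xw while w is all zeros
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iters):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # a pattern that never fires keeps weight zero
            # Correlation of column j with the residual that excludes w[j].
            rho = sum(X[i][j] * residual[i] for i in range(n)) / n + col_sq[j] * w[j]
            new_w = soft_threshold(rho, lam) / col_sq[j]
            delta = new_w - w[j]
            if delta != 0.0:
                for i in range(n):
                    residual[i] -= delta * X[i][j]
                w[j] = new_w
    return w


def select_shared_patterns(X, labels, lam, tol=1e-12):
    """Fit one lasso model per class (assumed one-vs-rest, 1/0 targets, no
    intercept) and count how many classes give each pattern a non-zero weight."""
    n_patterns = len(X[0])
    share_counts = [0] * n_patterns
    for cls in sorted(set(labels)):
        targets = [1.0 if label == cls else 0.0 for label in labels]
        weights = lasso_coordinate_descent(X, targets, lam)
        for j, weight in enumerate(weights):
            if abs(weight) > tol:
                share_counts[j] += 1
    # Keep patterns used by at least one class, most widely shared first.
    selected = sorted(
        (j for j in range(n_patterns) if share_counts[j] > 0),
        key=lambda j: (-share_counts[j], j),
    )
    return share_counts, selected


# Hypothetical toy data: rows are images, columns are pattern responses.
# Pattern 0 fires on classes "a" and "b", pattern 1 only on "c",
# pattern 2 is a weak, nearly silent detector.
toy_responses = [
    [1.0, 0.0, 0.1],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 1.0, 0.0],
]
toy_labels = ["a", "a", "b", "b", "c", "c"]

share_counts, selected = select_shared_patterns(toy_responses, toy_labels, lam=0.05)
print("classes using each pattern:", share_counts)
print("selected pattern indices:", selected)
```

In this sketch the penalty weight `lam` controls how many patterns survive: a larger value drives more weights to exactly zero, which is the sparsity property that makes lasso usable as a selection mechanism.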






Notes
The terms “words”, “parts”, and “patterns” are interchangeable; this paper uses “patterns” throughout.
15 Scenes: http://www-cvr.ai.uiuc.edu/ponce_grp/data/scene_categories/. MIT-indoor 67: http://web.mit.edu/torralba/www/indoor.html. SUN 397: http://vision.princeton.edu/projects/2010/SUN/.
The implementation code and trained models are available at https://github.com/hust-tp/ESMIR.
Acknowledgements
We thank the anonymous reviewers for their useful comments and suggestions. This work was supported in part by the National Natural Science Foundation of China under Grant 61572207 and Grant 61503145, and by the CAST Young Talent Supporting Program.
Cite this article
Tang, P., Zhang, J., Wang, X. et al. Learning extremely shared middle-level image representation for scene classification. Knowl Inf Syst 52, 509–530 (2017). https://doi.org/10.1007/s10115-016-1015-z
DOI: https://doi.org/10.1007/s10115-016-1015-z