
SDE: A Novel Selective, Discriminative and Equalizing Feature Representation for Visual Recognition

Published in: International Journal of Computer Vision

Abstract

The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition. Both BoW and CNN rely on a feature pooling operation when constructing their frameworks. In particular, max-pooling has been validated as an efficient and effective pooling method compared with alternatives such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods and then propose a new feature pooling method, termed selective, discriminative and equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the goal of learning more selective, discriminative and equalizing features. We use bilevel optimization to solve the joint optimization problem. Experiments on seven benchmark datasets (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art fused results (mAP) of 93.21% and 93.97% on the PASCAL VOC2007 and VOC2012 datasets, respectively.
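The abstract contrasts max-pooling with average and stochastic pooling over sets of local feature codes. As a point of reference, the following minimal NumPy sketch (an illustration of these three standard pooling operators, not the paper's SDE method; the array shapes and random seed are assumptions) shows how each one aggregates a codes-by-dimensions activation matrix into a single pooled vector.

```python
import numpy as np

def max_pool(codes):
    """Max-pooling: keep the strongest response in each feature dimension."""
    return codes.max(axis=0)

def average_pool(codes):
    """Average pooling: mean response in each feature dimension."""
    return codes.mean(axis=0)

def stochastic_pool(codes, rng):
    """Stochastic pooling: per dimension, sample one local response with
    probability proportional to its (non-negative) magnitude."""
    pooled = np.empty(codes.shape[1])
    for j in range(codes.shape[1]):
        col = np.abs(codes[:, j])
        if col.sum() == 0:
            pooled[j] = 0.0
        else:
            idx = rng.choice(len(col), p=col / col.sum())
            pooled[j] = codes[idx, j]
    return pooled

# Toy example: 5 local descriptors encoded against a 4-dimensional codebook.
rng = np.random.default_rng(0)
codes = np.abs(rng.normal(size=(5, 4)))  # non-negative activations

print("max       :", max_pool(codes))
print("average   :", average_pool(codes))
print("stochastic:", stochastic_pool(codes, rng))
```

Max-pooling tends to yield sparse, translation-tolerant summaries of the local codes, which is one reason it serves as the common baseline that SDE's jointly optimized pooled representation builds upon.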


Notes

  1. http://host.robots.ox.ac.uk:/leaderboard/main_bootstrap.php.


Acknowledgements

This work was supported by the National Basic Research Program of China (973 Program) Grant 2012CB316302, the Strategic Priority Research Program of the CAS (Grant XDA06040102), the National Natural Science Foundation of China (NSFC) (Grant 61403380), and the Henan International Cooperation Project (Grant 152102410036).

Author information


Corresponding author

Correspondence to Cheng-Lin Liu.

Additional information

Communicated by Josef Sivic.


Cite this article

Xie, GS., Zhang, XY., Yan, S. et al. SDE: A Novel Selective, Discriminative and Equalizing Feature Representation for Visual Recognition. Int J Comput Vis 124, 145–168 (2017). https://doi.org/10.1007/s11263-017-1007-9
