
SDE: A Novel Selective, Discriminative and Equalizing Feature Representation for Visual Recognition

Published in: International Journal of Computer Vision

Abstract

The Bag of Words (BoW) model and the Convolutional Neural Network (CNN) are two milestones in visual recognition. Both BoW and CNN rely on a feature pooling operation when constructing their frameworks. In particular, max-pooling has been validated as an efficient and effective pooling method compared with alternatives such as average pooling and stochastic pooling. In this paper, we first evaluate different pooling methods and then propose a new feature pooling method, termed selective, discriminative and equalizing pooling (SDE). The SDE representation is a feature learning mechanism that jointly optimizes the pooled representations with the goal of learning more selective, discriminative and equalizing features. We use bilevel optimization to solve the joint optimization problem. Experiments on seven benchmark datasets (including both single-label and multi-label ones) validate the effectiveness of our framework. In particular, we achieve state-of-the-art fused results (mAP) of 93.21% and 93.97% on the PASCAL VOC2007 and VOC2012 datasets, respectively.
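The abstract contrasts max-pooling with average and stochastic pooling over sets of local feature codes. As a point of reference, the following minimal NumPy sketch (an illustration of these three standard pooling operators, not the paper's SDE method; the array shapes and random seed are assumptions) shows how each one aggregates a codes-by-dimensions activation matrix into a single pooled vector.

```python
import numpy as np

def max_pool(codes):
    """Max-pooling: keep the strongest response in each feature dimension."""
    return codes.max(axis=0)

def average_pool(codes):
    """Average pooling: mean response in each feature dimension."""
    return codes.mean(axis=0)

def stochastic_pool(codes, rng):
    """Stochastic pooling: per dimension, sample one local response with
    probability proportional to its (non-negative) magnitude."""
    pooled = np.empty(codes.shape[1])
    for j in range(codes.shape[1]):
        col = np.abs(codes[:, j])
        if col.sum() == 0:
            pooled[j] = 0.0
        else:
            idx = rng.choice(len(col), p=col / col.sum())
            pooled[j] = codes[idx, j]
    return pooled

# Toy example: 5 local descriptors encoded against a 4-dimensional codebook.
rng = np.random.default_rng(0)
codes = np.abs(rng.normal(size=(5, 4)))  # non-negative activations

print("max       :", max_pool(codes))
print("average   :", average_pool(codes))
print("stochastic:", stochastic_pool(codes, rng))
```

Max-pooling tends to yield sparse, translation-tolerant summaries of the local codes, which is one reason it serves as the common baseline that SDE's jointly optimized pooled representation builds upon.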


Notes

  1. http://host.robots.ox.ac.uk:/leaderboard/main_bootstrap.php.


Acknowledgements

This work was supported by the National Basic Research Program of China (973 Program) Grant 2012CB316302, the Strategic Priority Research Program of the CAS (Grant XDA06040102), the National Natural Science Foundation of China (NSFC) (Grant 61403380), and the Henan International Cooperation Project (Grant 152102410036).

Author information


Corresponding author

Correspondence to Cheng-Lin Liu.

Additional information

Communicated by Josef Sivic.


Cite this article

Xie, GS., Zhang, XY., Yan, S. et al. SDE: A Novel Selective, Discriminative and Equalizing Feature Representation for Visual Recognition. Int J Comput Vis 124, 145–168 (2017). https://doi.org/10.1007/s11263-017-1007-9
