Neurocomputing

Volume 316, 17 November 2018, Pages 135-143
Learning discriminative visual elements using part-based convolutional neural network

https://doi.org/10.1016/j.neucom.2018.07.059

Abstract

Mid-level element based representations have proven very effective for visual recognition. This paper presents a method to discover discriminative mid-level visual elements based on deep Convolutional Neural Networks (CNNs). We present a part-level CNN architecture, namely the Part-based CNN (P-CNN), which acts as the encoding module in a part-based representation model. The P-CNN can be attached to an arbitrary layer of a pre-trained CNN and trained using only image-level labels. Training the P-CNN essentially corresponds to optimizing and selecting discriminative mid-level visual elements. For an input image, the output of the P-CNN is naturally a part-based coding and can be used directly for image recognition. By applying the P-CNN to multiple layers of a pre-trained CNN, more diverse visual elements can be obtained for visual recognition. We validate the proposed P-CNN on several visual recognition tasks, including scene categorization, action classification and multi-label object recognition. Extensive experiments demonstrate the competitive performance of P-CNN in comparison with state-of-the-art methods.

Introduction

Visual recognition is one of the fundamental problems in computer vision, with applications ranging from human-computer interaction and video analysis to autonomous driving. The essential challenge of visual recognition is the large gap between low-level features and high-level semantics. In recent years, mid-level visual element based methods have attracted much attention because they provide an effective way to narrow this gap. The key problem is how to use image-level supervision to discover a small number of discriminative elements, which may correspond to object parts, entire objects, visual phrases, etc., but are not restricted to any one of these.

A proper scheme for discovering distinctive visual elements should satisfy two requirements: (1) the candidate visual elements should carry specific semantic meanings; and (2) the optimization of visual elements should be tied to a target task such as image classification or image reconstruction. With respect to the first point, some researchers have recently considered discovering mid-level visual elements within pre-trained convolutional neural networks (CNNs) [1], [2], because CNNs have an excellent ability to abstract semantic representations from image pixels. Other researchers have addressed the second point and proposed a unified framework for jointly learning parts and classifiers [3].

However, most existing part-based methods [1], [3], [4], [5], [6], [7], [8], [9] use two separate phases for discovering parts and recognizing images. For example, MDPM [1] uses association rule mining for part discovery, and then uses the discovered parts as convolutional filters to extract features for recognition. In [3], the authors proposed to unify the training of part filters and the classifier. Their method first selects good parts by training with l2/l1 regularization, and then refines the selected parts with a newly initialized classifier using an l2 regularizer. In this paper, we explore whether parts can be trained and ranked jointly with a classifier using a single objective.

We propose a CNN-based approach that discovers discriminative visual elements for visual recognition. Unlike existing methods that mine visual elements by optimizing and selecting filters within a conventional CNN model [1], [2], [3], we employ an additional single-layer CNN attached to different layers of a pre-trained CNN to facilitate the learning of visual elements (see Fig. 1). This additional CNN is specially designed and works like the encoding module in a part-based representation model. We call it the part-based CNN (P-CNN), since it is driven to optimize and select part-like visual elements. The P-CNN can be used to improve the visual recognition performance of a pre-trained deep neural network without using extra training data.
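As an illustration of this design, the following sketch (not the authors' code) shows how intermediate feature maps of a pre-trained backbone can be exposed so that a part-level module may be attached on top of them. The VGG-16 backbone, the tap-layer indices, and the input size are our own assumptions for the example.

```python
# Hypothetical sketch: tap intermediate activations of a pre-trained CNN
# so that a part-level head (P-CNN) can be attached to them.
import torch
import torchvision.models as models

backbone = models.vgg16(pretrained=True).features.eval()
# Assumed tap points (ReLU outputs of conv3_3, conv4_3, conv5_3 in VGG-16).
tap_layers = {15: "relu3_3", 22: "relu4_3", 29: "relu5_3"}
feature_maps = {}

def make_hook(name):
    def hook(module, inputs, output):
        feature_maps[name] = output        # N x C x H x W activation map
    return hook

for idx, name in tap_layers.items():
    backbone[idx].register_forward_hook(make_hook(name))

with torch.no_grad():
    backbone(torch.randn(1, 3, 224, 224))  # dummy image batch
for name, fmap in feature_maps.items():
    print(name, tuple(fmap.shape))          # e.g. relu4_3 (1, 512, 28, 28)
```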

Our P-CNN consists of successive convolution, pooling, and rectified linear unit (ReLU) layers, followed by a Support Vector Machine (SVM) classifier. After jointly optimizing the P-CNN and the SVM on samples with image-level labels, candidate visual elements can be ranked according to their classification losses, and further selected using the classifier weights to form the final visual element detectors. One advantage of the proposed method is that the formulation of the P-CNN is identical to the pipeline for representing an image, which tightly couples part learning with the recognition task. More importantly, the P-CNN and the classifier can be jointly optimized by backpropagation, which is simpler than the iterative method used in [3]. We have applied our method to scene categorization, action classification, and multi-label object recognition. Extensive experiments on benchmarks show that the proposed method obtains results competitive with many state-of-the-art methods.
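To make this concrete, here is a minimal, hedged sketch of such a head in PyTorch. The paper does not publish this exact code; the filter size, part count, and class count below are illustrative assumptions, and a multi-class hinge loss stands in for the SVM objective.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PCNNHead(nn.Module):
    """Sketch of a P-CNN-style head: a convolution whose filters act as
    candidate part detectors, max pooling of each response map, a ReLU,
    and a linear classifier trained with an SVM-style hinge loss.
    Hyperparameters here are assumptions, not the paper's values."""
    def __init__(self, in_channels, num_parts=512, num_classes=67, ksize=3):
        super().__init__()
        self.parts = nn.Conv2d(in_channels, num_parts, kernel_size=ksize)
        self.classifier = nn.Linear(num_parts, num_classes)

    def forward(self, fmap):                    # fmap: N x C x H x W
        resp = self.parts(fmap)                 # part response maps
        code = F.relu(resp.amax(dim=(2, 3)))    # strongest response per part
        return self.classifier(code), code

head = PCNNHead(in_channels=512)
logits, code = head(torch.randn(4, 512, 14, 14))
labels = torch.randint(0, 67, (4,))
loss = F.multi_margin_loss(logits, labels)      # multi-class hinge loss
loss.backward()  # part filters and classifier receive gradients jointly
# After training, parts could be ranked, e.g., by classifier weight magnitude:
ranking = head.classifier.weight.abs().sum(dim=0).argsort(descending=True)
```

The final ranking line mirrors the selection-by-classifier-weights idea described above, though the exact criterion used in the paper may differ.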

Section snippets

Related work

Mining ideal visual elements is a challenging problem, and many methods have been proposed to address it. We briefly discuss the most closely related ones.

Transfer from CNNs. CNNs have demonstrated their effectiveness for large-scale visual recognition [10], [11], [12], [13], [14], [15]. CNNs pre-trained on large amounts of annotated data are often used as generic feature extractors, leading to impressive performance on a variety of visual tasks [16], [17]. Our work shares a similar idea to…

The proposed method

The goal of our method is to automatically learn discriminative visual elements from images. In this section, we first introduce the architecture of our method, then present the optimization of the proposed P-CNN, and finally describe the encoding method for visual recognition applications.
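Although the encoding details appear later in the paper, a plausible sketch of the final step, under the same assumptions as the snippets above, is to reuse the trained heads as encoders and concatenate their pooled part responses from several layers into one image representation.

```python
import torch

def encode_image(feature_maps, heads):
    """Hypothetical encoder: feature_maps maps layer name -> N x C x H x W
    backbone activations; heads maps layer name -> a trained PCNNHead."""
    codes = []
    with torch.no_grad():
        for name, fmap in feature_maps.items():
            _, code = heads[name](fmap)   # part-based coding for this layer
            codes.append(code)
    return torch.cat(codes, dim=1)        # multi-layer part-based feature

# The concatenated vector would then feed a linear classifier (e.g., an SVM).
```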

Experiments

In this section, we demonstrate the effectiveness of our method on multiple recognition tasks. We first briefly describe the datasets used and the implementation details, followed by an in-depth investigation of our P-CNN. Finally, we compare our technique with state-of-the-art methods.

Conclusions

We present a part-level convolutional neural network (P-CNN) model to discover discriminative visual elements (semantic parts), and derive part-based image classifiers that achieve state-of-the-art results on several benchmark challenges across different recognition tasks. The key enabler of our model is the integration of the hierarchical abstraction capacity of deep Convolutional Neural Networks (CNNs) with the semantic concentration capacity of part-based image representations. Another merit of…

Acknowledgments

This project is supported by the Natural Science Foundation of China (No. 61672544), the Fundamental Research Funds for the Central Universities (No. 161gpy41), and the Tip-top Scientific and Technical Innovative Youth Talents of Guangdong Special Support Program (No. 2016TQ03X263).

References (72)

  • J.C. Rubio et al.

    Generative regularization with latent topics for discriminative object recognition

    Pattern Recognit.

    (2015)
  • P. Tang et al.

    G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition

    Neurocomputing

    (2017)
  • Y. Li et al.

    Mining mid-level visual patterns with deep CNN activations

    Proceedings of the ICCV

    (2017)
  • M. Simon et al.

    Neural activation constellations: unsupervised part model discovery with convolutional networks

    Proceedings of the ICCV

    (2015)
  • S.N. Parizi et al.

    Automatic discovery and optimization of parts for image classification

    Proceedings of the ICLR

    (2015)
  • S. Singh et al.

    Unsupervised discovery of mid-level discriminative patches

    Proceedings of the ECCV

    (2012)
  • C. Doersch et al.

    Mid-level visual element discovery as discriminative mode seeking

    Proceedings of the NIPS

    (2013)
  • M. Juneja et al.

    Blocks that shout: distinctive parts for scene classification

    Proceedings of the CVPR

    (2013)
  • J. Sun et al.

    Learning discriminative part detectors for image classification and cosegmentation

    Proceedings of the ICCV

    (2013)
  • L.-J. Li et al.

    Object bank: a high-level image representation for scene classification & semantic feature sparsification

    Proceedings of the NIPS

    (2010)
  • G. Sharma et al.

    Expanded parts model for semantic description of humans in still images

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • A. Krizhevsky et al.

    ImageNet classification with deep convolutional neural networks

    Proceedings of the NIPS

    (2012)
  • A. Karpathy et al.

    Large-scale video classification with convolutional neural networks

    Proceedings of the CVPR

    (2014)
  • K. Chatfield et al.

    Return of the devil in the details: delving deep into convolutional nets

    Proceedings of the BMVC

    (2014)
  • K. Simonyan et al.

    Very deep convolutional networks for large-scale image recognition

    Proceedings of the ICLR

    (2014)
  • M.D. Zeiler et al.

    Visualizing and understanding convolutional networks

    Proceedings of the ECCV

    (2014)
  • B. Zhou et al.

    Learning deep features for scene recognition using places database

    Proceedings of the NIPS

    (2014)
  • J. Donahue et al.

    DeCAF: a deep convolutional activation feature for generic visual recognition

    Proceedings of the ICML

    (2014)
  • A.S. Razavian et al.

    CNN features off-the-shelf: an astounding baseline for recognition

    Proceedings of the CVPRW

    (2014)
  • M. Oquab et al.

    Is object localization for free? Weakly-supervised learning with convolutional neural networks

    Proceedings of the CVPR

    (2015)
  • S. Lazebnik et al.

    Beyond bags of features: spatial pyramid matching for recognizing natural scene categories

    Proceedings of the CVPR

    (2006)
  • L. Bossard et al.

    Food-101 – mining discriminative components with random forests

    Proceedings of the ECCV

    (2014)
  • A. Eigenstetter et al.

    Randomized max-margin compositions for visual recognition

    Proceedings of the CVPR

    (2014)
  • Q. Li et al.

    Harvesting mid-level visual concepts from large-scale internet images

    Proceedings of the CVPR

    (2013)
  • A. Jain et al.

    Representing videos using mid-level discriminative patches

    Proceedings of the CVPR

    (2013)
  • L. Wang et al.

    Motionlets: mid-level 3D parts for human motion recognition

    Proceedings of the CVPR

    (2013)
  • S. Sadanand et al.

    Action bank: a high-level representation of activity in video

    Proceedings of the CVPR

    (2012)
  • G. Gkioxari et al.

    Actions and attributes from wholes and parts

    Proceedings of the CVPR

    (2015)
  • Y.J. Lee et al.

    Style-aware mid-level representation for discovering visual connections in space and time

    Proceedings of the ICCV

    (2013)
  • R. Girshick et al.

    Deformable part models are convolutional neural networks

    Proceedings of the CVPR

    (2015)
  • W. Ouyang et al.

    DeepID-Net: deformable deep convolutional neural networks for object detection

    Proceedings of the CVPR

    (2015)
  • P.F. Felzenszwalb et al.

    Object detection with discriminatively trained part-based models

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2010)
  • X. Yao et al.

    Revisiting co-saliency detection: a novel approach based on two-stage multi-view spectral rotation co-clustering

    IEEE Trans. Image Process.

    (2017)
  • D. Zhang et al.

    Co-saliency detection via a self-paced multiple-instance learning framework

    IEEE Trans. Pattern Anal. Mach. Intell.

    (2017)
  • J. Han et al.

    Advanced deep-learning techniques for salient and category-specific object detection: a survey

    IEEE Signal Process. Mag.

    (2018)
  • J. Han et al.

    Robust object co-segmentation using background prior

    IEEE Trans. Image Process.

    (2018)

Lingxiao Yang is currently a Ph.D. candidate at The Hong Kong Polytechnic University. He received a B.E. degree from Beijing Union University in 2010 and an M.E. degree in Computer Application from South China Normal University in 2013. He was a research assistant at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, from July 2013 to July 2015. His research interests include computer vision and machine learning.

Xiaohua Xie is currently an Associate Professor at Sun Yat-sen University. Prior to joining SYSU, he was an Associate Professor at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences. He received a B.Sc. in Mathematics and Applied Mathematics (2005) from Shantou University, and an M.Sc. in Information and Computing Science (2007) and a Ph.D. in Applied Mathematics (2010) from Sun Yat-sen University in China (jointly supervised by Concordia University in Canada). His current research fields cover image processing, computer vision, pattern recognition, and computer graphics, with a focus on image understanding and object modeling. He has published more than a dozen papers in prestigious international journals and conferences. He is recognized as Overseas High-Caliber Personnel (Level B) in Shenzhen, China.

Jianhuang Lai received the Ph.D. degree in mathematics from Sun Yat-sen University, China, in 1999. He joined Sun Yat-sen University in 1989 as an assistant professor, and he is currently a Professor in the School of Data and Computer Science. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, multiple-target tracking, and wavelets and their applications. He has published over 200 scientific papers in international journals and conferences on image processing and pattern recognition, e.g., IEEE TPAMI, IEEE TNN, IEEE TKDE, IEEE TIP, IEEE TSMC (Part B), IEEE TCSVT, Pattern Recognition, ICCV, CVPR, and ICDM. He serves as a standing member of the Image and Graphics Association of China and as a standing director of the Image and Graphics Association of Guangdong.
