Learning discriminative visual elements using part-based convolutional neural network
Introduction
Visual recognition is one of the fundamental problems in computer vision, with applications ranging from human-computer interaction and video analysis to autonomous driving. The essential challenge of visual recognition is the large gap between low-level features and high-level semantics. In recent years, methods based on mid-level visual elements have attracted much attention because they provide an effective way to narrow this gap. The key problem is how to use image-level supervision to discover a small number of discriminative elements, which may correspond to object parts, entire objects, visual phrases, etc., but are not restricted to any one of them.
A proper scheme for discovering discriminative visual elements should satisfy two requirements: (1) the candidate visual elements should represent specific semantic meanings; and (2) the optimization of visual elements should be tied to a concrete task such as image classification or image reconstruction. With respect to the first point, some researchers have recently considered discovering mid-level visual elements within pre-trained convolutional neural networks (CNNs) [1], [2], because CNNs have an excellent ability to abstract semantic representations from image pixels. Other researchers have addressed the second point and proposed a unified framework for jointly learning parts and classifiers [3].
However, most existing part-based methods [1], [3], [4], [5], [6], [7], [8], [9] use two separate phases for discovering parts and recognizing images. For example, MDPM [1] uses association rule mining for part discovery and then utilizes the discovered parts as convolutional filters to extract features for recognition. In [3], the authors proposed to unify the training of part filters and the classifier. Their method first selects good parts by training with ℓ2/ℓ1 regularization, and then refines the selected parts with a newly initialized classifier using an ℓ2 regularizer. In this paper, we explore whether parts can be trained and ranked jointly with a classifier using a single objective.
We propose a CNN-based approach that discovers discriminative visual elements for visual recognition. Unlike existing methods that mine visual elements by optimizing and selecting filters within a conventional CNN model [1], [2], [3], we employ an additional single-layer CNN attached to different layers of a pre-trained CNN to facilitate the learning of visual elements (see Fig. 1). The additional CNN is specially designed and works like the encoding module of a part-based representation model. We call the additional CNN a part-based CNN (P-CNN), since it is forced to optimize and select part-like visual elements. The P-CNN can be used to improve the visual recognition performance of a pre-trained deep neural network without using extra training data.
Our P-CNN consists of successive convolution, pooling, and rectified linear unit (ReLU) operations, followed by a Support Vector Machine (SVM) classifier. After a joint optimization of the P-CNN and the SVM using samples with image-level labels, candidate visual elements can be ranked according to their classification losses and further selected by the weights of the classifiers to form the final visual element detectors. One advantage of the proposed method is that the formulation of the P-CNN is identical to the pipeline used to represent an image, which tightly couples part learning with the recognition task. More importantly, our P-CNN and the classifier can be jointly optimized by backpropagation, which is simpler than the iterative method utilized in [3]. We have applied our method to scene categorization, action classification, and multi-label object recognition. Extensive experiments on benchmarks show that the proposed method achieves results competitive with many state-of-the-art methods.
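As a concrete illustration only (not the authors' actual implementation), the forward pass described above can be sketched in NumPy: part filters act as a 1×1 convolution over a pre-trained CNN feature map, followed by max-pooling and ReLU, a linear classifier on the pooled part responses, and a part-ranking step based on classifier weights. All shapes, names, and the random data are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes: a pre-trained CNN yields a C x H x W feature map per image.
C, H, W = 64, 7, 7
num_parts = 16           # candidate part filters (the single-layer "P-CNN")
num_classes = 4

feat = rng.standard_normal((C, H, W))                  # pre-trained CNN activations
part_filters = rng.standard_normal((num_parts, C))     # 1x1 conv filters over channels
clf_w = rng.standard_normal((num_classes, num_parts))  # linear (SVM-style) classifier

# 1x1 convolution: each part filter responds at every spatial location.
resp = np.einsum('pc,chw->phw', part_filters, feat)

# Global max-pooling keeps each part's strongest response; ReLU clips negatives.
pooled = np.maximum(resp.reshape(num_parts, -1).max(axis=1), 0.0)

# Class scores computed from the part-based representation.
scores = clf_w @ pooled

# Rank candidate parts by the magnitude of their classifier weights,
# mimicking the selection step that keeps the most discriminative parts.
importance = np.abs(clf_w).sum(axis=0)
ranked_parts = np.argsort(-importance)

print(scores.shape, ranked_parts[:5])
```

In the actual method the part filters and classifier are trained jointly by backpropagation on image-level labels; this sketch only shows why the part-based encoding and the classification pipeline share one formulation.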
Section snippets
Related work
Mining ideal visual elements is a challenging problem, and many methods have been proposed to address it. We briefly discuss the most related ones below.
Transfer from CNNs. CNNs have demonstrated their effectiveness for large-scale visual recognition [10], [11], [12], [13], [14], [15]. CNNs pre-trained on large amounts of annotated data are often used as generic feature extractors, leading to impressive performance on a variety of visual tasks [16], [17]. Our work shares a similar idea to
The proposed method
The goal of our method is to automatically learn discriminative visual elements from images. In this section, we first introduce the architecture of our method and then present the optimization of the proposed P-CNN. Finally, we describe the encoding method for visual recognition applications.
Experiments
In this section, we demonstrate the effectiveness of our method on multiple recognition tasks. We first briefly describe the employed datasets and the implementation details, followed by an in-depth investigation of our P-CNN. Finally, we compare our technique with state-of-the-art methods.
Conclusions
We present a part-based convolutional neural network (P-CNN) model to discover discriminative visual elements (semantic parts), and derive part-based image classifiers that achieve state-of-the-art results for different recognition tasks on several benchmark challenges. The key enabler of our model is the integration of the hierarchical abstraction capacity of deep convolutional neural networks (CNNs) with the semantic concentration capacity of part-based image representations. Another merit of
Acknowledgments
This project is supported by the Natural Science Foundation of China (No. 61672544), Fundamental Research Funds for the Central Universities (No. 161gpy41), and Tip-top Scientific and Technical Innovative Youth Talents of Guangdong special support program (No. 2016TQ03X263).
Lingxiao Yang is currently a Ph.D. candidate at The Hong Kong Polytechnic University. He received a B.E. degree from Beijing Union University in 2010 and an M.E. degree in Computer Application from South China Normal University in 2013. He was a research assistant at SIAT (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences) from July 2013 to July 2015. His research interests include Computer Vision and Machine Learning.
References (72)
- Generative regularization with latent topics for discriminative object recognition, Pattern Recognit. (2015)
- G-MS2F: GoogLeNet based multi-stage feature fusion of deep CNN for scene recognition, Neurocomputing (2017)
- Mining mid-level visual patterns with deep CNN activations, Proceedings of the ICCV (2017)
- Neural activation constellations: unsupervised part model discovery with convolutional networks, Proceedings of the ICCV (2015)
- Automatic discovery and optimization of parts for image classification, Proceedings of the ICLR (2015)
- Unsupervised discovery of mid-level discriminative patches, Proceedings of the ECCV (2012)
- Mid-level visual element discovery as discriminative mode seeking, Proceedings of the NIPS (2013)
- Blocks that shout: distinctive parts for scene classification, Proceedings of the CVPR (2013)
- Learning discriminative part detectors for image classification and cosegmentation, Proceedings of the ICCV (2013)
- Object bank: a high-level image representation for scene classification & semantic feature sparsification, Proceedings of the NIPS (2010)
- Expanded parts model for semantic description of humans in still images, IEEE TPAMI
- ImageNet classification with deep convolutional neural networks, Proceedings of the NIPS
- Large-scale video classification with convolutional neural networks, Proceedings of the CVPR
- Return of the devil in the details: delving deep into convolutional nets, Proceedings of the BMVC
- Very deep convolutional networks for large-scale image recognition, Proceedings of the ICLR
- Visualizing and understanding convolutional networks, Proceedings of the ECCV
- Learning deep features for scene recognition using places database, Proceedings of the NIPS
- DeCAF: a deep convolutional activation feature for generic visual recognition, Proceedings of the ICML
- CNN features off-the-shelf: an astounding baseline for recognition, Proceedings of the CVPRW
- Is object localization for free? Weakly-supervised learning with convolutional neural networks, Proceedings of the CVPR
- Beyond bags of features: spatial pyramid matching for recognizing natural scene categories, Proceedings of the CVPR
- Food-101 – mining discriminative components with random forests, Proceedings of the ECCV
- Randomized max-margin compositions for visual recognition, Proceedings of the CVPR
- Harvesting mid-level visual concepts from large-scale internet images, Proceedings of the CVPR
- Representing videos using mid-level discriminative patches, Proceedings of the CVPR
- Motionlets: mid-level 3D parts for human motion recognition, Proceedings of the CVPR
- Action bank: a high-level representation of activity in video, Proceedings of the CVPR
- Actions and attributes from wholes and parts, Proceedings of the CVPR
- Style-aware mid-level representation for discovering visual connections in space and time, Proceedings of the ICCV
- Deformable part models are convolutional neural networks, Proceedings of the CVPR
- DeepID-Net: deformable deep convolutional neural networks for object detection, Proceedings of the CVPR
- Object detection with discriminatively trained part-based models, Trans. Pattern Anal. Mach. Intell.
- Revisiting co-saliency detection: a novel approach based on two-stage multi-view spectral rotation co-clustering, Trans. Image Process.
- Co-saliency detection via a self-paced multiple-instance learning framework, Trans. Pattern Anal. Mach. Intell.
- Advanced deep-learning techniques for salient and category-specific object detection: a survey, Signal Process. Mag.
- Robust object co-segmentation using background prior, Trans. Image Process.
Cited by (8)
- A joint framework for mining discriminative and frequent visual representation, Neurocomputing (2022)
- Stable Visual Pattern Mining via Pattern Probability Distribution, Lecture Notes in Computer Science (2024)
- Strategies of Applying Visual Element Combination to Improve Visual Cognitive Efficiency in the Era of Big Data Network, Mobile Information Systems (2022)
- The Influence of Artificial Intelligence on Visual Elements of Web Page Design under Machine Vision, Computational Intelligence and Neuroscience (2022)
- Jointly Discriminating and Frequent Visual Representation Mining, Lecture Notes in Computer Science (2021)
Xiaohua Xie is currently an Associate Professor at Sun Yat-sen University. Prior to joining SYSU, he was an Associate Professor at the Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences. He received a B.Sc. in Mathematics and Applied Mathematics (2005) from Shantou University, and an M.Sc. in Information and Computing Science (2007) and a Ph.D. in Applied Mathematics (2010) from Sun Yat-sen University in China (jointly supervised by Concordia University in Canada). His current research fields cover image processing, computer vision, pattern recognition, and computer graphics, with a particular focus on image understanding and object modeling. He has published more than a dozen papers in prestigious international journals and conferences. He is recognized as Overseas High-Caliber Personnel (Level B) in Shenzhen, China.
Jianhuang Lai received the Ph.D. degree in mathematics in 1999 from Sun Yat-sen University, China. He joined Sun Yat-sen University in 1989 as an assistant professor and is currently a Professor in the School of Data and Computer Science. His current research interests are in the areas of digital image processing, pattern recognition, multimedia communication, multiple-target tracking, and wavelets and their applications. He has published over 200 scientific papers in international journals and conferences on image processing and pattern recognition, e.g., IEEE TPAMI, IEEE TNN, IEEE TKDE, IEEE TIP, IEEE TSMC (Part B), IEEE TCSVT, Pattern Recognition, ICCV, CVPR, and ICDM. He serves as a standing member of the Image and Graphics Association of China and as a standing director of the Image and Graphics Association of Guangdong.