
Neurocomputing

Volume 249, 2 August 2017, Pages 19-27

Filtered shallow-deep feature channels for pedestrian detection

https://doi.org/10.1016/j.neucom.2017.03.007

Abstract

The semantic segmentation task is closely related to detection and can clearly provide complementary information for it. In this paper, we propose integrating deep semantic segmentation feature maps into the original pedestrian detection framework, which combines feature channels with AdaBoost classifiers. First, we develop shallow-deep channels by concatenating shallow hand-crafted channels and deep segmentation channels to capture appearance cues as well as semantic attributes. Then a set of manually designed filters is applied to the new channels to generate additional response feature maps. Finally, a cascade AdaBoost classifier is learned for hard-negative selection and pedestrian detection. With this rich feature information, our proposed detector achieves superior results on the Caltech USA 10x and ETH datasets.

Introduction

Pedestrian detection has been a hot research topic in the past few years [1], [2], [3], [33]. It aims to return bounding boxes that enclose the pedestrians in given images. A majority of works follow all or part of a pipeline comprising preprocessing, foreground segmentation, object classification, refinement, and tracking [30]. Recent research shows two dominant pedestrian detection approaches, namely deep-learning-based convolutional neural networks (CNNs) [5], [6], [17] and boosted decision trees with hand-crafted features [2], [3] (e.g., Histogram of Oriented Gradients (HOG), Aggregated Channel Features (ACF)). In some scenarios, straightforward use of a deep learning framework [6] works no better than traditional methods [2]. Inspired by the close correlation between segmentation and detection [1], [4] and recent advances in semantic segmentation [10], [11], [12], many researchers attempt to use segmentation to aid the detection task. Cadena et al. [8] train a Support Vector Machine (SVM) classifier on segmentation features of proposals generated by the ACF detector, and then take the product of the SVM score and the original ACF score as the final acceptance decision. Based on the Regions with CNN features (R-CNN) framework, Gidaris and Komodakis [4] concatenate multi-region CNN features and CNN-based semantic segmentation-aware features as object representations. Hariharan et al. [9] extract features from bounding boxes and foreground regions by training separate networks.

In the above-mentioned works, segmentation feature extraction based on coarse bounding box annotations [4] or simple foreground segmentation [9] may lead to an insufficient description of objects in scenes. The public dataset of fine road-scene attribute annotations [13] and recent developments in CNNs [10], [11], [12] make it possible to learn more accurate semantic segmentation feature maps. Due to complicated scene layouts exemplified by severe occlusion and dense crowds in natural images, however, it is still difficult to obtain pedestrian bounding boxes directly from the semantic segmentation feature map. Relying on elaborately designed deep learning models, previous methods [1], [7] achieve excellent results. In fact, the R-CNN framework [2], [6] is even surpassed by a straightforward scheme in which the conventional shallow ACF features are combined with boosted decision trees. In this paper, we therefore consider using filtered shallow-deep features that fuse traditional ACF and deep semantic segmentation feature channels under the conventional framework [1], [5], [7].

As shown in Fig. 1, the semantic segmentation feature map (SSFM) describes each pixel with a probability vector over semantic classes, covering all objects in road scenes such as person, tree, road, pole, etc. With this capability of describing pixel-wise semantic attributes, the SSFM provides auxiliary information complementary to low-level appearance features [2], [3], which further improves detection performance.
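As an illustrative sketch (not the paper's actual model), an SSFM of this form can be obtained by applying a per-pixel softmax over a CNN's class logits; the class list below is a hypothetical subset of road-scene labels:

```python
import numpy as np

# Hypothetical road-scene label subset; the paper's actual label set
# follows the fine scene annotations it cites.
CLASSES = ["person", "tree", "road", "pole", "background"]

def semantic_segmentation_feature_map(logits):
    """Convert per-pixel CNN logits of shape (H, W, C) into a probability
    vector per pixel via a softmax over the class axis, yielding one
    feature channel per semantic class."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: a 4x4 "image" with random logits over the 5 classes.
rng = np.random.default_rng(0)
ssfm = semantic_segmentation_feature_map(rng.normal(size=(4, 4, len(CLASSES))))
assert ssfm.shape == (4, 4, 5)
assert np.allclose(ssfm.sum(axis=-1), 1.0)  # a valid distribution per pixel
```

Each of the resulting C maps can then be treated as an extra feature channel alongside the hand-crafted ones.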

By concatenating the HOG+LUV channels (as in ACF [3]) and the SSFM (shown in Fig. 1), the shallow-deep channels possess both appearance descriptive power at the image level and deep semantic cues at the pixel level. Then, similar to the convolution operation in deep learning, a family of rectangular hand-designed filters of the same size is applied to the shallow-deep feature channels. Consequently, more abstract and compact feature maps are obtained after applying these manual filters to the original shallow-deep channels. Finally, the resulting feature vector is fed into the decision trees. The flowchart of our method is shown in Fig. 2.
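A minimal sketch of this concatenate-then-filter step, assuming random stand-in channels in place of real HOG+LUV and segmentation maps, and a tiny illustrative filter bank (uniform, horizontal-split, and vertical-split rectangles) in place of the paper's exact checkerboard-like set:

```python
import numpy as np

def checkerboard_like_filters(size=4):
    """A tiny bank of hand-designed rectangular filters (uniform,
    horizontal-split, vertical-split).  Illustrative only; the paper's
    bank is analogous to the checkerboard patterns it cites."""
    half = size // 2
    uniform = np.ones((size, size))
    horiz = np.ones((size, size)); horiz[half:, :] = -1.0
    vert = np.ones((size, size)); vert[:, half:] = -1.0
    return [f / size**2 for f in (uniform, horiz, vert)]

def filter_channels(channels, filters):
    """Cross-correlate every channel of an (H, W, K) tensor with every
    filter ('valid' mode) and stack all responses along the last axis."""
    H, W, K = channels.shape
    responses = []
    for k in range(K):
        for f in filters:
            fh, fw = f.shape
            resp = np.zeros((H - fh + 1, W - fw + 1))
            for i in range(resp.shape[0]):
                for j in range(resp.shape[1]):
                    resp[i, j] = np.sum(channels[i:i + fh, j:j + fw, k] * f)
            responses.append(resp)
    return np.stack(responses, axis=-1)

# Shallow-deep channels: 10 hand-crafted (HOG+LUV) + 5 segmentation maps.
rng = np.random.default_rng(1)
shallow = rng.random((16, 16, 10))   # stand-in for HOG+LUV channels
deep = rng.random((16, 16, 5))       # stand-in for the SSFM
shallow_deep = np.concatenate([shallow, deep], axis=-1)

feats = filter_channels(shallow_deep, checkerboard_like_filters(4))
assert feats.shape == (13, 13, 45)   # 15 channels x 3 filters
```

The flattened `feats` tensor plays the role of the feature vector passed to the boosted decision trees.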

The contributions of this paper are summarized as follows:

  • To our knowledge, this is the first work that integrates deep semantic segmentation feature maps into the process of learning the traditional ACF detector.

  • A set of hand-designed rectangular filters analogous to checkerboard filters [2] is applied to the shallow-deep channels to extract more discriminative feature information. Superior performance is achieved with only a small number of these checkerboard-like filters.

  • Filtered shallow-deep feature channels achieve a superior result on the Caltech pedestrian dataset, with a log-average miss rate of 16.87%, which significantly outperforms all hand-crafted features [3], [5], [15], [16] and some CNN-based methods [6], [17].

The remainder of this paper is structured as follows. In the next section, we review the related work. Then we introduce our proposed method in Section 3 and evaluate our method on two benchmarks in Section 4. Finally, the conclusion is presented in Section 5.

Section snippets

Related work

The past few years have witnessed massive efforts devoted to designing descriptive features for improving pedestrian detection performance. Without loss of generality, features in recent works can be categorized into two groups, namely hand-crafted features and CNN features. Recently, segmentation features have also been combined to improve detection performance. The related work below is introduced from these three aspects.

Hand-crafted features. After generating original feature points, Li et al.

ACF detector

Similar to the ACF detector, our approach is based on the traditional Viola–Jones object detection framework (VJ framework) [21], which consists of feature extraction and cascade classifier training. The process of training the ACF detector [3] can be summarized as follows: given an image, 10 feature channels, including the LUV color channels, the gradient magnitude channel, and the gradient histogram channels, are computed, from which lower-resolution channels are obtained by summing and smoothing. Then
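The channel computation and aggregation steps above can be sketched as follows; this toy version covers only the gradient-based channels on a grayscale image and omits the LUV color channels and the smoothing step:

```python
import numpy as np

def acf_channels(gray, n_orient=6, shrink=4):
    """Illustrative sketch of ACF-style channels on one grayscale image:
    gradient magnitude plus oriented gradient histogram channels,
    aggregated by summing over non-overlapping shrink x shrink blocks.
    The real detector also includes the three LUV color channels."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)

    H, W = gray.shape
    # Magnitude channel + one soft-binned histogram channel per orientation.
    chans = [mag] + [np.where(bins == b, mag, 0.0) for b in range(n_orient)]

    # Aggregate: sum each channel over shrink x shrink blocks (lower resolution).
    h, w = H // shrink, W // shrink
    agg = [c[:h * shrink, :w * shrink]
           .reshape(h, shrink, w, shrink).sum(axis=(1, 3)) for c in chans]
    return np.stack(agg, axis=-1)

img = np.tile(np.arange(32, dtype=float), (32, 1))  # horizontal intensity ramp
ch = acf_channels(img)
assert ch.shape == (8, 8, 7)            # magnitude + 6 orientation channels
assert np.allclose(ch[:, :, 0], 16.0)   # unit gradient summed over 4x4 blocks
```

Boosted decision trees are then trained on features read from these aggregated channels.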

Experiments

In this section, we conduct experiments on the Caltech [3] and ETH [25] datasets to verify the performance of our proposed pedestrian detector.

Datasets. The Caltech 10x dataset is one of the most popular pedestrian datasets [2]; it consists of 250k frames from urban traffic video. Each frame is annotated, and one out of every three frames is used for the training set. The ETH dataset has three sets of sequences (ETH0, ETH1, ETH2) containing 999, 451, and 354 frames, respectively.

Conclusion and future work

In this paper, we introduce new feature channels to describe more image information, on which simple checkerboard-like filters are applied. The deep semantic segmentation features generated by a powerful off-the-shelf CNN model are able to capture fine semantic attributes and provide complementary information for the original HOG+LUV channels. The set of checkerboard-like filters extends the aggregation operation in ACF methods and helps improve accuracy. The effectiveness of our method is

Acknowledgments

This work is supported by the National Natural Science Foundation (NNSF) of China under Grants 61473086 and 61375001, partly supported by the open fund of the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education (No. MCCSE2013B01), the NSF of Jiangsu Province (Grants No. BK20140566 and BK20150470), and the China Postdoctoral Science Foundation (2014M561586).

Biyun Sheng received the B.S. and M.S. degrees from the School of Electrical and Information Engineering, Jiangsu University, Zhenjiang, China, in 2010 and 2013, respectively. She is now a Ph.D. student in the School of Automation at Southeast University, Nanjing, China.

References (38)

  • J. Shen et al., Learning discriminative shape statistics distribution features for pedestrian detection, Neurocomputing (2016)
  • Y. Tian et al., Pedestrian detection aided by deep learning semantic tasks, Proceedings of the ICCV (2015)
  • S. Zhang et al., Filtered channel features for pedestrian detection, Proceedings of the CVPR (2015)
  • P. Dollar et al., Fast feature pyramids for object detection, PAMI (2014)
  • S. Gidaris et al., Object detection via a multi-region & semantic segmentation-aware CNN model, Proceedings of the ICCV (2015)
  • Q. Hu, P. Wang, C. Shen, A. Hengel, F. Porikli, Pushing the limits of deep CNNs for pedestrian detection, ArXiv:...
  • J. Hosang et al., Taking a deeper look at pedestrians, Proceedings of the CVPR (2015)
  • Z. Cai et al., Learning complexity-aware cascades for deep pedestrian detection, Proceedings of the ICCV (2015)
  • C. Cadena et al., A fast, modular scene understanding system using context-aware object, Proceedings of the ICRA (2015)
  • B. Hariharan et al., Simultaneous detection and segmentation, Proceedings of the ECCV (2014)
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proceedings of the CVPR (2015)
  • L. Chen et al., Semantic image segmentation with deep convolutional nets and fully connected CRFs, Proceedings of the ICLR (2015)
  • G. Lin et al., Efficient piecewise training of deep structured models for semantic segmentation, Proceedings of the CVPR (2016)
  • M. Cordts et al., The cityscapes dataset for semantic urban scene understanding, Proceedings of the CVPR (2016)
  • N. Dalal et al., Histograms of oriented gradients for human detection, Proceedings of the CVPR (2005)
  • X. Wang et al., An HOG-LBP human detector with partial occlusion handling, Proceedings of the ICCV (2009)
  • S. Paisitkriangkrai et al., Strengthening the effectiveness of pedestrian detection with spatially pooled features, Proceedings of the ECCV (2014)
  • B. Yang et al., Convolutional channel features, Proceedings of the ICCV (2015)


Qichang Hu is a Ph.D. candidate with the Australian Centre for Visual Technologies, University of Adelaide, Adelaide, SA, Australia. He received the bachelor's degree in computer science from the University of Adelaide, Adelaide, SA, Australia in 2012. His research interests include deep learning, object detection, and machine learning.

    Jun Li received the B.S. Degree in electrical engineering & automation from Nanjing Normal University, Nanjing, China, and the M.S. Degree in Control theory & engineering from Southeast University, Nanjing, China, in 2008 and 2011, respectively. He is currently working toward the Ph.D. Degree with School of Automation, Southeast University, Nanjing, China. His research interests include multimedia search and computer vision.

    Wankou Yang received the B.S., M.S. and Ph.D. degrees in the School of Computer Science and Technology, Nanjing University of Science and Technology (NUST), China, respectively in 2002, 2004, and 2009. From July 2009 to Aug. 2011, he worked as a Postdoctoral Fellow in the School of Automation, Southeast University, China. Since Sep. 2011, he has been an assistant professor in School of Automation, Southeast University. His research interests include pattern recognition, computer vision and machine learning.

    Baochang Zhang received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1999, 2001, and 2006, respectively. From 2006 to 2008, he was a Research Fellow with the Chinese University of Hong Kong, Hong Kong, and with Griffith University, Brisbane, Australia. Currently, he is an associate professor with the Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University, Beijing, China. He also holds a senior postdoc position in PAVIS Department, IIT, Italy. He was supported by the Program for New Century Excellent Talents in University of Ministry of Education of China. His current research interests include pattern recognition, machine learning, face recognition, and wavelets.

Changyin Sun is a professor in the School of Automation at Southeast University, China. He received the M.S. and Ph.D. degrees in Electrical Engineering from Southeast University, Nanjing, China, in 2001 and 2003, respectively. His research interests include intelligent control, neural networks, SVMs, pattern recognition, and optimal theory. He has received the First Prize of Nature Science of the Ministry of Education, China. Professor Sun is a member of the IEEE and an Associate Editor of IEEE Transactions on Neural Networks, Neural Processing Letters, the International Journal of Swarm Intelligence Research, and Recent Patents on Computer Science.
