
Neurocomputing

Volume 249, 2 August 2017, Pages 19-27

Filtered shallow-deep feature channels for pedestrian detection

https://doi.org/10.1016/j.neucom.2017.03.007

Abstract

The semantic segmentation task is closely related to detection and can clearly provide complementary information for it. In this paper, we propose integrating deep semantic segmentation feature maps into the original pedestrian detection framework, which combines feature channels with AdaBoost classifiers. First, we develop shallow-deep channels by concatenating shallow hand-crafted channels and deep segmentation channels to capture appearance cues as well as semantic attributes. Then a set of manually designed filters is applied to the new channels to generate additional response feature maps. Finally, a cascade AdaBoost classifier is learned for hard-negative selection and pedestrian detection. With this rich feature information, our proposed detector achieves superior results on the Caltech USA 10x and ETH datasets.

Introduction

Pedestrian detection has been a hot research topic in the past few years [1], [2], [3], [33]. It aims to return bounding boxes that enclose the pedestrians in given images. A majority of works follow all or part of a pipeline comprising preprocessing, foreground segmentation, object classification, refinement, and tracking [30]. Recent research shows two dominant pedestrian detection approaches, namely deep-learning-based convolutional neural networks (CNNs) [5], [6], [17] and boosted decision trees with hand-crafted features [2], [3] (e.g., Histogram of Oriented Gradients (HOG), Aggregated Channel Features (ACF)). In some scenarios, straightforward use of a deep learning framework [6] works no better than traditional methods [2]. Inspired by the close correlation between segmentation and detection [1], [4] and recent advances in semantic segmentation [10], [11], [12], many researchers attempt to use segmentation to aid the detection task. Cadena et al. [8] train a Support Vector Machine (SVM) classifier on segmentation features of proposals generated by the ACF detector, and then take the product of the SVM score and the original ACF score as the final acceptance decision. Based on the Regions with CNN features (R-CNN) framework, Gidaris and Komodakis [4] concatenate multi-region CNN features and CNN-based semantic segmentation-aware features as object representations. Hariharan et al. [9] extract features from bounding boxes and foreground regions by training separate networks.

In the above-mentioned works, segmentation feature extraction based on coarse bounding box annotations [4] or simple foreground segmentation [9] may lead to an insufficient description of objects in scenes. The public dataset of fine road-scene attribute annotations [13] and recent developments in CNNs [10], [11], [12] make it possible to learn more accurate semantic segmentation feature maps. Due to complicated scene layouts exemplified by severe occlusion and dense crowds in natural images, however, it is still difficult to obtain pedestrian bounding boxes directly from the semantic segmentation feature map. Relying on elaborately designed deep learning models, previous methods [1], [7] achieve excellent results. In fact, the R-CNN framework [2], [6] is even surpassed by a straightforward scheme in which the conventional shallow ACF features are combined with boosted decision trees. In this paper, we therefore consider using filtered shallow-deep features that fuse traditional ACF and deep semantic segmentation feature channels under the conventional framework [1], [5], [7].

As shown in Fig. 1, the semantic segmentation feature map (SSFM) describes each pixel with a probability vector over semantic classes, covering all objects in road scenes such as person, tree, road, pole, etc. With this capability of describing pixel-wise semantic attributes, the SSFM provides auxiliary information complementary to low-level appearance features [2], [3], which further improves detection performance.
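As an illustrative sketch (not the paper's actual model), an SSFM of this form can be obtained by applying a per-pixel softmax over a CNN's class logits; the class list below is a hypothetical subset of road-scene labels:

```python
import numpy as np

# Hypothetical road-scene label subset; the paper's actual label set
# follows the fine scene annotations it cites.
CLASSES = ["person", "tree", "road", "pole", "background"]

def semantic_segmentation_feature_map(logits):
    """Convert per-pixel CNN logits of shape (H, W, C) into a probability
    vector per pixel via a softmax over the class axis, yielding one
    feature channel per semantic class."""
    z = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: a 4x4 "image" with random logits over the 5 classes.
rng = np.random.default_rng(0)
ssfm = semantic_segmentation_feature_map(rng.normal(size=(4, 4, len(CLASSES))))
assert ssfm.shape == (4, 4, 5)
assert np.allclose(ssfm.sum(axis=-1), 1.0)  # a valid distribution per pixel
```

Each of the resulting C maps can then be treated as an extra feature channel alongside the hand-crafted ones.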

By concatenating the HOG+LUV channels (as in ACF [3]) and the SSFM (shown in Fig. 1), the shallow-deep channels possess both appearance descriptive power at the image level and deep semantic cues at the pixel level. Then, similar to the convolution operation in deep learning, a family of rectangular hand-designed filters of the same size is applied to the shallow-deep feature channels. Consequently, more abstract and compact feature maps are obtained after applying these manual filters to the original shallow-deep channels. Finally, the resulting feature vector is fed into the decision trees. The flowchart of our method is shown in Fig. 2.
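A minimal sketch of this concatenate-then-filter step, assuming random stand-in channels in place of real HOG+LUV and segmentation maps, and a tiny illustrative filter bank (uniform, horizontal-split, and vertical-split rectangles) in place of the paper's exact checkerboard-like set:

```python
import numpy as np

def checkerboard_like_filters(size=4):
    """A tiny bank of hand-designed rectangular filters (uniform,
    horizontal-split, vertical-split).  Illustrative only; the paper's
    bank is analogous to the checkerboard patterns it cites."""
    half = size // 2
    uniform = np.ones((size, size))
    horiz = np.ones((size, size)); horiz[half:, :] = -1.0
    vert = np.ones((size, size)); vert[:, half:] = -1.0
    return [f / size**2 for f in (uniform, horiz, vert)]

def filter_channels(channels, filters):
    """Cross-correlate every channel of an (H, W, K) tensor with every
    filter ('valid' mode) and stack all responses along the last axis."""
    H, W, K = channels.shape
    responses = []
    for k in range(K):
        for f in filters:
            fh, fw = f.shape
            resp = np.zeros((H - fh + 1, W - fw + 1))
            for i in range(resp.shape[0]):
                for j in range(resp.shape[1]):
                    resp[i, j] = np.sum(channels[i:i + fh, j:j + fw, k] * f)
            responses.append(resp)
    return np.stack(responses, axis=-1)

# Shallow-deep channels: 10 hand-crafted (HOG+LUV) + 5 segmentation maps.
rng = np.random.default_rng(1)
shallow = rng.random((16, 16, 10))   # stand-in for HOG+LUV channels
deep = rng.random((16, 16, 5))       # stand-in for the SSFM
shallow_deep = np.concatenate([shallow, deep], axis=-1)

feats = filter_channels(shallow_deep, checkerboard_like_filters(4))
assert feats.shape == (13, 13, 45)   # 15 channels x 3 filters
```

The flattened `feats` tensor plays the role of the feature vector passed to the boosted decision trees.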

The contributions of this paper are summarized as follows:

  • To our knowledge, this is the first work that integrates deep semantic segmentation feature maps into the process of learning the traditional ACF detector.

  • A set of hand-designed rectangular filters analogous to checkerboard filters [2] is applied to the shallow-deep channels to extract more discriminative feature information. Superior performance is achieved with only a small number of these checkerboard-like filters.

  • Filtered shallow-deep feature channels achieve a superior result on the Caltech pedestrian dataset, with a log-average miss rate of 16.87%, which significantly outperforms all hand-crafted features [3], [5], [15], [16] and some CNN-based methods [6], [17].

The remainder of this paper is structured as follows. In the next section, we review the related work. Then we introduce our proposed method in Section 3 and evaluate our method on two benchmarks in Section 4. Finally, the conclusion is presented in Section 5.

Section snippets

Related work

The past few years have witnessed massive efforts devoted to designing descriptive features for improving pedestrian detection performance. Without loss of generality, features in recent works can be categorized into two groups, namely hand-crafted features and CNN features. Recently, segmentation features have also been combined to improve detection performance. The related work below is introduced from these three aspects.

Hand-crafted features. After generating original feature points, Li et al.

ACF detector

Similar to the ACF detector, our approach is based on the traditional Viola–Jones object detection framework (VJ framework) [21], which consists of feature extraction and cascade classifier training. The process of training the ACF detector [3] can be summarized as follows: given an image, 10 feature channels, including the LUV color channels, the gradient magnitude channel, and the gradient histogram channels, are computed, from which lower-resolution channels are obtained by summing and smoothing. Then
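The channel computation and aggregation steps above can be sketched as follows; this toy version covers only the gradient-based channels on a grayscale image and omits the LUV color channels and the smoothing step:

```python
import numpy as np

def acf_channels(gray, n_orient=6, shrink=4):
    """Illustrative sketch of ACF-style channels on one grayscale image:
    gradient magnitude plus oriented gradient histogram channels,
    aggregated by summing over non-overlapping shrink x shrink blocks.
    The real detector also includes the three LUV color channels."""
    gy, gx = np.gradient(gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.mod(np.arctan2(gy, gx), np.pi)  # orientation in [0, pi)
    bins = np.minimum((ang / np.pi * n_orient).astype(int), n_orient - 1)

    H, W = gray.shape
    # Magnitude channel + one soft-binned histogram channel per orientation.
    chans = [mag] + [np.where(bins == b, mag, 0.0) for b in range(n_orient)]

    # Aggregate: sum each channel over shrink x shrink blocks (lower resolution).
    h, w = H // shrink, W // shrink
    agg = [c[:h * shrink, :w * shrink]
           .reshape(h, shrink, w, shrink).sum(axis=(1, 3)) for c in chans]
    return np.stack(agg, axis=-1)

img = np.tile(np.arange(32, dtype=float), (32, 1))  # horizontal intensity ramp
ch = acf_channels(img)
assert ch.shape == (8, 8, 7)            # magnitude + 6 orientation channels
assert np.allclose(ch[:, :, 0], 16.0)   # unit gradient summed over 4x4 blocks
```

Boosted decision trees are then trained on features read from these aggregated channels.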

Experiments

In this section, we conduct experiments on the Caltech [3] and ETH [25] datasets to verify the performance of our proposed pedestrian detector.

Datasets. The Caltech 10x dataset is one of the most popular pedestrian datasets [2]; it consists of 250k frames from urban traffic video. Each frame is annotated, and one out of every three frames is used for the training set. The ETH dataset has three sets of sequences (ETH0, ETH1, ETH2) containing 999, 451, and 354 frames, respectively.

Conclusion and future work

In this paper, we introduce new feature channels to describe more image information, on which simple checkerboard-like filters are applied. The deep semantic segmentation features generated by a powerful off-the-shelf CNN model are able to capture fine semantic attributes and provide complementary information for the original HOG+LUV channels. The set of checkerboard-like filters extends the aggregation operation in ACF methods and helps improve accuracy. The effectiveness of our method is

Acknowledgments

This work is supported by the National Natural Science Foundation (NNSF) of China under Grants 61473086 and 61375001, partly supported by the open fund of the Key Laboratory of Measurement and Control of Complex Systems of Engineering, Ministry of Education (No. MCCSE2013B01), the NSF of Jiangsu Province (Grants No. BK20140566 and BK20150470), and the China Postdoctoral Science Foundation (2014M561586).

Biyun Sheng received the B.S. and M.S. degrees from the School of Electrical and Information Engineering, Jiangsu University, Zhenjiang, China, in 2010 and 2013, respectively. She is now a Ph.D. student in the School of Automation at Southeast University, Nanjing, China.

References (38)

  • J. Shen et al., Learning discriminative shape statistics distribution features for pedestrian detection, Neurocomputing (2016)
  • Y. Tian et al., Pedestrian detection aided by deep learning semantic tasks, Proceedings of the ICCV (2015)
  • S. Zhang et al., Filtered channel features for pedestrian detection, Proceedings of the CVPR (2015)
  • P. Dollar et al., Fast feature pyramids for object detection, PAMI (2014)
  • S. Gidaris et al., Object detection via a multi-region & semantic segmentation-aware CNN model, Proceedings of the ICCV (2015)
  • Q. Hu, P. Wang, C. Shen, A. Hengel, F. Porikli, Pushing the limits of deep CNNs for pedestrian detection, ArXiv:...
  • J. Hosang et al., Taking a deeper look at pedestrians, Proceedings of the CVPR (2015)
  • Z. Cai et al., Learning complexity-aware cascades for deep pedestrian detection, Proceedings of the ICCV (2015)
  • C. Cadena et al., A fast, modular scene understanding system using context-aware object, Proceedings of the ICRA (2015)
  • B. Hariharan et al., Simultaneous detection and segmentation, Proceedings of the ECCV (2014)
  • J. Long et al., Fully convolutional networks for semantic segmentation, Proceedings of the CVPR (2015)
  • L. Chen et al., Semantic image segmentation with deep convolutional nets and fully connected CRFs, Proceedings of the ICLR (2015)
  • G. Lin et al., Efficient piecewise training of deep structured models for semantic segmentation, Proceedings of the CVPR (2016)
  • M. Cordts et al., The cityscapes dataset for semantic urban scene understanding, Proceedings of the CVPR (2016)
  • N. Dalal et al., Histograms of oriented gradients for human detection, Proceedings of the CVPR (2005)
  • X. Wang et al., An HOG-LBP human detector with partial occlusion handling, Proceedings of the ICCV (2009)
  • S. Paisitkriangkrai et al., Strengthening the effectiveness of pedestrian detection with spatially pooled features, Proceedings of the ECCV (2014)
  • B. Yang et al., Convolutional channel features, Proceedings of the ICCV (2015)


Qichang Hu is a Ph.D. candidate with the Australian Centre for Visual Technologies, University of Adelaide, Adelaide, SA, Australia. He received the bachelor's degree in computer science from the University of Adelaide, Adelaide, SA, Australia in 2012. His research interests include deep learning, object detection, and machine learning.

    Jun Li received the B.S. Degree in electrical engineering & automation from Nanjing Normal University, Nanjing, China, and the M.S. Degree in Control theory & engineering from Southeast University, Nanjing, China, in 2008 and 2011, respectively. He is currently working toward the Ph.D. Degree with School of Automation, Southeast University, Nanjing, China. His research interests include multimedia search and computer vision.

    Wankou Yang received the B.S., M.S. and Ph.D. degrees in the School of Computer Science and Technology, Nanjing University of Science and Technology (NUST), China, respectively in 2002, 2004, and 2009. From July 2009 to Aug. 2011, he worked as a Postdoctoral Fellow in the School of Automation, Southeast University, China. Since Sep. 2011, he has been an assistant professor in School of Automation, Southeast University. His research interests include pattern recognition, computer vision and machine learning.

    Baochang Zhang received the B.S., M.S., and Ph.D. degrees in computer science from the Harbin Institute of Technology, Harbin, China, in 1999, 2001, and 2006, respectively. From 2006 to 2008, he was a Research Fellow with the Chinese University of Hong Kong, Hong Kong, and with Griffith University, Brisbane, Australia. Currently, he is an associate professor with the Science and Technology on Aircraft Control Laboratory, School of Automation Science and Electrical Engineering, Beihang University, Beijing, China. He also holds a senior postdoc position in PAVIS Department, IIT, Italy. He was supported by the Program for New Century Excellent Talents in University of Ministry of Education of China. His current research interests include pattern recognition, machine learning, face recognition, and wavelets.

Changyin Sun is a professor in the School of Automation at Southeast University, China. He received the M.S. and Ph.D. degrees in Electrical Engineering from Southeast University, Nanjing, China, in 2001 and 2003, respectively. His research interests include intelligent control, neural networks, SVMs, pattern recognition, and optimal theory. He has received the First Prize of Nature Science of the Ministry of Education, China. Professor Sun is a member of the IEEE and an Associate Editor of IEEE Transactions on Neural Networks, Neural Processing Letters, the International Journal of Swarm Intelligence Research, and Recent Patents on Computer Science.
