Pedestrian detection by learning a mixture mask model and its implementation
Introduction
Detecting pedestrians in videos quickly and accurately is an important capability for industrial electronics and intelligent systems [29], [38], [39]. Many applications, such as intelligent transportation, smart cameras, and vehicle automation, rely on accurate pedestrian detection. Given this application potential, real-time pedestrian detection can be regarded as one of the most important object detection tasks. However, because the appearance of a pedestrian varies significantly across poses and camera views, low-level visual descriptors may be unreliable due to feature misalignment or even missing features.
Conventional pedestrian detection approaches can be roughly categorized into generative and discriminative models. For the former, a rich variety of local descriptors such as shape cues [15], texture cues [12], human parts [20], and silhouettes [19] are detected first. Afterward, these local visual features [30], [35] or combinations of them [6] are fed into a pre-trained pedestrian model to form a class-conditional density function. In combination with the class priors, the posterior probability for the pedestrian class is typically calculated through a Bayesian inference process [10]. The main drawback of these models is that a large number of local features must be manually labeled at the training stage to cover the entire feature space. This is highly time-consuming and may be computationally intractable for real-world applications. For the discriminative models, the sliding window scheme is typically applied: a video frame is densely scanned at every possible position and scale. For each sliding window, several visual descriptors such as the histogram of oriented gradients (HOG) [9], Haar wavelets [25], and intensity patches [2] are extracted. These visual descriptors are then processed by a subspace selection algorithm such as principal component analysis (PCA) or linear discriminant analysis (LDA) [33], [41], [42]. Afterward, they are fed into a classifier that is trained off-line from the labeled data. The classifier delivers positive responses to pedestrian areas and negative responses to background areas. As far as we know, the support vector machine (SVM) [44] and AdaBoost [14], [32] are the two most well-known classifiers. The conference version of this work was presented by Liu et al. [24], who describe a mixture mask model to detect pedestrians with different actions.
Notably, Liu et al.’s model is implemented in C#, and its training and test phases are highly time-consuming, which limits its real-world applications. Compared with generative pedestrian detection models, empirical results show that discriminative methods tend to achieve higher accuracy and a real-time system response.
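The sliding window scheme used by the discriminative models can be illustrated with a minimal sketch. The function name `sliding_windows`, the 64 × 128 base window, the 8-pixel stride, and the 1.25 scale step are illustrative assumptions, not values taken from this paper. The sketch only enumerates candidate boxes; descriptor extraction and classification would be applied to each box afterward.

```python
from typing import Iterator, Tuple

Box = Tuple[int, int, int, int]  # (x, y, width, height)


def sliding_windows(frame_w: int, frame_h: int,
                    win_w: int = 64, win_h: int = 128,
                    stride: int = 8,
                    scale_step: float = 1.25) -> Iterator[Box]:
    """Yield every window position at every scale that fits in the frame.

    All default values are hypothetical and chosen only for illustration.
    """
    scale = 1.0
    while True:
        w = int(round(win_w * scale))
        h = int(round(win_h * scale))
        if w > frame_w or h > frame_h:
            break  # the window no longer fits inside the frame
        step = max(1, int(round(stride * scale)))  # stride grows with scale
        for y in range(0, frame_h - h + 1, step):
            for x in range(0, frame_w - w + 1, step):
                yield (x, y, w, h)
        scale *= scale_step


windows = list(sliding_windows(640, 480))
print(len(windows), "candidate windows for a 640 x 480 frame")
```

Even with a coarse stride, a single frame produces a large number of candidate windows, which is why the cost of the per-window descriptor and classifier dominates the running time of such detectors.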
Many object detection models have been developed in the literature. In [11], Felzenszwalb et al. proposed an efficient and accurate object detection framework based on mixtures of multi-scale deformable part models. It discriminatively trains a set of classifiers using latent information and relies on efficiently matching deformable models to images. Object proposal models [5], [43] generate a set of potential bounding boxes at high speed and with high recall, and they outperform the less efficient sliding window algorithms for object detection. In addition to object proposals, action proposals [36], [47] have been introduced to detect humans in various poses in a video. In [36], a weakly supervised model based on multiple instance learning is developed to slide spatial-temporal sub-volumes for action detection. Moreover, a spatiotemporal branch-and-bound algorithm is deployed in [47] to alleviate the computational burden.
One disadvantage of sliding window-based approaches for pedestrian detection is the interference from non-person areas. Previous methods typically apply rectangular sliding windows to extract features, as in the example shown in Fig. 1. Rectangles cannot tightly fit the various human shapes. Thus, large non-person areas are included in the sliding windows and introduce noisy features. Features from the non-person areas provide statistical cues for localizing a pedestrian but also increase the feature dimensionality. That is, they decrease the discriminative power of the detection model. In this paper, we solve this problem by proposing a mixture mask model. A mask is defined as a binary mapping vector that projects a feature matrix into the mask space. To represent the human appearance flexibly, we roughly divide a pedestrian into three parts: human head, upper human body, and lower human body. Masks for different parts are carefully designed so that person areas and non-person areas can be optimally separated. By mixing masks of different parts, we project the original dense feature matrix into the mixture mask space, where only features from the person area are preserved for pedestrian detection (Fig. 1).
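The projection into the mixture mask space can be sketched as follows. This is a toy illustration under stated assumptions: the grid size, the descriptor length, and the rectangular part masks are hand-picked here, whereas the paper selects its masks by learning. The helper names `rect_mask`, `mix_masks`, and `project` are hypothetical.

```python
from typing import List

Mask = List[int]  # one 0/1 entry per grid cell, in row-major order


def rect_mask(h: int, w: int, top: int, bottom: int,
              left: int, right: int) -> Mask:
    """Mask that is 1 inside rows [top, bottom) and columns [left, right)."""
    return [1 if top <= r < bottom and left <= c < right else 0
            for r in range(h) for c in range(w)]


def mix_masks(masks: List[Mask]) -> Mask:
    """Combine part masks with an element-wise OR."""
    return [1 if any(bits) else 0 for bits in zip(*masks)]


def project(features: List[List[float]], mask: Mask) -> List[List[float]]:
    """Keep only the local descriptors whose grid cell is switched on."""
    return [f for f, keep in zip(features, mask) if keep]


# Hypothetical window: 8 rows x 4 columns of cells, 3-dimensional descriptors.
h, w, d = 8, 4, 3
features = [[float(i)] * d for i in range(h * w)]

head = rect_mask(h, w, 0, 2, 1, 3)    # narrow region at the top
upper = rect_mask(h, w, 2, 5, 0, 4)   # full-width torso region
lower = rect_mask(h, w, 5, 8, 1, 3)   # narrow leg region

mixture = mix_masks([head, upper, lower])
masked = project(features, mixture)
print(len(masked), "of", h * w, "descriptors kept")
```

In this toy layout the corner cells next to the head and legs are dropped, so the classifier sees a shorter feature vector that contains fewer background descriptors than the full rectangle.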
Recent technologies can guarantee the high performance of field-programmable gate arrays (FPGAs). Most FPGAs support dynamic partial reconfiguration (DPR) [16], which is highly flexible and makes it feasible to reconfigure the FPGA at runtime. Reconfigurable computing is an effective solution for computationally intensive applications. Based on the dynamic programming scheme, a reconfigurable computing system is computationally efficient and can maximize hardware utilization. In this work, we present a simulation of the proposed pedestrian tracking algorithm and its implementation on an FPGA board. Based on the FPGA implementation, the pedestrian tracking achieves a speed of over 30 fps at a 640 × 480 video resolution. This speed far exceeds that of a Matlab/C++ implementation on a desktop PC.
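To put the reported frame rate in perspective, the following back-of-the-envelope calculation derives the per-frame time budget and pixel throughput implied by 30 fps at 640 × 480. Since the text reports "over 30 fps", these numbers are bounds rather than measurements: the time budget is an upper bound and the throughput is a lower bound.

```python
# Figures taken from the text: 640 x 480 frames at (at least) 30 fps.
width, height, fps = 640, 480, 30

pixels_per_frame = width * height            # pixels in one frame
pixels_per_second = pixels_per_frame * fps   # minimum sustained throughput
frame_budget_ms = 1000.0 / fps               # maximum time per frame

print(f"{pixels_per_frame} pixels per frame")
print(f"{pixels_per_second} pixels per second")
print(f"{frame_budget_ms:.1f} ms per frame")
```

At this rate, all window scanning, feature extraction, and classification for one frame must finish in roughly 33 ms.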
The contributions of this paper can be summarized as follows:
- A mixture mask model is proposed to reduce the noise of non-person areas in pedestrian detection.
- Human geometry is leveraged to learn a part-based pedestrian representation. We propose a multiple instance learning (MIL) [8], [46] model to select the most discriminative masks for pedestrian detection.
- We implement the proposed pedestrian detection algorithm on an FPGA board, based on which a detection speed of over 30 fps can be achieved.
The rest of this paper is organized as follows: Sections 2 and 3 introduce the proposed pedestrian detection and its implementation on FPGA, respectively. Experimental results in Section 4 thoroughly demonstrate the effectiveness of our system. Section 5 concludes the whole paper.
Section snippets
The concept of mask features
Denote I as a sliding window in an image and as the feature matrix densely extracted from I, where the dimensionality of a local descriptor (i.e., a SIFT descriptor [23] extracted from each grid as shown in Fig. 2) is d and f denotes a SIFT descriptor. For each sliding window, there are h local descriptors along each column and w local descriptors along each row. In our approach, we assume a binary mask vector where is a Boolean set. Thus, our
Base system builder platform
All of the embedded development kit (EDK) designs are built upon a base system builder (BSB), which is a convenient platform with various building blocks. Associated with the industrial video processing kit (IVK), each EDK design is constructed based on the BSB platform. Rather than a specific design delivered with the kit, the BSB platform is the starting point from which all the other designs were built. The board we used to implement our pedestrian detection algorithm is Xilinx IVK Spartan-6
Experiments and analysis
This section evaluates our approach based on four experiments. The first experiment reports the performance of our proposed pedestrian detection on two benchmark data sets. The second experiment compares our approach with well-known human/object detection algorithms. Third, we evaluate the influence of important parameters.
Conclusions and future work
Real-time pedestrian detection from videos is an important task in modern intelligent systems [17], [22] and computer vision [27], [28]. This paper proposes a real-time system to detect pedestrians in videos. One key challenge is to quickly discover pose-invariant visual descriptors for SVM classification. We make three primary contributions to tackle this problem. 1) We extract a set of masks from videos. They can more accurately capture different pedestrians than the conventional rectangles.
Acknowledgments
This research was supported in part by the National Natural Science Foundation of China under Grant nos. 61572169 and 61472266, by the National University of Singapore (Suzhou) Research Institute, 377 Lin Quan Street, Suzhou Industrial Park, Jiang Su, People's Republic of China, 215123, and by the Fundamental Research Funds for the Central Universities.
References (47)
- et al. Pedestrian registration in static images with unconstrained background. Pattern Recogn. (2003)
- A reduced support vector machine approach for interval regression analysis. Inf. Sci. (2012)
- A ν-twin support vector machine (ν-TSVM) classifier and its geometric algorithms. Inf. Sci. (2010)
- et al. Personalized mode transductive spanning SVM classification tree. Inf. Sci. (2011)
- Building sparse twin support vector machine classifiers in primal space. Inf. Sci. (2011)
- et al. Image-based facial sketch-to-photo synthesis via online coupled dictionary learning. Inf. Sci. (2012)
- Spartan-6 industrial video processing kit c EDK reference design tutorial, Xilinx Inc....
- et al. Support vector machines for multiple-instance learning. Adv. Neural Inf. Process. Syst. (2003)
- et al. Learning to detect objects in images via a sparse, part-based representation. IEEE T-PAMI (2004)
- et al. Evaluating multiple object tracking performance: the CLEAR MOT metrics. J. Image Video Process. (2008)
- Real-time vision-based stop sign detection system on FPGA. Proceedings of DICTA
- BING: binarized normed gradients for objectness estimation at 300 fps. Proceedings of CVPR
- Statistical models of appearance for computer vision. Tech. Rep.
- On the algorithmic implementation of multi-class kernel-based vector machines. JMLR
- Histograms of oriented gradients for human detection. Proceedings of CVPR
- Monocular pedestrian detection: survey and experiments. IEEE T-PAMI
- Object detection with discriminatively trained part based models. IEEE T-PAMI
- Object detection with discriminatively trained part-based models. IEEE T-PAMI
- A decision-theoretic generalization of online learning and an application to boosting. Proceedings of European Conference Computational Learning Theory
- A Bayesian exemplar-based approach to hierarchical shape matching. IEEE T-PAMI
- Reconfigurable System Design and Verification
- Classification using intersection kernel support vector machines is efficient