
Information Sciences

Volume 372, 1 December 2016, Pages 148-161

Pedestrian detection by learning a mixture mask model and its implementation

https://doi.org/10.1016/j.ins.2016.08.050

Abstract

Pedestrian detection from videos is a useful technique in intelligent transportation systems. A key challenge of accurate pedestrian detection is the large variation in pedestrian appearance across different poses and camera views, which makes generic visual descriptors unreliable for real-world pedestrian detection. In this paper, we propose a high-level human-specific descriptor for detecting pedestrians in multiple videos. More specifically, after obtaining the feature matrix from a sliding window, we use multiple mapping vectors to project the original feature matrix into different mask spaces. Inspired by the part-based model [12], we formulate pedestrian detection as a multiple-instance learning (MIL) problem and solve it with an MI-SVM [9]. To evaluate the proposed detection algorithm, we implement it on an FPGA, where it processes over 30 fps. Moreover, our method outperforms many existing object detection algorithms in terms of accuracy.

Introduction

Quickly and accurately detecting pedestrians in videos is an important capability for industrial electronics and intelligent systems [29], [38], [39]. Many applications, such as intelligent transportation, smart cameras, and vehicle automation, rely on accurate pedestrian detection. Given this application potential, real-time pedestrian detection can be deemed one of the most important object detection tasks. However, because a pedestrian's appearance varies significantly across poses and camera views, low-level visual descriptors can be unreliable due to feature misalignment or even missing features.

Conventional pedestrian detection approaches can be roughly categorized into generative and discriminative models. For the former, a rich variety of local descriptors, such as shape cues [15], texture cues [12], human parts [20], and silhouettes [19], are detected first. Afterward, these local visual features [30], [35], or combinations of them [6], are fed into a pre-trained pedestrian model to form a class-conditional density function. In combination with the class priors, the posterior probability for the pedestrian class is typically calculated through Bayesian inference [10]. The main drawback of these models is that a large number of local features must be manually labeled at the training stage to cover the entire feature space, which is highly time-consuming and may be computationally intractable for real-world applications. For discriminative models, the sliding window scheme is typically applied: a video frame is densely scanned at every possible position and scale. For each sliding window, visual descriptors such as the histogram of oriented gradients (HOG) [9], Haar wavelets [25], and intensity patches [2] are extracted. These descriptors are then processed by a subspace selection algorithm such as principal component analysis (PCA) or linear discriminant analysis (LDA) [33], [41], [42] and fed into a classifier trained offline from labeled data. The classifier delivers positive responses to pedestrian areas and negative responses to background areas. The support vector machine (SVM) [44] and AdaBoost [14], [32] are the two most widely used classifiers. The conference version of this work was presented by Liu et al. [24], who describe a mixture mask model for detecting pedestrians with different actions. Notably, Liu et al.'s model is implemented in C#, and its training and test phases are highly time-consuming, which limits its real-world applications. Compared with generative pedestrian detection models, empirical results show that discriminative methods more easily achieve higher accuracy and real-time system response.
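
To make the discriminative pipeline concrete, the sketch below scans a frame with a fixed-size window, extracts a HOG descriptor per window, and scores it with a linear SVM trained offline. This is only an illustration of the generic scheme described above, not the implementation used in this paper; the window size, stride, and HOG parameters are assumptions.

```python
import numpy as np
from skimage.feature import hog
from sklearn.svm import LinearSVC

def scan_frame(frame, clf, win_h=128, win_w=64, stride=8, threshold=0.0):
    """Slide a fixed-size window over a grayscale frame and return the
    top-left corners (and scores) of windows classified as pedestrian."""
    detections = []
    for y in range(0, frame.shape[0] - win_h + 1, stride):
        for x in range(0, frame.shape[1] - win_w + 1, stride):
            window = frame[y:y + win_h, x:x + win_w]
            # HOG descriptor of the window (Dalal-Triggs style parameters).
            feat = hog(window, orientations=9, pixels_per_cell=(8, 8),
                       cells_per_block=(2, 2), block_norm='L2-Hys')
            score = clf.decision_function(feat.reshape(1, -1))[0]
            if score > threshold:
                detections.append((x, y, score))
    return detections

# Offline training from labeled window descriptors:
# clf = LinearSVC(C=1.0).fit(X_train, y_train)
```

Multi-scale detection (rescaling the frame and repeating the scan) is omitted for brevity.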

Many object detection models have been developed in the literature. In [11], Felzenszwalb et al. proposed an efficient and accurate object detection framework based on mixtures of multi-scale deformable part models; it discriminatively trains a set of classifiers using latent information and relies on efficiently matching deformable models to images. Object proposal models [5], [43] generate a set of potential bounding boxes with high recall at low computational cost, outperforming inefficient sliding-window search. In addition to object proposals, action proposals [36], [47] have been introduced to detect humans in various poses in a video. In [36], a weakly supervised model based on multiple-instance learning slides spatio-temporal sub-volumes for action detection, and a spatio-temporal branch-and-bound algorithm is deployed in [47] to alleviate the computational burden.

One disadvantage of sliding window-based approaches for pedestrian detection is the interference caused by non-person areas. Previous methods typically apply rectangular sliding windows to extract features, as in the example shown in Fig. 1. Obviously, rectangles cannot seamlessly fit the variety of human shapes, so large non-person areas are included in the sliding windows and introduce noisy features. Features from non-person areas provide some statistical cues for localizing a pedestrian, but they also increase the feature dimensionality and thus decrease the discriminative power of the detection model. In this paper, we solve this problem by proposing a mixture mask model. A mask is defined as a binary mapping vector that projects a feature matrix into the mask space. To represent the human appearance flexibly, we roughly divide a pedestrian into three parts: the head, the upper body, and the lower body. Masks for the different parts are carefully designed so that person and non-person areas can be optimally separated. By mixing the masks of the different parts, we project the original dense feature matrix into the mixture mask space, where only features from the person area are preserved for pedestrian detection (Fig. 1).
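
A minimal sketch of the mask idea, assuming each window is divided into an h × w grid of local descriptors: one binary mask per body part (head, upper body, lower body) keeps the grid cells that typically cover that part, and the part masks are mixed into a single mixture mask. The particular mask shapes below are illustrative assumptions, not the masks learned by our model.

```python
import numpy as np

def part_masks(h=16, w=8):
    """Hand-designed binary masks over an h x w descriptor grid for three
    pedestrian parts; 1 keeps a grid cell's descriptor, 0 discards it."""
    head = np.zeros((h, w), dtype=int)
    head[:h // 4, w // 4:3 * w // 4] = 1      # narrow band at the top
    upper = np.zeros((h, w), dtype=int)
    upper[h // 4:h // 2, 1:w - 1] = 1         # torso and arms
    lower = np.zeros((h, w), dtype=int)
    lower[h // 2:, w // 4:3 * w // 4] = 1     # legs
    return head, upper, lower

def mixture_mask(masks):
    """Mix several part masks into one binary mask over the grid."""
    return (np.sum(masks, axis=0) > 0).astype(int)

head, upper, lower = part_masks()
M = mixture_mask([head, upper, lower])        # h x w binary mixture mask
```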

Recent technologies guarantee the high performance of field-programmable gate arrays (FPGAs). Most FPGAs support dynamic partial reconfiguration (DPR) [16], which is highly flexible and makes it feasible to reconfigure the FPGA at runtime. Reconfigurable computing is an effective solution for computationally intensive applications: based on a dynamic programming scheme, a reconfigurable computing system is computationally efficient and maximizes hardware utilization. In this work, we present a simulation of the proposed pedestrian detection algorithm and its implementation on an FPGA board. With the FPGA implementation, pedestrian detection runs at over 30 fps at a 640 × 480 video resolution, far exceeding the speed of a Matlab/C++ implementation on a desktop PC.

The contributions of this paper can be summarized as follows:

  • A mixture mask model is proposed to reduce the noise of non-person areas in pedestrian detection.

  • Human geometry is leveraged to learn a part-based pedestrian representation. We propose a multiple-instance learning (MIL) [8], [46] model to select the most discriminative masks for pedestrian detection (a minimal sketch of this formulation follows this list).

  • We implement the proposed pedestrian detection algorithm on an FPGA board, achieving a detection speed of over 30 fps.
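
As a rough illustration of the MIL formulation mentioned above, the sketch below treats every positive training window as a bag whose instances are the window's features projected through different candidate masks, and runs an MI-SVM-style loop [9] that alternates between selecting each positive bag's highest-scoring instance (its most discriminative mask) and retraining a linear SVM. The data shapes and the initialization are assumptions for illustration, not the exact optimization used in the paper.

```python
import numpy as np
from sklearn.svm import LinearSVC

def mi_svm(pos_bags, neg_instances, n_iters=5):
    """MI-SVM-style training. pos_bags: list of (n_masks, dim) arrays, one
    masked feature vector per candidate mask; neg_instances: (n_neg, dim)."""
    # Initialize each positive bag's witness as the mean of its instances.
    witnesses = np.stack([bag.mean(axis=0) for bag in pos_bags])
    clf = None
    for _ in range(n_iters):
        X = np.vstack([witnesses, neg_instances])
        y = np.concatenate([np.ones(len(witnesses)),
                            np.zeros(len(neg_instances))])
        clf = LinearSVC(C=1.0).fit(X, y)
        # Re-select each bag's witness: the masked projection scored highest.
        witnesses = np.stack([bag[np.argmax(clf.decision_function(bag))]
                              for bag in pos_bags])
    return clf

# Toy example with hypothetical sizes: 50 positive bags of 6 masks each,
# 200 negative (background) windows, 128-dimensional masked features.
rng = np.random.default_rng(0)
pos = [rng.normal(1.0, 1.0, size=(6, 128)) for _ in range(50)]
neg = rng.normal(0.0, 1.0, size=(200, 128))
model = mi_svm(pos, neg)
```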

The rest of this paper is organized as follows: Sections 2 and 3 introduce the proposed pedestrian detection and its implementation on FPGA, respectively. Experimental results in Section 4 thoroughly demonstrate the effectiveness of our system. Section 5 concludes the whole paper.

The concept of mask features

Denote by I a sliding window in an image and by Φ(I) = [f_1, f_2, …, f_{h×w}] ∈ ℝ^{d×(h×w)} the feature matrix densely extracted from I, where each column f_i is a SIFT descriptor [23] of dimensionality d extracted from one grid cell, as shown in Fig. 2. For each sliding window, there are h local descriptors along each column and w local descriptors along each row. In our approach, we assume a binary mask vector M = [m_1, m_2, …, m_{h×w}] ∈ B^{h×w}, where B = {0, 1} is a Boolean set. Thus, our
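
The fragment below illustrates this notation under assumed sizes (h = 16, w = 8 grid cells, d = 128 for SIFT): the dense feature matrix Φ(I) stacks one descriptor per grid cell, and the binary mask vector M selects which columns survive the projection into the mask space. Reading the mask as a column selector is the natural interpretation of the definition above, not necessarily the exact operator used later in the paper.

```python
import numpy as np

h, w, d = 16, 8, 128                  # grid cells per window, SIFT dimension
Phi = np.random.rand(d, h * w)        # dense feature matrix Phi(I), one column per grid cell
M = np.random.randint(0, 2, h * w)    # binary mask vector M in {0, 1}^(h*w)

# Project into the mask space: keep only descriptors whose mask entry is 1.
masked = Phi[:, M == 1]
print(Phi.shape, "->", masked.shape)  # (d, h*w) -> (d, number of kept cells)
```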

Base system builder platform

All of the embedded development kit (EDK) designs are built upon the base system builder (BSB), a convenient platform with various building blocks. Associated with the industrial video processing kit (IVK), each EDK design is constructed on the BSB platform. Rather than being a specific design delivered with the kit, the BSB platform is the starting point from which all other designs are built. The board we used to implement our pedestrian detection algorithm is the Xilinx IVK Spartan-6

Experiments and analysis

This section evaluates our approach based on four experiments. The first experiment reports the performance of our proposed pedestrian detection on two benchmark data sets. The second experiment compares our approach with well-known human/object detection algorithms. Third, we evaluate the influence of important parameters.

Conclusions and future work

Real-time pedestrian detection from videos is an important task in modern intelligent systems [17], [22] and computer vision [27], [28]. This paper proposes a real-time system to detect pedestrians in videos. One key challenge is to quickly discover pose-invariant visual descriptors for SVM classification. We make three primary contributions to tackle this problem. 1) We extract a set of masks from videos. They can more accurately capture different pedestrians than the conventional rectangles.

Acknowledgments

This research was supported in part by the National Natural Science Foundation of China under Grant nos. 61572169 and 61472266, by the National University of Singapore (Suzhou) Research Institute, 377 Lin Quan Street, Suzhou Industrial Park, Jiang Su, People's Republic of China, 215123, and by the Fundamental Research Funds for the Central Universities.

References (47)

  • T.P. Cao et al.

    Real-time vision-based stop sign detection system on FPGA

    Proceedings of DICTA

    (2008)
  • M. Cheng et al.

    BING: binarized normed gradients for objectness estimation at 300 fps

    Proceedings of CVPR

    (2014)
  • T.F. Cootes et al.

    Statistical models of appearance for computer vision

    Tech. Rep.

    (2004)
  • CAVIAR dataset [online], 2004, Available:...
  • K. Crammer et al.

    On the algorithmic implementation of multi-class kernel-based vector machines

    JMLR

    (2002)
  • N. Dalal et al.

    Histograms of oriented gradients for human detection

    Proceedings of CVPR

    (2005)
  • M. Enzweiler et al.

    Monocular pedestrian detection: survey and experiments

    IEEE T-PAMI

    (2009)
  • P. Felzenszwalb et al.

    Object detection with discriminatively trained part based models

    IEEE T-PAMI

    (2009)
  • P. Felzenszwalb et al.

    Object detection with discriminatively trained part-based models

    IEEE T-PAMI

    (2009)
  • Y. Freund et al.

    A decision-theoretic generalization of on-line learning and an application to boosting

    Proceedings of the European Conference on Computational Learning Theory

    (1995)
  • D.M. Gavrila

    A Bayesian exemplar-based approach to hierarchical shape matching

    IEEE T-PAMI

    (2007)
  • P.-A. Hsiung et al.

    Reconfigurable System Design and Verification

    (2009)
  • S. Maji et al.

    Classification using intersection kernel support vector machines is efficient

    Proceedings of CVPR

    (2008)