Personness estimation for real-time human detection on mobile devices
Introduction
Vast numbers of pictures of people are captured and stored daily by mobile devices such as digital cameras and mobile phones. As a result, human detection on mobile devices has attracted significant research interest in recent years. Applications of human detection include human tracking, human segmentation for automatic backlight compensation, and selfie enhancement (Kim, Oh, & Sohn, 2016). Since the introduction of the discriminatively trained part-based model by Felzenszwalb, Girshick, McAllester, and Ramanan (2010), the deformable part model (DPM) and its variants have become increasingly popular for human detection (Benenson, Omran, Hosang, & Schiele, 2014; Sadeghi & Forsyth, 2014). However, the practical applicability of DPM human detection is limited by the significant computational overhead on mobile devices.
DPM detectors construct a feature pyramid of multiscale feature maps and search each feature map through a sliding window. DPM detectors also require several mixture models describing various poses and viewpoints. Each mixture model contains one root filter representing the overall object shape at a low resolution and several part filters representing different object parts at a higher resolution. These sophisticated procedures and configurations improve the detection rate of DPM. However, computing the filter scores requires large computational resources because the sliding window approach performs many convolutions between the filters and the feature maps. For example, for human detection in a (375 × 500)-pixel image, the OpenCV (Bradski & Kaehler, 2008) implementation of DPM constructs a feature pyramid with 33 scale levels and performs 1,786,962 convolution operations between this feature pyramid and 14 filters. Such convolution operations can dominate the total detection time (approximately 0.75 s, or 53.47% of the total detection time, on a regular PC).
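To make the cost of the exhaustive search concrete, the following minimal sketch counts how many filter placements a dense sliding-window search evaluates over a feature pyramid. It is an illustration only, not the OpenCV implementation: the cell size, the number of levels per octave, the minimum map size, and the two filter shapes are assumed values chosen for the example.

```python
def pyramid_levels(height, width, cell=8, levels_per_octave=10, min_cells=5):
    """Return the (rows, cols) size, in cells, of each feature map.

    All parameters are illustrative assumptions: `cell` is the feature cell
    size in pixels, and the pyramid stops once a map has fewer than
    `min_cells` cells along its shorter side.
    """
    scale_step = 2 ** (1.0 / levels_per_octave)
    maps = []
    level = 0
    while True:
        scale = 1.0 / (scale_step ** level)
        rows = int(height * scale) // cell
        cols = int(width * scale) // cell
        if min(rows, cols) < min_cells:
            break
        maps.append((rows, cols))
        level += 1
    return maps


def count_placements(maps, filters):
    """Count every valid (filter, level, position) triple of a dense search."""
    total = 0
    for rows, cols in maps:
        for fh, fw in filters:
            # A filter larger than the map contributes no placements.
            total += max(rows - fh + 1, 0) * max(cols - fw + 1, 0)
    return total


if __name__ == "__main__":
    maps = pyramid_levels(375, 500)
    filters = [(11, 4), (6, 6)]  # hypothetical root and part filter sizes, in cells
    print(len(maps), "levels,", count_placements(maps, filters), "placements")
```

The count grows with the number of levels, the number of filters, and the image resolution, which is why restricting the search to a small set of promising windows can pay off on constrained hardware.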
In order to accelerate the object detection task, many researchers have optimized the DPM procedure by improving the algorithms and by exploiting hardware-specific features such as complex instructions and multiple GPUs on a desktop PC (Benenson, Mathias, Timofte, & Van Gool, 2012; Sadeghi & Forsyth, 2014). However, once software is developed and submitted to an application store, the algorithm may be executed on a variety of devices with different specifications. A more significant problem is that the processors in mobile devices are designed for low power consumption and lack high-performance CPUs with complex instructions or a sufficient number of GPU cores to boost the algorithm speed. Therefore, to implement DPM on mobile devices, the search for target objects should start from the most promising windows. Like other algorithms, detection algorithms executing on mobile platforms are time-constrained. Consequently, computationally intensive detection algorithms will degrade the performance of the whole system and inconvenience users. Considering the high-resolution imaging and hardware restrictions of mobile devices, the impracticality of an exhaustive sliding window search becomes obvious.
Detection proposal generation (or objectness measurement) has recently emerged as an alternative object detection technique (Hosang, Benenson, Dollár, & Schiele, 2015). A detection proposal method generates candidate windows that probably contain generic objects, avoiding exhaustive searching. Its aim of improving the detection speed appears to be perfectly matched with real-time detection. However, whereas our DPM implementation consumes approximately 200 ms searching over all multi-scale feature maps on a regular PC, most existing detection proposal methods consume more than 250 ms on the same device (Hosang et al., 2015). In real-time detection, the time required for generating candidate windows at the preprocessing stage should be markedly less than the actual detection time. Therefore, the detection proposal method must be significantly faster than the exhaustive search of real-time detection.
In the existing methods for detection proposals (Hosang et al., 2015), the generated candidate windows are generic over categories. Consequently, these methods extract object segments or well-defined boundaries by solving complex segmentation problems (Alexe, Deselaers, & Ferrari, 2012; Carreira & Sminchisescu, 2012; Chen, Ma, Wang, & Zhao, 2015; Humayun, Li, & Rehg, 2014; Manen, Guillaumin, & Van Gool, 2013; Uijlings, van de Sande, Gevers, & Smeulders, 2013) or by performing sophisticated edge detection (Krähenbühl & Koltun, 2014; Zitnick & Dollár, 2014). However, the computational overhead of exploring unseen categories is too high for real-time processing. Furthermore, a large number of windows are generated for all possible objects, which reduces the speed of the category-specific detectors in the latter stage. To resolve these problems and achieve real-time frame rates on mobile devices, we concentrate on categories that are relevant to the situation. When only the person category is relevant, simultaneously considering all possible categories is a substantial waste of computational resources. Therefore, we propose a more efficient and accurate method that estimates person windows in an image, while ignoring category-agnostic candidate windows. The proposed method efficiently utilizes simple color and edge features, as explained in Section 3. In this respect, our approach correlates strongly with the human visual system, in the sense that human attentional mechanisms also preferentially note simple features such as color and orientation when isolating possible candidates in distracting backgrounds (Wolfe & Horowitz, 2004). For convenience, we refer to ‘objectness estimation for people’ as personness estimation. Examples of human detection by personness estimation are presented in Fig. 1.
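The idea of ranking windows by a linear score over a simple edge feature can be sketched as follows. This is a minimal illustration in the spirit of the normed gradients of Cheng et al. (2014), where the gradient magnitude is approximated by min(|gx| + |gy|, 255) and each 8 × 8 window yields a 64-dimensional feature. It is not the authors' implementation: the color attributes are omitted, and the weight vector is a placeholder assumption standing in for weights that a linear SVM would learn.

```python
def normed_gradients(gray):
    """Return min(|gx| + |gy|, 255) per pixel, using central differences.

    `gray` is a list of rows of intensities. The gradient along an axis is
    set to 0 at border pixels where a neighbour is missing.
    """
    h, w = len(gray), len(gray[0])
    ng = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            gx = gray[y][x + 1] - gray[y][x - 1] if 0 < x < w - 1 else 0
            gy = gray[y + 1][x] - gray[y - 1][x] if 0 < y < h - 1 else 0
            ng[y][x] = min(abs(gx) + abs(gy), 255)
    return ng


def window_score(ng, top, left, weights, bias=0.0, size=8):
    """Linear score w . x + b of one size x size window of the NG map."""
    feature = [ng[top + dy][left + dx] for dy in range(size) for dx in range(size)]
    return sum(wi * xi for wi, xi in zip(weights, feature)) + bias


def top_windows(ng, weights, k, size=8):
    """Score every window position and keep the k highest-scoring ones."""
    h, w = len(ng), len(ng[0])
    scored = [
        (window_score(ng, t, l, weights, size=size), t, l)
        for t in range(h - size + 1)
        for l in range(w - size + 1)
    ]
    return sorted(scored, reverse=True)[:k]


if __name__ == "__main__":
    # Toy image: dark left half, bright right half (one vertical edge).
    image = [[0] * 6 + [200] * 6 for _ in range(10)]
    ng_map = normed_gradients(image)
    placeholder_weights = [1.0] * 64  # assumption: stands in for learned SVM weights
    print(top_windows(ng_map, placeholder_weights, k=3))
```

In practice the image would be resized to several scales and aspect ratios before scoring, so that a fixed 8 × 8 window corresponds to boxes of different shapes in the original image.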
1. We present a fast and accurate personness estimation and demonstrate its effectiveness on a low-power mobile processor. The personness estimation rapidly captures the important edge and color features of the person category from the normed gradients (Cheng et al., 2014) and color attributes (Van De Weijer, Schmid, Verbeek, & Larlus, 2009). In this way, our approach generates a limited number of windows using the linear support vector machine (SVM). Evaluated on the person category of the PASCAL VOC dataset (Everingham, Van Gool, Williams, Winn, & Zisserman, 2010), the detection proposals generated by personness estimation allow the DPM detector to obtain more than 50% of its original performance within a 20 ms window search on a low-power mobile processor. The window search process includes both window generation and convolution calculation.
2. We show the improved use of detection proposals by the DPM detector. On mobile devices, much importance should be placed on interruptible object detection, or anytime detection, which yields reasonable results even before all tasks are complete (Karayev, Fritz, & Darrell, 2014; Sadeghi & Forsyth, 2014). To improve the anytime performance (Karayev et al., 2014), our DPM design efficiently computes the filter responses by imposing time constraints on the provided candidate windows. The DPM implementation also considers two important factors, the aspect-ratio threshold and the patch size for pinpointing, to achieve better detection performance using window proposals (see Section 3.4).
3. We evaluate the detection proposal methods for real-time DPM detection using a novel measure, the recall-time curve. As speed is a critical factor in comparing detection proposal methods for anytime detection, it should be considered in the evaluation methodology. Our recall-time methodology simultaneously evaluates the speed and quality of detection proposal methods. Specifically, the recall-time curve indicates the extent to which the proposal generator supports the subsequent object-specific detector in a given time. Hence, the recall-time curves identify the proposal generator that best balances the speed and quality of the detection.
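One plausible way to compute a recall-time curve is sketched below. The exact definition is given later in the paper and is not reproduced here, so this is an assumed formulation: proposals are consumed in descending score order, generating them costs a fixed `gen_ms`, evaluating each window costs a constant `per_window_ms`, and a ground-truth box counts as recalled when some evaluated proposal overlaps it with an intersection over union (IoU) of at least 0.5. All function and parameter names are hypothetical.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def recall_time_curve(proposals, ground_truth, gen_ms, per_window_ms, budgets_ms, thr=0.5):
    """Return (budget, recall) pairs for each time budget in milliseconds.

    proposals: boxes sorted by descending score.
    ground_truth: non-empty list of annotated boxes.
    gen_ms: time spent generating proposals before any window is evaluated.
    per_window_ms: assumed constant cost of evaluating one window.
    """
    curve = []
    for budget in budgets_ms:
        # Number of windows that fit in the time left after proposal generation.
        n = max(0, int((budget - gen_ms) // per_window_ms))
        used = proposals[:n]
        hit = sum(1 for g in ground_truth if any(iou(g, p) >= thr for p in used))
        curve.append((budget, hit / len(ground_truth)))
    return curve


if __name__ == "__main__":
    proposals = [(0, 0, 10, 10), (50, 50, 60, 60)]
    ground_truth = [(0, 0, 10, 10), (50, 50, 60, 62)]
    print(recall_time_curve(proposals, ground_truth, gen_ms=2, per_window_ms=1,
                            budgets_ms=[1, 3, 4]))
```

Under this formulation, a slow generator shifts the whole curve to the right because `gen_ms` is paid before any window is evaluated, while a generator that ranks good windows early makes the curve rise steeply.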
The present study introduces several improvements to our preliminary study (Kim & Sohn, 2015). First, the skin color feature is replaced with the color attributes (Van De Weijer et al., 2009), which might generalize the proposed method to categories other than people. Second, our present experiments are performed on a real mobile device (a Samsung Galaxy Note5). Finally, an additional comparison is performed with a state-of-the-art detection proposal method, Edge-Boxes.
The rest of this paper is organized as follows. In Section 2, we briefly review recent works on proposal generation. Section 3 explains the proposed personness estimation. Our experimental results and conclusions are presented in Sections 4 and 5, respectively.
Related work
When humans view an object, they perceive an independent, stand-alone entity, regardless of whether they can name that entity. Likewise, assuming that such human attributes can be mimicked by good algorithms, many researchers have developed detection proposal methods that very likely enclose objects in rectangular bounding boxes (BBs) or pixel-level masks. This section reviews some of the major studies on detection proposals, which can be broadly categorized into segment-based approaches and
Proposed method
We choose the following two features to take the discriminative approach on the PASCAL VOC (Everingham et al., 2010):
Edge. As one can see from the success of HOG (Dalal & Triggs, 2005), various strong edges (or oriented gradients) are identified in and around objects. Thus, the performance of object detection can be boosted by category-specific learning of edges. In our proposal generation, we adopt the normed gradients (NGs) (Cheng et al., 2014) as an edge feature and rapidly determine the
Experimental settings
We compared our personness estimation with the NG, BING, random guess, sliding windows, and RAND-SCORE (RS) (Zhao et al., 2014) methods. Zhao et al. (2014) reported that their RS method generates candidate windows for IoUs above 0.5. Among the detection proposal methods (Hosang et al., 2015), we could evaluate only BING and NG; the other methods execute more slowly than the sliding window approach of our DPM implementation. However, we evaluate and discuss the Edge-Boxes algorithm (Zitnick &
Conclusions and future work
Our proposed personness measure, designed for anytime detection, generates promising object windows within a short time frame. In addition to the normed gradients, the personness measure elaborately incorporates color attributes into the proposal generation. In order to demonstrate the efficiency and practicality of the personness measure, we introduced recall-time curves and effectively exploited the personness estimation in the anytime DPM detection. In experiments on the PASCAL VOC 2007/2012
Acknowledgment
This work was supported by an Institute for Information & communications Technology Promotion (IITP) grant funded by the Korea government (MSIP) (No. R0115-16-1007).
References
- et al. (2012). SLIC superpixels compared to state-of-the-art superpixel methods. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- et al. (2012). Measuring the objectness of image windows. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- et al. (2012). Pedestrian detection at 100 frames per second. Proceedings of the IEEE conference on computer vision and pattern recognition.
- et al. (2014). Ten years of pedestrian detection, what have we learned? Proceedings of the European conference on computer vision.
- et al. (1991). Basic color terms: Their universality and evolution.
- et al. (2008). Learning OpenCV: Computer vision with the OpenCV library.
- et al. (2010). BRIEF: Binary robust independent elementary features. Proceedings of the European conference on computer vision.
- et al. (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- et al. (2015). Improving object proposals with multi-thresholding straddling expansion. Proceedings of the IEEE conference on computer vision and pattern recognition.
- et al. (2014). BING: Binarized normed gradients for objectness estimation at 300fps. Proceedings of the IEEE conference on computer vision and pattern recognition.
- Histograms of oriented gradients for human detection. Proceedings of the IEEE conference on computer vision and pattern recognition.
- Pedestrian detection: An evaluation of the state of the art. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Structured forests for fast edge detection. Proceedings of the IEEE international conference on computer vision.
- The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision.
- LIBLINEAR: A library for large linear classification. Journal of Machine Learning Research.
- Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence.
- Efficient graph-based image segmentation. International Journal of Computer Vision.
- Gene selection for cancer classification using support vector machines. Machine Learning.
- What makes for effective detection proposals? IEEE Transactions on Pattern Analysis and Machine Intelligence.
- RIGOR: Reusing inference in graph cuts for generating object regions. Proceedings of the IEEE conference on computer vision and pattern recognition.