
1 Introduction

As a fundamental and challenging problem in computer vision, human detection has a wide range of applications, including visual surveillance, human-computer interaction, self-driving, and crowd counting. However, due to variations in illumination, partial occlusion, articulated body structure, and the complexity of backgrounds, many problems remain open in human detection.

Following the general detection framework composed of feature extraction and classifier design, most human detection methods concentrate on improving these two components. Oren et al. [1] first introduce the idea of machine learning to this task and propose to adopt wavelet features combined with a Support Vector Machine (SVM) for human detection. The VJ model [2] combines Haar wavelet features with a cascade detector architecture to improve detection efficiency. Other features, such as Local Binary Patterns (LBP) [3], edgelets [4] and Histograms of Oriented Gradients (HOG) [5], have also been chosen as descriptors to model the global properties of the human body. Recently, a real-time human detector named \(C^4\) [6, 7] has been proposed, which adopts the CENTRIST (CENsus TRansform hISTogram) visual feature [8] to capture the global contour information of the human body. The aforementioned methods have one thing in common: they all capture global features for detection. Although these features have been shown to perform well in some cases, when confronted with large pose variations and partial occlusion, global features may lead to unsuccessful detections.

Subsequently, part-based feature models have become more and more popular, because they are better suited to handling partial and inter-object occlusions and are flexible in modeling shape articulations in human detection. A pose-invariant descriptor for human detection is proposed through constructing a part-template tree model [9]. Felzenszwalb et al. [10, 11] propose a human detection method based on deformable part models (DPM), which are capable of coping with object pose changes: different parts of an object are represented by several higher-resolution part templates together with their spatial layout. These part-template based models are typically used for full-body human detection, which easily suffers from occlusions among individuals and from scenes in which people are not necessarily standing [12]. Hence, instead of full-body human detection, many researchers focus on the upper part of the human body. Especially in top-view surveillance scenes, the head-shoulder region is usually the last part to be occluded and is not as deformable as the entire body. The Omega-like shape has been shown to be a salient feature of the head-shoulder region [12–16]. Li et al. [12, 13] propose an effective head-shoulder detection method based on boosting local HOG features; experimental results in [13] indicate that the HOG feature performs better than Haar features [2] and the SIFT descriptor. Another edge-based feature similar to HOG, named Oriented Integration of Gradients (OIG), is introduced to describe subparts of the human head-shoulder in [14]. Julio et al. [15, 16] propose a graph-based segmentation model to estimate the head-shoulder contour.

Inspired by the above analysis, our method also adopts the Omega-like contour of the human head-shoulder, rather than the whole human body, as the shape cue for human detection. A matrix decomposition technique is introduced to obtain meaningful local curvelets of the Omega-like shape. Different from the methods introduced above, which lack further feature learning and analysis, we build on the HOG feature and construct a part-based Omega shape model that captures the intrinsic semantic local shape variations. The problem of part-based Omega-shape feature learning is thus converted into Nonnegative Matrix Factorization (NMF)-based local feature learning. Since NMF allows only additive, not subtractive, combinations, background clutter can be well suppressed or down-weighted.

Moreover, under different view angles, e.g., front view and side view, intra-class variations of human head-shoulder contours still exist. In order to effectively measure the similarities of Omega-shape samples with a multimodal distribution, we introduce distance metric learning into support vector machine (SVM) classification. With the learned distance metric, the local neighborhood property is preserved: examples within the same class stay close, while examples from different classes are separated by a large margin. Both intra-class compactness and inter-class separability are improved.

Therefore, in this paper, we focus on learning the Omega-like shape features for human detection. Specifically, the contributions of this work lie in the following two aspects:

  1.

    We convert the local Omega shape feature learning problem into a part-based semantic shape representation problem. Orthogonal Nonnegative Matrix Factorization (ONMF) is introduced to encode a shape dictionary, with each word representing a semantic part of the Omega-like shape, while simultaneously reducing the feature dimensionality. The learned features are robust to partial occlusion and background clutter.

  2.

    In order to cope with the intra-class multimodal problem of Omega shapes, we introduce distance metric learning into SVM classification, where object/non-object classification is performed in a learned Mahalanobis distance metric space. Intra-class compactness and inter-class separability are thereby guaranteed.

2 The Proposed Human Detection Method

We first give an overview of our method. During the off-line training stage, positive training samples are used to learn the part-based Omega shape bases via ONMF, and then negative training samples are included to learn a Mahalanobis distance metric with a neighborhood constraint. In the on-line detection stage, image patches at different scales are generated first. Second, after HOG feature extraction, part-based features are computed from the learned local shape bases. Third, an SVM classifier evaluates each input feature within the learned metric space. Figure 1 shows the framework of our proposed human detection method.

Fig. 1. Illustration of our proposed Omega shape feature learning-based human detection method

In the following subsections, we first introduce the construction of the ONMF-based Omega shape feature descriptor, then describe SVM classifier training with the learned distance metric, and finally present the on-line detection process.

2.1 ONMF-based Omega Shape Feature Learning

The HOG feature [5] has been verified to be effective for capturing Omega-shape contour information [13]. In our method, we also adopt HOG to describe the global Omega-shape feature. However, due to its cell-based computation, HOG inevitably includes some background noise and has a high dimensionality with redundant information. Moreover, HOG features are weak at modeling local Omega-shape variations.
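As a concrete illustration, the following Python sketch extracts a HOG descriptor for a head-shoulder window with scikit-image. The cell and block sizes are our assumptions, chosen so that a \(54\times 60\) window yields the 2592-D vector reported in Sect. 3.1; the paper does not state its exact HOG configuration.

```python
# A hedged sketch (not the paper's stated configuration): HOG extraction for
# a head-shoulder window with scikit-image. The cell/block sizes below are
# assumptions chosen so a 54x60 window yields the 2592-D vector of Sect. 3.1.
import numpy as np
from skimage.feature import hog

def extract_hog(patch):
    """patch: 2-D grayscale array of shape (60, 54) (rows x cols)."""
    return hog(
        patch,
        orientations=9,            # 9 unsigned orientation bins
        pixels_per_cell=(6, 6),    # 10 x 9 cells over the window
        cells_per_block=(2, 2),    # 9 x 8 overlapping blocks
        block_norm='L2-Hys',
        feature_vector=True,
    )

v = extract_hog(np.random.rand(60, 54))  # stand-in for a cropped window
assert v.shape == (2592,)                # 9 * 8 blocks * 4 cells * 9 bins
```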

Therefore, to alleviate the above problems, we perform further feature learning and dimensionality reduction on top of the HOG features. We present a simple but efficient Nonnegative Matrix Factorization (NMF)-based Omega shape feature learning method. NMF is a powerful tool for learning local features [18]. In this model, the construction of shape variations is viewed as a shape dictionary learning problem, with each shape word encoding a local semantic part of the shape. Generally, NMF is associated with the following constrained least-squares optimization problem:

$$\begin{aligned} \min \limits _{W,H} \Vert V-WH\Vert _2 \ \ \text {s.t.}\ W\ge 0, H\ge 0 \end{aligned}$$
(1)

In our method, \(V=(V_{x_1},V_{x_2},\ldots ,V_{x_n})\) is a nonnegative \(m \times n\) Omega shape matrix, in which each column \(\{V_{x_i}\}_{i=1}^n\) is the HOG feature vector of a positive training sample, and W is an \(m \times p\) matrix whose columns are basis images. Each column of H (\(p \times n\)) consists of the coefficients by which a sample is represented as a linear combination of basis images.

After W is learned according to Eq. (1), given an input test sample \(V_{x_t}\), we adopt its recovery coefficients as the corresponding NMF-based shape descriptor. The shape descriptor \(f_{x_t}\) is obtained by solving the following nonnegative least squares (NNLS) problem:

$$\begin{aligned} f_{x_t}=\mathop {\arg \min }\limits _{f} \Vert V_{x_t}-Wf\Vert _2 \ \ \text {s.t.}\ f\ge 0 \end{aligned}$$
(2)
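As a minimal sketch, Eq. (2) can be solved with an off-the-shelf NNLS routine; here we use SciPy's solver, with random stand-ins for W and \(V_{x_t}\) (the dimensions 2592 and 150 follow Sect. 3.1).

```python
# A minimal sketch of Eq. (2) using SciPy's NNLS solver; the random W and
# v_t below are stand-ins for the learned basis and a test HOG vector
# (dimensions 2592 and 150 follow Sect. 3.1).
import numpy as np
from scipy.optimize import nnls

def nmf_shape_descriptor(W, v_t):
    """Solve min_f ||v_t - W f||_2  s.t.  f >= 0."""
    f_t, _residual = nnls(W, v_t)
    return f_t

W = np.random.rand(2592, 150)
v_t = np.random.rand(2592)
f_t = nmf_shape_descriptor(W, v_t)
recovery = W @ f_t                 # the reconstruction W . f_t used in Fig. 2
```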

To verify an advantage of the NMF representation of local Omega features, i.e., its ability to selectively encode only the foreground of regions of interest and thereby reject unwanted background clutter or noise, we use the NMF bases, combined only additively, to reconstruct test samples with clutter or partial occlusion from their NNLS coefficients, i.e., \(W\cdot \mathbf {f}_{x_t}\). Figure 2(a) illustrates the decomposed Omega shape basis images. Our proposed NMF-based Omega shape model is capable of representing shape variations with a set of basis images denoting meaningful local curvelet features (shown in the \(5\times 10\) montages, highlighted in darker grey).

Fig. 2. Illustration of obtained Omega shape basis images and the recovery results of input samples with background clutter and partial occlusion. (a) Several examples of Omega shape basis images based on NMF; (b) reconstruction result of an input sample with background clutter; (c) reconstruction result of an input sample with partial occlusion.

Some recovery results for input samples with background clutter and partial occlusion are shown in Fig. 2(b) and (c). For better visualization, we perform edge extraction on the gray images before applying NMF. From the recovery results, we find that the recovered silhouette images preserve the Omega contour information: background clutter (Fig. 2(b)) is clearly suppressed and partial occlusions (Fig. 2(c)) are well handled.

The NMF-based shape descriptor \(f_{x_t}\) is usually obtained by iteratively solving the NNLS problem in Eq. (2), which is time-consuming. Another solution is the generalized least-squares formula \(H=(W^TW)^{-1}W^TV\), which is not effective for high-dimensional data. Many improved NMF-based methods have therefore been proposed. In our method, we adopt an orthogonal NMF (ONMF) algorithm, which constrains the basis matrix to be orthogonal; Eq. (1) is then rewritten as:

$$\begin{aligned} \min \limits _{W,H} \Vert V-WH\Vert _2 \\ \text {s.t.}\ W,H\ge 0, W^TW=I \nonumber \end{aligned}$$
(3)

The above optimization problem can be effectively solved based on the canonical metric of the Stiefel manifold [19]. The iterative update equations for W and H are given by [19]:

$$\begin{aligned} H=H\odot \frac{[W^TV]}{[W^TWH]} \end{aligned}$$
(4)
$$\begin{aligned} W=W\odot \frac{[VH^T]}{[WHV^TW]} \end{aligned}$$
(5)

where \(\odot \) denotes the Hadamard product and \(\frac{[.]}{[.]}\) represents element-wise division. Similar to [20], a hierarchical ONMF training mechanism is adopted in our method, allowing parts of Omega shapes to be represented at different scales for more robust classification. In the hierarchical structure, \(W_1\) and \(H_1\) in the first layer are randomly initialized, with the number of bases set to \(b_1\). The initial basis matrix of the next layer is obtained by stacking s copies of \(W_1\), i.e., \(\tilde{W_2}=[W_1,...,W_1]\). Using \(\tilde{W_2}\) and a randomly initialized encoding \(\tilde{H_2}\), both matrices are updated according to Eqs. (4) and (5), yielding a basis \(W_2\) and an encoding \(H_2\). The hierarchy grows in this way until the number of bases satisfies the stopping condition. The final basis is \(W=[W_1,...,W_L]\), where L denotes the number of layers in the ONMF hierarchy. Thus, for a test sample \(V_{x_t}\), the shape descriptor is obtained by a single linear transformation, \(f_{x_t}=W^TV_{x_t}\), with no iterative procedure required. A sketch of this training scheme is given below.
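The following Python sketch is a simplified rendering of the scheme under the parameter choices of Sect. 3.1 (\(b_1=10\), \(s=2\), \(L=4\), giving \(10+20+40+80=150\) bases); the iteration count and the small constant added to the denominators are our assumptions.

```python
# A simplified sketch of hierarchical ONMF training with the multiplicative
# updates of Eqs. (4) and (5); b1=10, s=2, L=4 follow Sect. 3.1 (giving
# 10+20+40+80 = 150 bases), while the iteration count and the small EPS
# added to the denominators are our assumptions.
import numpy as np

EPS = 1e-9

def onmf(V, W, H, n_iter=200):
    """Alternate the element-wise updates of Eqs. (4) and (5)."""
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + EPS)       # Eq. (4)
        W *= (V @ H.T) / (W @ H @ V.T @ W + EPS)   # Eq. (5)
    return W, H

def hierarchical_onmf(V, b1=10, s=2, L=4):
    m, n = V.shape
    W = np.random.rand(m, b1)                      # random first-layer init
    H = np.random.rand(b1, n)
    W, H = onmf(V, W, H)
    bases = [W]
    for _ in range(1, L):
        W = np.hstack([W] * s)                     # copy previous basis s times
        H = np.random.rand(W.shape[1], n)          # fresh random encoding
        W, H = onmf(V, W, H)
        bases.append(W)
    return np.hstack(bases)                        # W = [W_1, ..., W_L]

# descriptor for a test sample: f = W^T v (no iterations needed)
# W = hierarchical_onmf(V_train); f_t = W.T @ v_t
```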

2.2 Improved SVM Training with Learned Distance Metric

After the local Omega-shape descriptors are obtained, the next step is to train an SVM classifier using the label information. Owing to its strong performance, the SVM [21] has been widely adopted in classification and pattern recognition tasks. For better separation, an SVM usually projects the input data into a higher-dimensional feature space through a kernel function, most of which measure the similarity between pairs of features using their Euclidean inner product or Euclidean distance. However, the Euclidean measure treats all feature dimensions equally and ignores the correlations between them. As a result, the Euclidean distance is incapable of fully reflecting the intrinsic affinity relationships between samples [22].

To address this issue, distance metric learning has emerged as a useful tool in recent years [23–25]. Motivated by this observation, we introduce distance metric learning into SVM classifier training, adopting the large margin nearest neighbor (LMNN) method [25]. Different from other distance metric learning methods, which require all samples with the same label to be close to each other, LMNN only requires that the k nearest neighbors of each sample belong to the same class, while samples from different classes are separated by a large margin. Thus, LMNN can effectively handle metric learning when the training samples have multimodal distributions. Intra-class variations of Omega shapes are shown in Fig. 3(a), with each row corresponding to one kind of Omega shape. Owing to different view angles (different rows in Fig. 3(a)), even though all samples belong to the Omega-shape class, there are obvious differences between shapes generated from different view angles. The goal of distance metric learning is therefore to make shapes from the same row move closer in the learned feature space while shapes from different rows move apart, which is quite reasonable in real applications. This objective coincides with the key idea of LMNN, illustrated in Fig. 3(b): the distance metric is optimized so that (1) the \(k=3\) nearest neighbors of an input lie within a smaller radius after training, and (2) differently labeled inputs lie outside this radius by a finite margin.

Fig. 3. Illustration of Omega shape intra-class variations and schematic explanation of LMNN [25]. (a) The learned ONMF-based Omega shape basis images are shown in 3 \(\times \) 4 montages, with different rows corresponding to different view angles; (b) schematic illustration of one input's neighborhood before training (left) versus after training (right) in LMNN. Example images are placed next to their corresponding nodes for better illustration.

After metric learning with LMNN [25], we obtain a discriminative distance metric denoted \(\mathbf {M}=\mathbf {L}^T\mathbf {L}\). For two feature vectors, the resulting Mahalanobis distance measure is:

$$\begin{aligned} D_M(f_{x_i},f_{x_j})=\Vert \mathbf {L}(f_{x_i}-f_{x_j})\Vert ^2_2=(f_{x_i}-f_{x_j})^T\mathbf {M}(f_{x_i}-f_{x_j}) \end{aligned}$$
(6)

where \(f_{x_i}\) and \(f_{x_j}\) denote the ONMF-based Omega shape descriptors.
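For illustration, a minimal sketch of Eq. (6) follows, showing that the two forms coincide once \(\mathbf {M}=\mathbf {L}^T\mathbf {L}\); the transform \(\mathbf {L}\) is assumed to come from an LMNN implementation.

```python
# A minimal sketch of Eq. (6): with M = L^T L, both forms below return the
# same squared distance. L is assumed to be produced by an LMNN solver.
import numpy as np

def mahalanobis_sq(L, f_i, f_j):
    d = f_i - f_j
    return float(np.sum((L @ d) ** 2))    # ||L(f_i - f_j)||_2^2

def mahalanobis_sq_M(M, f_i, f_j):
    d = f_i - f_j
    return float(d @ M @ d)               # (f_i - f_j)^T M (f_i - f_j)

# sanity check on random inputs
L = np.random.rand(150, 150)
f_i, f_j = np.random.rand(150), np.random.rand(150)
assert np.isclose(mahalanobis_sq(L, f_i, f_j),
                  mahalanobis_sq_M(L.T @ L, f_i, f_j))
```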

Assume a set of training samples \(\{f_{x_i}\}_{i=1}^n\) with labels \(y_i\in \{-1,1\}\) is given. Training an SVM classifier amounts to finding an optimal hyperplane that maximizes the margin between the two classes. The corresponding Lagrangian dual problem is formulated as [21]:

$$\begin{aligned} \max \limits _{\alpha } \sum _{i=1}^n\alpha _i-\frac{1}{2}\sum _{i=1}^n\sum _{j=1}^n\alpha _i\alpha _jy_iy_j(\varphi (f_{x_i})\cdot \varphi (f_{x_j})) \ \ \text {s.t.}\ 0\le \alpha _i\le C,\ \sum _{i=1}^n\alpha _iy_i=0 \end{aligned}$$
(7)

where \(\varphi \) is a kernel feature mapping to a higher dimensional feature space, C is the regularization parameter and \((\cdot )\) denotes an inner product operator.

Without explicitly computing the feature mapping \(\varphi \), the kernel function \(k(f_{x_i},f_{x_j})=\varphi (f_{x_i})\cdot \varphi (f_{x_j})\) is introduced to handle nonlinearly separable cases. One of the most commonly used kernels is the radial basis function (RBF) kernel:

$$\begin{aligned} k_{rbf}(f_{x_i},f_{x_j})=\text {exp}(-\frac{D^2(f_{x_i},f_{x_j})}{\sigma ^2}) \end{aligned}$$
(8)

where \(\sigma \) is the bandwidth parameter of RBF kernel and \(D(\cdot )\) is a distance measure.

A standard choice is the Euclidean distance, for which \(D_E(f_{x_i},f_{x_j})=(f_{x_i}-f_{x_j})^T(f_{x_i}-f_{x_j})\). In our method, we exploit the statistical regularities estimated from the training data and introduce the learned Mahalanobis metric instead: we directly replace \(D(\cdot )\) with \(D_M\) (see Eq. (6)) in the kernel function of Eq. (8). The optimization problem in Eq. (7) is then solved using quadratic programming. A sketch of this metric kernel is given below.
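Since the squared metric distance satisfies \(\Vert \mathbf {L}f_{x_i}-\mathbf {L}f_{x_j}\Vert ^2_2=(f_{x_i}-f_{x_j})^T\mathbf {M}(f_{x_i}-f_{x_j})\), mapping features by \(\mathbf {L}\) once reduces the metric kernel to a standard RBF kernel on the transformed features, which can then be fed to an SVM as a precomputed kernel. The use of scikit-learn's SVC and the parameter values below are our assumptions, not the paper's implementation.

```python
# A minimal sketch (not the authors' implementation): RBF kernel with the
# learned Mahalanobis metric, trained as a precomputed-kernel SVM.
import numpy as np
from sklearn.svm import SVC

def metric_rbf_gram(L, X, Y, sigma):
    """K[i, j] = exp(-||L x_i - L y_j||^2 / sigma^2), cf. Eqs. (6) and (8)."""
    XL, YL = X @ L.T, Y @ L.T
    d2 = ((XL[:, None, :] - YL[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / sigma ** 2)

# usage sketch; X_train (n x 150 ONMF descriptors), y_train in {-1, +1},
# L from LMNN, and sigma, C are assumed to be given:
# K_train = metric_rbf_gram(L, X_train, X_train, sigma=1.0)
# clf = SVC(C=1.0, kernel='precomputed').fit(K_train, y_train)
# scores = clf.decision_function(metric_rbf_gram(L, X_test, X_train, sigma=1.0))
```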

2.3 On-Line Human Detection

For an input image, we adopt a variable-scale search mechanism to obtain test image patches, and extract the ONMF-based Omega shape descriptor from each patch. SVM classification is then applied within the learned distance metric space, and the detected candidates are fused to generate the final detection results. Specifically, the detection procedure is as follows (a sketch of the full pipeline is given after this list):

  step 1.

    Generating Test Image Patches. We fix the size of the sliding window and rescale the input image from 0.8 to 1.2 in steps of 0.1. Test image patches at different scales are generated in this step.

  step 2.

    Extracting ONMF-based Omega Shape Descriptors. For each test image patch \(x_{t}\), the HOG feature vector \(V_{x_t}\) is first extracted. Then a linear transformation yields the local part-based shape descriptor with reduced dimensionality, i.e., \(f_{x_t}=W^TV_{x_t}\), where W is the trained ONMF shape basis matrix.

  step 3.

    Applying SVM Classification. The descriptor \(f_{x_t}\) from the previous step is classified according to the output of the SVM classifier in the learned Mahalanobis metric space: \(y(f_{x_t})=\text {sign}(\sum _{i=1}^n\alpha _iy_ik(f_{x_i},f_{x_t})+b)\), where \(y(\cdot )\) is the decision function, \(\alpha _i\) are the support vector coefficients, \(k(\cdot )\) is the kernel function measured in the learned metric space and b is a bias term.

  step 4.

    Fusing Detection Results. Because multiple scales are searched, detected bounding boxes for one object may overlap, include or intersect one another, so a subsequent fusion step is needed. Candidates from step 3 are grouped according to their location and size similarities; bounding boxes in the same group are fused into one, and those not satisfying the threshold condition are discarded.
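The following sketch strings steps 1-4 together under stated assumptions: the stride, the greedy overlap grouping in group_boxes (a simple stand-in for the location/size-similarity fusion above) and the score_fn hook (wrapping steps 2-3) are ours, not the paper's.

```python
# A sketch of the on-line pipeline (steps 1-4) under stated assumptions:
# the stride, the greedy fusion in group_boxes and the score_fn hook
# (wrapping HOG extraction, f = W^T v and the SVM decision) are ours.
import numpy as np
from skimage.transform import rescale

WIN_H, WIN_W, STRIDE = 60, 54, 8          # window from Sect. 3.1; stride assumed

def iou(a, b):
    r1, c1, h1, w1 = a
    r2, c2, h2, w2 = b
    ih = max(0.0, min(r1 + h1, r2 + h2) - max(r1, r2))
    iw = max(0.0, min(c1 + w1, c2 + w2) - max(c1, c2))
    inter = ih * iw
    return inter / (h1 * w1 + h2 * w2 - inter + 1e-9)

def group_boxes(boxes, min_iou=0.3):
    """Greedy stand-in for step 4: merge boxes that overlap enough."""
    fused = []
    for b in boxes:
        for i, g in enumerate(fused):
            if iou(b, g) > min_iou:       # same object: average the two boxes
                fused[i] = tuple((x + y) / 2 for x, y in zip(g, b))
                break
        else:
            fused.append(b)
    return fused

def detect(frame, score_fn, thresh=0.0):
    boxes = []
    for s in (0.8, 0.9, 1.0, 1.1, 1.2):   # step 1: fixed window, varying scale
        img = rescale(frame, s)
        for r in range(0, img.shape[0] - WIN_H + 1, STRIDE):
            for c in range(0, img.shape[1] - WIN_W + 1, STRIDE):
                patch = img[r:r + WIN_H, c:c + WIN_W]
                if score_fn(patch) > thresh:          # steps 2-3
                    boxes.append((r / s, c / s, WIN_H / s, WIN_W / s))
    return group_boxes(boxes)             # step 4: fuse candidates
```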

3 Experimental Results

3.1 Experimental Setups

In order to evaluate the proposed human detection method, we have conducted a set of experiments on several challenging clips from the CAVIAR dataset [26]. These clips show people walking along a corridor, browsing, and entering and leaving stores in a shopping center, and contain 1500 frames on average. Several challenging factors are present: partial occlusion, background disturbance, pose variation, etc. Some sample images from the CAVIAR dataset are shown in Fig. 4. Silhouettes of 600 human head-shoulders under different view angles with clean backgrounds are used to train the ONMF-based Omega shape basis matrix. For SVM classifier training, we randomly choose one clip of the CAVIAR dataset to generate 6085 training samples. The 3085 positive images contain human head-shoulders automatically cropped according to the ground-truth annotations, while the other 3000 negative samples are human-free. All training samples are divided into 10 groups for cross-validation: in each fold, 9 groups are used for training and the remaining one for testing.

Fig. 4. Some sample images from the CAVIAR dataset.

Fig. 5. Performance comparison of four feature extraction methods. The horizontal axis lists the four methods, from left to right: NMF, HOG, CENTRIST and Ours; the left and right vertical axes denote detection error rate and feature dimensionality, respectively.

The proposed detection algorithm is implemented in MATLAB on a workstation with an Intel Core 3.6 GHz CPU and 4 GB RAM. The parameter configuration of our experiments is as follows. We extract a 2592-D HOG feature; the size of the sliding window is \(54\times 60\). The number of bases in the first layer of the ONMF hierarchy (\(b_1\)), the copy parameter s and the number of layers L are set to 10, 2 and 4, respectively, leading to a 150-D ONMF Omega shape descriptor. Three nearest neighbors are considered in the LMNN distance metric learning algorithm, and a linear kernel function is adopted for the SVM classifier. In order to demonstrate the effectiveness of our proposed method, we compare it with other representative methods, referred to as ONMF150 [20], HOG-SVM [5], C4 [7] and LMNN-R [27]. To quantitatively evaluate these methods, we adopt FPPI (False Positives Per Image) versus MR (Miss Rate) curves as the evaluation criterion.

Fig. 6. Example detection results on the CAVIAR dataset. Results by ours, HOG-SVM, ONMF150, C4 and LMNN-R are listed by column; each row represents one case from the CAVIAR dataset.

Fig. 7. Quantitative comparison of ONMF, HOG-SVM, C4, LMNN-R and our method on six typical sequences from the CAVIAR dataset.

3.2 Empirical Results

In this subsection, we present the effect of the key components of our method as well as comparisons with other state-of-the-art methods.

In order to demonstrate the effectiveness of our ONMF-based Omega shape feature learning, we compare it with three other feature extraction methods: NMF, HOG and CENTRIST [8]. Figure 5 shows the performance comparison among the four methods. Cross-validation is adopted to obtain the average error rate (miss rate): the training set is randomly divided into five groups, with four used for training and one for testing. A linear SVM is used as the classifier for all four kinds of input. Although the error rate of our method is slightly higher than that of CENTRIST, CENTRIST produces a remarkably high-dimensional feature that is ill-suited to classification. Taking error rate and feature dimensionality into consideration simultaneously, our feature learning method therefore outperforms the others.

We compare our proposed method with other representative methods, namely ONMF150 [20], HOG-SVM [5], C4 [7] and LMNN-R [27]. Figure 6 lists some example detection results on the CAVIAR dataset, with each detected human marked by a white bounding box. The results show that the other four methods all produce some misclassifications, whereas our method detects people accurately even under partial occlusion. Figure 7 shows the quantitative comparison of the five methods over six typical sequences from the CAVIAR dataset. Our method achieves the best detection performance, with lower FPPI and miss rate on most video sequences.

4 Conclusion

In this paper, we have proposed a human detection method based on part-based human Omega shape features. We have empirically shown that the HOG-ONMF shape feature descriptor captures local details and is robust to partial occlusion and background corruption. The improved SVM classifier operates in a learned metric space, improving both the intra-class compactness and inter-class separability of the training samples. We compared our method with four competing methods on six challenging sequences from the CAVIAR dataset; both qualitative and quantitative experimental results verify the effectiveness and robustness of our method.