1 Introduction

Pedestrian detection has gained great attention in the past few decades due to its important role in many computer vision applications [1,2,3,4]. Head-shoulder detection is an active approach to the pedestrian detection task [5,6,7]. There are two main motivations for choosing the head-shoulder part as the target. First, head-shoulder detection is more reliable than detection of the full pedestrian body. In complex and dynamic scenes, challenging factors such as posture change and occlusion seriously distort the appearance of the full body, whereas the head and shoulders remain highly consistent across different postures. When a pedestrian is occluded, the head and shoulders are also the parts most likely to remain detectable. Second, head detection alone is difficult due to the diversity of head shapes, hair styles and hair colors; a larger part composed of the head and shoulders, by contrast, provides more valuable features.

Many works have exploited the head-shoulder appearance for pedestrian detection [6,7,8]. Conventionally, hand-crafted features are combined with machine learning classifiers to perform detection. The Histogram of Oriented Gradients (HOG), introduced by Dalal and Triggs [9] for pedestrian detection, is a typical feature that describes the shape and edge information of an object. Later, Li et al. [6] applied HOG features to the detection of head and shoulders. The Local Binary Pattern (LBP) is another widely used descriptor in object detection. Wang et al. [10] improved pedestrian detection results by combining HOG and LBP features, and Zeng and Ma [7] presented a discriminative multi-level HOG-LBP feature for head-shoulder detection. Besides the commonly used support vector machine (SVM), Viola-Jones and AdaBoost classifiers are often employed for their fast computation [5, 11]. However, traditional methods cannot fully capture the rich information conveyed by images and generalize poorly across application environments.

Recently, deep learning methods have been actively applied in diverse tasks for their ability to learn features directly from images. Many deep learning models, commonly based on convolutional neural networks (CNN), have been presented for object detection [12,13,14,15,16]. Among them, R-CNN [12] extracted features for each candidate region independently, without any shared computation. SPPnet [13] introduced adaptively-sized pooling on shared convolutional feature maps for efficient region-based object detection. Afterwards, end-to-end detector training on shared convolutional features was achieved by Fast R-CNN [14], and Faster R-CNN [15] incorporated region proposals to form an efficient unified network. Inspired by these methods, we exploit the powerful representation learning ability of deep models for the head-shoulder detection task.

In order to extract reliable head-shoulder characteristics for detecting pedestrians in complex and challenging environments, this paper introduces a head-shoulder detection method based on a convolutional neural network. The proposed model combines candidate region extraction, head-shoulder classification and location prediction through parameter sharing. We propose a novel structure-sensitive neural network method that embeds head-shoulder structure information, adding translation variability to better predict the head-shoulder location. Specifically, we construct structure-sensitive convolutional feature maps behind the shared convolutional layers and use a structure-sensitive pooling method to integrate the location information of the head and shoulders into the feature of each candidate region. In this way, pedestrians are localized more accurately, and the Intersection-over-Union (IoU) between the detection bounding box and the ground-truth box is higher. Meanwhile, to improve the effectiveness of our model, we pre-train the shared convolutional layers with a triplet loss and use them to initialize the head-shoulder detection network. The experimental results demonstrate that the performance of our method is close to that of Faster R-CNN [15] while running faster and achieving higher IoU. Moreover, when the number of Regions-of-Interest (RoIs) increases, the additional detection time is negligible.

2 Proposed Method

The framework of our method is illustrated in Fig. 1. The lower part of Fig. 1 shows the detection process. First, the high-level convolutional features of the image are extracted via the shared convolutional subnet. Second, the candidate image regions that may contain head-shoulders are generated. Third, structure-sensitive pooling is performed for each candidate region. Finally, with the pooled feature, the predictors classify each candidate and predict its accurate location. Furthermore, to improve the effectiveness of our model, we pre-train the shared convolutional subnet with a triplet loss, as shown in the upper part of Fig. 1.

Fig. 1. The framework of the proposed method for head-shoulder detection.

2.1 Candidate Region Extraction

We utilize a candidate region extraction method based on a sliding window. The sliding window enumerates image patches, each of which is classified as a foreground object or background; only the foreground patches are considered candidate regions and processed further. The image patches are generated based on the feature maps output by the shared convolutional subnet. For simplicity, the size of the sliding window is fixed to \(3 \times 3\) and the stride is set to 1. Meanwhile, in order to handle objects of different sizes, we introduce a multi-scale mapping strategy: a window on the feature map corresponds to multiple image regions of different sizes on the original image. Here we use combinations of different scales and aspect ratios. According to the shape characteristics of head-shoulders, we define three scales \(\{100^2\), \(150^2\), \(200^2\}\) and three aspect ratios \(\{3:4, 4:4, 5:4\}\). Therefore, each sliding window maps to nine image regions of different sizes, as illustrated in Fig. 2 and sketched in code below. We use the softmax classifier to pick out all foreground regions as the candidates.

Fig. 2. The mapping diagram between the sliding window and original image regions.
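To make the multi-scale mapping concrete, the following sketch enumerates the nine image regions generated for one sliding-window position. Interpreting each scale \(s^2\) as a box area and using a feature-map stride of 16 pixels are our assumptions for illustration; the paper specifies only the scales and aspect ratios.

```python
# A minimal sketch of the multi-scale mapping, assuming scale^2 is the
# box area and the shared subnet downsamples the image by a factor of 16
# (both assumptions; the paper states only the scales and aspect ratios).
import numpy as np

SCALES = [100, 150, 200]                   # box area is scale^2 pixels
ASPECT_RATIOS = [(3, 4), (4, 4), (5, 4)]   # width : height

def regions_for_window(cx, cy):
    """Return the nine image regions (x0, y0, x1, y1) mapped from one
    sliding-window position centered at (cx, cy) on the original image."""
    boxes = []
    for s in SCALES:
        for rw, rh in ASPECT_RATIOS:
            # keep the area s*s while respecting the width:height ratio
            w = s * np.sqrt(rw / rh)
            h = s * np.sqrt(rh / rw)
            boxes.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return np.array(boxes)

# Sliding with stride 1 on the feature map corresponds to steps of the
# assumed feature stride (16 pixels) on a 640x360 input image.
FEAT_STRIDE = 16
all_regions = [regions_for_window(x, y)
               for y in range(0, 360, FEAT_STRIDE)
               for x in range(0, 640, FEAT_STRIDE)]
```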

2.2 Structure-Sensitive RoI Pooling

The head-shoulder appearance of a pedestrian comprises three characteristic blocks: the head, the left shoulder and the right shoulder. These three blocks play the main roles in head-shoulder detection. The proposed algorithm uses the convolutional network to extract the feature of the head-shoulder appearance and integrates the location information of the head and shoulders via structure-sensitive pooling. As shown in Fig. 3, according to the shape and positions of a pedestrian's head and shoulders, the region is divided into upper and lower parts: the head occupies the middle of the upper part, while the left and right shoulders occupy the lower left and lower right corners, respectively. For a candidate region of size \(w \times h\) whose top left corner is at \((x_0, y_0)\), the head block is defined as \(\{x_0+w/4\le x<x_0+3w/4, y_0\le y<y_0+h/2\}\), the left shoulder block is \(\{x_0\le x<x_0+w/2,y_0+h/2\le y<y_0+h\}\), and the right shoulder block is \(\{x_0+w/2\le x<x_0+w,y_0+h/2\le y<y_0+h\}\). These definitions translate directly into the short sketch below.
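The block definitions above can be transcribed as follows; integer division is our assumed discretization for pixel coordinates.

```python
# A direct transcription of the three characteristic blocks, for a
# candidate region with top left corner (x0, y0) and size w x h.
# Each block is a half-open box (x_start, y_start, x_end, y_end);
# integer division is an assumed discretization for pixel grids.
def head_shoulder_blocks(x0, y0, w, h):
    head           = (x0 + w // 4, y0,          x0 + 3 * w // 4, y0 + h // 2)
    left_shoulder  = (x0,          y0 + h // 2, x0 + w // 2,     y0 + h)
    right_shoulder = (x0 + w // 2, y0 + h // 2, x0 + w,          y0 + h)
    return [head, left_shoulder, right_shoulder]

# Example: a 100 x 120 RoI at the origin gives
# [(25, 0, 75, 60), (0, 60, 50, 120), (50, 60, 100, 120)]
print(head_shoulder_blocks(0, 0, 100, 120))
```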

Fig. 3. Division diagram of pedestrian's head and shoulders.

Fig. 4. Illustration of the structure-sensitive pooling operation.

Fig. 5. The performance under different values of k.

A convolutional layer is used to output \(3k\) feature maps, with \(k\) feature maps for each characteristic block. The pooling for each block is then formulated as:

$$\begin{aligned} r(i|\theta )=\sum _{{(x,y)\in R(i)}, {ik\le j<(i+1)k}}z_j(x_0+x,y_0+y)/n, \end{aligned}$$
(1)

where \(r(i|\theta )\) denotes the pooled feature for the ith (\(i=0,1,2\)) characteristic block, R(i) denotes the corresponding region of the ith characteristic block in the given RoI, \(z_j\) denotes the jth feature map assigned to the ith block, \((x_0,y_0)\) refers to the coordinates of the top left corner of the RoI, and n represents the number of pixels in R(i). The structure-sensitive pooling operation is illustrated in Fig. 4, where each color in the head-shoulder structure-sensitive feature maps represents a characteristic block. The output feature vector is then fed into the softmax classifier for head-shoulder classification.
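To make Eq. (1) concrete, the following NumPy sketch pools one RoI; expressing the RoI in feature-map coordinates is our assumption, since the paper does not specify the coordinate frame.

```python
# A sketch of Eq. (1): maps [i*k, (i+1)*k) of the (3k, H, W) tensor z are
# assigned to characteristic block i, and each pooled value sums over
# those k maps and the block's pixels, divided by the pixel count n.
import numpy as np

def structure_sensitive_pool(z, roi, k):
    """z: (3k, H, W) structure-sensitive feature maps.
    roi: (x0, y0, w, h), assumed to be in feature-map coordinates.
    Returns the 3-dimensional pooled feature r."""
    x0, y0, w, h = roi
    blocks = [                                  # relative to the RoI
        (w // 4, 0,      3 * w // 4, h // 2),   # head
        (0,      h // 2, w // 2,     h),        # left shoulder
        (w // 2, h // 2, w,          h),        # right shoulder
    ]
    r = np.empty(3)
    for i, (bx0, by0, bx1, by1) in enumerate(blocks):
        region = z[i * k:(i + 1) * k,
                   y0 + by0:y0 + by1,
                   x0 + bx0:x0 + bx1]
        n = (by1 - by0) * (bx1 - bx0)           # pixels in R(i)
        r[i] = region.sum() / n
    return r
```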

The parameter k has an important impact on the performance of the proposed method. We have conducted experiments to analyze it and the experimental results are reported in Fig. 5. It can be observed that our method performs the best when \(k=32\).

2.3 Multi-task Loss with Classification and Localization

Because object classification and location prediction are closely related, and to reduce computational complexity, we adopt multi-task learning. As shown in Fig. 1, the high-level image features are extracted only once and are then used by the object classifier and the location regressor concurrently.

For each RoI, the classifier outputs its probability of belonging to each category. We define only two categories: head-shoulder and non-head-shoulder. We denote by \(p_0\) the probability of the non-head-shoulder class and by \(p_1\) that of the head-shoulder class, so the output of the classifier is \(P=({p_0},{p_1})\). With the softmax classifier [14], the loss function is formulated as:

$$\begin{aligned} L_{cls}(P,u)=-\log (p_u) \end{aligned}$$
(2)

where u represents the true category of the RoI: \(u=1\) indicates head-shoulder and \(u=0\) otherwise. If a RoI belongs to the head-shoulder class, the location predictor outputs the relative location information within the RoI:

$$\begin{aligned} {\left\{ \begin{array}{ll} t_x=gt_x/R_w,\\ t_y=gt_y/R_h,\\ t_w=gt_w/R_w,\\ t_h=gt_h/R_h, \end{array}\right. } \end{aligned}$$
(3)

where \(R_w\) and \(R_h\) refer to the width and height of the RoI, and \(gt_x\), \(gt_y\), \(gt_w\) and \(gt_h\) represent the coordinates of the upper left corner, the width and the height of the true target rectangle within the RoI. To predict these target values, an appropriate regression function is needed; we select the smooth \(L_1\) function, which is insensitive to outliers. The regression error of the target location is the cumulative error over these target values:

$$\begin{aligned} \mathrm{smooth}_{L_1}(x) = {\left\{ \begin{array}{ll} 0.5x^2, &{}\text {if } |x|< 1,\\ |x|-0.5, &{}\text {otherwise}, \end{array}\right. } \end{aligned}$$
(4)
$$\begin{aligned} L_{loc}(t,v)= \sum _{i\in \{x,y,w,h\}}\mathrm{smooth}_{L_1}(t_i-v_i), \end{aligned}$$
(5)

where t refers to the true output, and v is the predicted output. Thus the multi-task loss function of object classification and location regression is:

$$\begin{aligned} L(P,u,t,v)=L_{cls} (P,u)+\gamma [u=1] L_{loc} (t,v) \end{aligned}$$
(6)

where \(\gamma \) is used to balance the weights between the classification loss and the location regression loss, and \([u=1]\) is 1 when \(u=1\) and 0 otherwise, which indicates that the location regression loss is calculated only if the true category of the RoI is head-shoulder.
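For illustration, Eqs. (2)-(6) can be put together in a short PyTorch sketch; averaging over the batch and the default value of \(\gamma \) are our assumptions, since the paper specifies only the per-RoI form.

```python
# A sketch of the multi-task loss in Eqs. (2)-(6). Averaging over the
# batch and gamma=1.0 are assumptions; the paper gives the per-RoI form.
import torch
import torch.nn.functional as F

def smooth_l1(x):
    # Eq. (4): quadratic near zero, linear elsewhere
    absx = x.abs()
    return torch.where(absx < 1, 0.5 * x ** 2, absx - 0.5)

def multitask_loss(cls_scores, loc_pred, u, t, gamma=1.0):
    """cls_scores: (N, 2) raw class scores; loc_pred: (N, 4) predicted
    (t_x, t_y, t_w, t_h); u: (N,) true labels (1 = head-shoulder);
    t: (N, 4) regression targets from Eq. (3)."""
    l_cls = F.cross_entropy(cls_scores, u)            # Eq. (2), softmax + NLL
    per_roi = smooth_l1(loc_pred - t).sum(dim=1)      # Eq. (5)
    l_loc = (per_roi * (u == 1).float()).mean()       # [u = 1] indicator
    return l_cls + gamma * l_loc                      # Eq. (6)
```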

2.4 Pre-training with Triplet Loss

In order to improve the performance of our head-shoulder detection network, we pre-train the shared convolutional subnet to strengthen its representation ability on head-shoulders; the resulting parameters are used to initialize the detection network, as illustrated in Fig. 1. For pre-training, the shared subnet is appended with a fully connected layer, a normalization layer and a triplet loss layer, and this network is denoted as the pre-model.

The training data are prepared in triplets \((x_i^a,x_i^p,x_i^n)\), \(i=1,\ldots ,N\), with an anchor image \(x_i^a\) (head-shoulder), a positive sample \(x_i^p\) (head-shoulder) and a negative sample \(x_i^n\) (non head-shoulder). The triplet loss aims to minimize the distance between the anchor and the positive sample while maximizing the distance between the anchor and the negative sample. Similar to [17], the triplet loss function is formulated as:

$$\begin{aligned} \sum \nolimits _i\left[ \,\Vert f(x_i^a)-f(x_i^p)\Vert _2^2-\Vert f(x_i^a)-f(x_i^n)\Vert _2^2+\alpha \,\right] _{+}, \end{aligned}$$
(7)

where f(x) is the output feature of image x, \(\alpha \) is an enforced margin between positive and negative pairs, and \([z]_{+}\) equals z when \(z>0\) and 0 otherwise.
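A minimal PyTorch sketch of Eq. (7) is given below; the margin value \(\alpha =0.2\) is an assumed setting, not taken from the paper.

```python
# A sketch of the triplet loss in Eq. (7), assuming the embeddings f(x)
# are already L2-normalized by the pre-model's normalization layer.
import torch

def triplet_loss(f_a, f_p, f_n, alpha=0.2):
    """f_a, f_p, f_n: (N, D) embeddings of anchor (head-shoulder),
    positive (head-shoulder) and negative (non-head-shoulder) images.
    alpha is the margin; 0.2 is an assumed value."""
    d_pos = (f_a - f_p).pow(2).sum(dim=1)   # ||f(x_a) - f(x_p)||_2^2
    d_neg = (f_a - f_n).pow(2).sum(dim=1)   # ||f(x_a) - f(x_n)||_2^2
    return torch.clamp(d_pos - d_neg + alpha, min=0).sum()  # [.]_+ hinge
```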

3 Experiments

Dataset. In camera views with large tilt angles, pedestrians' heads and shoulders remain visible even when pedestrians occlude one another, whereas in views with small tilt angles the heads and shoulders of occluded pedestrians are often invisible. Head-shoulder detection is therefore better suited to camera views with large tilt angles. However, we have not found a public dataset of this kind, so we collected our own dataset to verify the effectiveness of the proposed method. First, we shot pedestrian videos in different scenarios such as stations, markets and campuses. Then, in selected video frames, each pedestrian's head-shoulder part was marked with a rectangular bounding box, and the coordinates of its top left and lower right corners were saved. The collected dataset contains six scenes with a total of 5,196 images, each of size \(640 \times 360\), divided into training and test sets. The training set contains 4,196 images with a total of 38,246 positive samples, while the test set contains 1,000 images with a total of 12,831 positive samples.

Network details. The shared convolutional subnet (conv 1–5) has the same structure as the first five convolutional layers of Fast R-CNN [14]. The network structure for candidate region extraction, similar to the Region Proposal Network [15], includes a conv layer with kernel \(3\times 3\), stride 1 and 512 feature maps, a conv layer with kernel \(1\times 1\), stride 1 and 18 feature maps, and a softmax layer. The network structure for object classification contains a conv layer with kernel \(1\times 1\) and stride 1, a structure-sensitive pooling layer and a softmax layer, while the structure for location prediction contains a conv layer with kernel \(1\times 1\), stride 1 and 24k feature maps, a structure-sensitive pooling layer and an output layer with 8 neurons.
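For reference, the candidate-region extraction head described above could be sketched in PyTorch as follows; the padding of the \(3\times 3\) conv, the 512-channel input from conv5, and the 2-class-by-9-region channel layout are assumptions based on typical RPN-style designs.

```python
# A sketch of the candidate-region extraction head: 3x3 conv (512 maps),
# 1x1 conv (18 maps = 2 classes x 9 regions per window), then a softmax.
# Padding, input channels and channel layout are assumptions.
import torch
import torch.nn as nn

class CandidateRegionHead(nn.Module):
    def __init__(self, in_channels=512):
        super().__init__()
        self.conv = nn.Conv2d(in_channels, 512, kernel_size=3,
                              stride=1, padding=1)
        self.score = nn.Conv2d(512, 18, kernel_size=1, stride=1)

    def forward(self, feat):
        x = torch.relu(self.conv(feat))
        s = self.score(x)                  # (N, 18, H, W)
        n, _, h, w = s.shape
        s = s.view(n, 2, 9, h, w)          # 2 classes x 9 mapped regions
        return torch.softmax(s, dim=1)     # foreground/background probs
```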

We conduct extensive experiments to verify the proposed head-shoulder detection network model on the test dataset.

3.1 Experimental Results

Some representative pedestrian detection results of our method under different scenarios are illustrated in Fig. 6. It can be observed that our method performs robustly in different scenes under various background and illumination conditions. Even in cases where pedestrians are occluded, our method can still detect their head-shoulder parts accurately.

Fig. 6. Example pedestrian detection results of our model under different scenarios.

Fig. 7. Comparison with state-of-the-art methods.

Fig. 8. Comparison results under different IoU thresholds.

To further evaluate the performance of our detection model, we compare our method with several state-of-the-art detection methods, including two traditional methods (the HOG-based method [9] and the DPM method [18]) and the deep learning based Faster R-CNN [15]. In the experiments, the proposed method and the compared methods are trained on the same training dataset and then tested on the same test dataset. With the IoU threshold set to 0.5, the evaluation results are presented in Fig. 7. It can be observed that the performance of the proposed model is close to that of Faster R-CNN, with an mAP difference of merely about 0.5%. However, the results of the two traditional detection methods (HOG and DPM) are very unsatisfactory compared with those of the deep learning based methods. It can be inferred that, in this task, the features learned by deep models are far superior to traditional hand-crafted features.

3.2 Results with Different IoU Thresholds

When testing the proposed model, in order to verify that it can locate pedestrians closer to the ground-truth bounding boxes, we set various IoU thresholds and compare the results of the proposed method and Faster R-CNN. The results are reported in Fig. 8. When the IoU threshold is 0.5 or 0.6, the results of the proposed model are similar to those of Faster R-CNN, but for thresholds greater than 0.6, our method is consistently superior. In particular, when the IoU threshold is 0.9, the mAP of our model is 0.106 while that of Faster R-CNN is only 0.023. These results demonstrate that our method produces detection bounding boxes that fit the ground truth more tightly.

3.3 Detection Speed Analysis

The detection procedures of both the proposed method and Faster R-CNN include the extraction of candidate regions and the computation for each candidate region, so the detection time is determined by the performance of the detection module as well as the number of extracted candidate regions. To compare the detection speed of the two methods comprehensively, we measure the detection time at different image sizes with an equal number of extracted candidate regions, as well as the detection time for different numbers of candidate regions at the same image size. With the number of extracted candidates fixed to 300, the detection time at different image sizes is shown in Table 1. With the image size fixed to \(640 \times 360\), the detection time for different numbers of candidates (300, 1000 and 2000) is reported in Table 2.

Table 1. The detection time (in seconds) with various image sizes.
Table 2. The detection time (in seconds) with various numbers of RoIs.

As observed in Table 1, the detection time of both the proposed model and Faster R-CNN increases with the image size; however, at the same image size, the proposed model consumes less time than Faster R-CNN. Table 2 shows that the proposed model also consumes less time than Faster R-CNN for every candidate number. Moreover, the detection time of our model increases only marginally as the number of candidate regions rises, an increase that is negligible compared with that of the Faster R-CNN model.

4 Conclusion

This paper has presented a novel pedestrian detection method that detects head-shoulders with a convolutional neural network. The head-shoulder detection network integrates candidate region extraction, object classification and location prediction by sharing convolutional features. To embed translation variability into the model, we construct head-shoulder structure-sensitive convolutional feature maps and introduce structure-sensitive pooling. Furthermore, to improve the representation ability of our network on pedestrians' head-shoulders, we pre-train the shared convolutional subnet with a triplet loss, whose parameters are used to initialize the head-shoulder detection network. The experimental results demonstrate that the proposed method performs comparably to the Faster R-CNN model while achieving higher detection speed and higher IoU.