1 Introduction

Pedestrian detection is a core task in autonomous driving, where accurate and robust detection has a direct impact on the planning and decision making of autonomous vehicles [14]. In addition, pedestrian detection forms the basis for many promising vision tasks, such as pedestrian tracking [11], crowd sensing [25], activity reasoning [24], etc. Besides, the pedestrian, as a main traffic element, plays an influential role in traffic scene understanding and mapping [6]. Hence, many efforts have been devoted to its progress. However, there is still a large margin for improving detection performance, mainly because of many challenging factors: covering all pedestrians at different scales, distinct illumination, partial occlusion, motion blur, similar appearance to other non-human objects, and so forth.

Fig. 1.

Detection results on one image: the left is generated by Faster R-CNN [20] and the right by R-FCN [3].

Facing these problems, many works have been proposed. Among them, convolutional neural network (CNN) based models have achieved the best performance. For example, the faster region-based convolutional neural network (Faster R-CNN) [20] uses 9 anchors for bounding box regression, where a region proposal network (RPN) is embedded to speed up the proposal generation procedure. Redmon et al. [18] proposed the YOLO detection module, which predicts the coordinates of bounding boxes directly using fully connected layers on top of the convolutional feature extractor. Subsequently, some variants of YOLO were put forward, such as YOLOv2 and YOLO9000 [19]. The single shot multibox detector (SSD) [12] initializes a set of default boxes over different aspect ratios and scales within a feature map, and discretizes the output space of bounding boxes into these boxes. Although these works have complex architectures and delve into the intrinsic pedestrian representation, none of them obtains satisfactory performance; see Fig. 1 for a demonstration. One reason may be the dynamic challenging factors mentioned before, but another, more important reason is that it is difficult to learn an invariant representation of pedestrians in diverse environments. Supplemented by a 3D LiDAR sensor, we can gather physical geometric information about pedestrians, such as the height from the ground, area of occupancy, etc. This information can also be treated as a spatial context clue for inference. Actually, there is one former work [21] that addressed pedestrian detection by fusing LiDAR and visual clues. However, that method obtains neither a good calibration of the visual and LiDAR clues nor accurate detection, since it relies on naive neural networks. Though there are some works on detection using LADAR or laser sensors [13, 23], they are based on the hypothesis that all dynamic objects in front are pedestrians, so no object-class knowledge is exploited. In other words, LiDAR cannot distinguish the classes of different objects, but cameras can. Hence, it is natural to fuse the camera and LiDAR sensors together, which requires a calibration to tackle their heterogeneous and asynchronous properties. Actually, the vision+X paradigm is becoming the main trend for scene understanding.

To this end, this work first performs an accurate calibration of the visual and LiDAR sensors and updates the calibration parameters in an online way. Second, we take Faster R-CNN as the basis for generating pedestrian proposals and eliminate wrong detections by imposing constraints from physical geometric clues, including the dominant distance of the pedestrian within a proposal, the height from the ground, and the dynamic variation of the area occupancy of pedestrians. By that, the pedestrian proposals generated by Faster R-CNN are significantly cleaned. The detailed flowchart is demonstrated in Fig. 2.

Fig. 2.

The flowchart of the proposed method.

2 Related Works

This work mainly aims to boost CNN-based pedestrian detection performance with the auxiliary of a 3D LiDAR sensor. We review the related works on CNN-based pedestrian detectors and on detection modules using non-vision approaches, such as LiDAR, laser, etc.

CNN-based pedestrian detection: Recently, there have been many detection works of interest built on deep convolutional neural networks (CNNs) [4, 12]. Within this framework, great progress in pedestrian detection has been made compared with previous works based on hand-crafted features, such as the deformable part-based model (DPM) [5]. The core purpose of these CNN-based detectors is to capture the intrinsic or structural information implied by large-scale pedestrian samples with respect to the scale space [12, 20, 27] or geometric constraints, such as part geometry [16]. For example, Faster R-CNN [20], inspired by R-CNN [7], samples object proposals with multiple anchor scales and speeds up proposal generation with a region proposal network (RPN). Cai et al. [2] proposed a unified multi-scale deep neural network (denoted as MS-CNN) to address the scale issue. A similar issue was also considered in the work on scale-adaptive deconvolutional regression (SADR) [27] and scale-aware Fast R-CNN [10]. The single shot multibox detector (SSD) [12] predicts category scores and box offsets for a set of default bounding boxes on feature maps, which is faster than the single-box module of YOLO [18]. Besides the scale issue, some studies concentrate on the structural information implied by different parts of pedestrians. Within this category, Ouyang et al. [16] jointly estimated the visibility relationship of different parts of the same pedestrian to solve the partial-occlusion problem. They also proposed a deformable deep convolutional neural network for generic object detection [17], where they introduced a new deformation-constrained pooling layer modeling the deformation of object parts with geometric constraints and penalties. Although these CNN-based detectors search for an intrinsic and structural representation of pedestrians, robust detection still remains very difficult because of the diverse environments.

Non-vision pedestrian detection: Apart from the universal vision-based modules for pedestrian detection, some researchers have explored this problem with non-vision approaches, including LiDAR [8, 23], LADAR [13], and so on. Within this domain, geometric features, such as the edge, skeleton, and width of the scan line, are the main kinds of features. For example, Navarro-Serment et al. [13] utilized LADAR to detect pedestrians under the constraint of height from the ground. Oliveira and Nunes [15] introduced a LiDAR sensor to segment the scan lines of pedestrians from the background, taking spatial context into consideration. Börcs et al. [1] detected objects instantly by 3D LiDAR point cloud segmentation, where a convolutional neural network was utilized to learn object information from a depth image estimated from the 3D LiDAR point cloud. Wang et al. [23] also adopted a 3D LiDAR sensor to detect and track pedestrians. In their work, they first clustered the point cloud into several blobs and manually labeled many samples. Then a support vector machine (SVM) was used to learn the geometric clues of pedestrians.

In summary, the information acquired by non-vision sensors consists only of geometric clues without explicit class information. Hence, in some circumstances, frequent false detections are generated, whereas vision-based methods can distinguish different classes. Nevertheless, non-vision modules are superior to vision-based ones in adapting to different environments. It is therefore natural to fuse camera and non-vision modules together to obtain a boosted detection performance. Hence, this work utilizes a 3D LiDAR sensor as an attempt.

3 Accurate Calibration of 3D LiDAR and Camera

For boosting pedestrian detection performance, the primary task is to calibrate the camera to the 3D LiDAR, since both sensors must target the same objects. The calibration amounts to computing the intrinsic parameters of the camera and the extrinsic parameters relating the two sensors, i.e., the translation vector \(\mathbf{{t}}\) and the rotation matrix \(\mathbf{{R}}\in \mathbb {R}^{3\times 3}\). In this work, the intrinsic camera parameters are computed by Zhang's calibration method [26]. For the extrinsic parameters, this work introduces an online automatic calibration method [9] to carry out an accurate calibration of our camera and 3D LiDAR sensors. It aims to maximize the overlap of geometric structure. Different from other off-line calibrations [22, 26], it optimizes the extrinsic parameters using the most recently observed frames. Specifically, six values are calculated during the optimization: the {\(\varDelta x\), \(\varDelta y\), \(\varDelta z\)} translations and the {roll, pitch, yaw} Euler-angle rotations between the camera and the 3D LiDAR sensor. Given a calibration \(\mathbf{{t}}\) and \(\mathbf{{R}}\), we first project the 3D LiDAR points onto the image plane of the visual camera. Then, the objective function for optimization is specified as:

$$\begin{aligned} \max \; \sum \limits _{f = n - w}^{n} \sum \limits _{p = 1}^{\left| V^f \right| } V_p^f \, S_{i,j}^f, \end{aligned}$$
(1)

where w is the number of frames used for optimization (set to 9 in this work), n is the newest observed video frame, p is the index into the 3D point set \(\{{V_p^f}\}_{p=1}^{|V^f|}\) obtained by the 3D LiDAR sensor, and \(S_{i,j}^f\) is the value at pixel (i, j) of the edge map S in the \(f^{th}\) frame, i.e., at the location onto which \(V_p^f\) projects. Note that the point sets from the 3D LiDAR and the camera do not cover the whole plane. Actually, the points from both sensors are edge points. For the image, \(S_{i,j}^f\) is extracted by edge detection followed by an inverse distance transform, and \(\{{V_p^f}\}\) is obtained by computing the distance differences of the scene relative to the 3D LiDAR (taken as the origin of coordinates). Some typical calibration results are shown in Fig. 3, from which we can see that highly accurate calibration results are obtained. Thus, we accomplish the calibration of the camera and 3D LiDAR sensors.
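To make Eq. (1) concrete, the following is a minimal sketch of how a single candidate extrinsic pair \((\mathbf{{R}}, \mathbf{{t}})\) might be scored over the last w frames; the function name, array conventions, and the outer search over candidates are our own assumptions and not the implementation of [9].

```python
import numpy as np

def calibration_score(R, t, K, lidar_edge_points, edge_images):
    """Evaluate Eq. (1) for one candidate extrinsic calibration (R, t).

    lidar_edge_points: list (one entry per frame) of (N_f, 3) arrays of
        LiDAR edge points V^f in the sensor frame.
    edge_images: list of 2D arrays S^f, i.e. the inverse-distance-transformed
        edge maps of the corresponding camera frames.
    K: 3x3 camera intrinsic matrix (e.g. from Zhang's calibration).
    """
    score = 0.0
    for V, S in zip(lidar_edge_points, edge_images):
        # Transform LiDAR edge points into the camera frame and project.
        P_cam = (R @ V.T).T + t                 # (N, 3)
        P_cam = P_cam[P_cam[:, 2] > 0]          # keep points in front of camera
        pix = (K @ P_cam.T).T
        u = (pix[:, 0] / pix[:, 2]).astype(int)
        v = (pix[:, 1] / pix[:, 2]).astype(int)
        h, w = S.shape
        valid = (u >= 0) & (u < w) & (v >= 0) & (v < h)
        # Sum the edge-map response S^f_{i,j} at every projected edge point.
        score += S[v[valid], u[valid]].sum()
    return score
```

In an online calibration of this kind, the score would be re-evaluated for small perturbations of the six extrinsic values around the current estimate, and the best-scoring candidate kept for the next frames.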

Fig. 3.

Typical calibration results of camera and 3D LiDAR.

4 Boosting CNN-Based Detectors by Fusing Physical Geometric Clues of Pedestrians

4.1 Pedestrian Proposal Generation by CNN-Based Detectors

After the calibration, we have a fundamental precondition for tackling the pedestrian detection problem by fusing visual color and the real distance of the target. However, even with the calibration in place, some issues remain for detection. The main difficulty is the heterogeneous property, i.e., the sparsity and the physical meaning of the points from the two sensors are rather different. In addition, the 3D points captured by LiDAR do not carry class information. Therefore, in this work, we treat the CNN-based detector as the basis and use physical geometric clues from the 3D points to rectify the generated pedestrian proposals. Recently, many works with deep network architectures have addressed pedestrian detection; however, none of them achieves satisfactory performance. Therefore, this work takes Faster R-CNN [20] as an attempt, and the erroneous pedestrian proposals are eliminated by fusing the following physical geometric clues.

4.2 Physical Geometric Clue Fusion for Pedestrian Detection

In this subsection, we describe in detail how the physical geometric clues extracted by the 3D LiDAR are fused. As is well known, the height of most walking persons falls in the range of 1 to 2 m, and a person occupies a region of at most \(0.5 \times 2\) m\(^2\). In addition, the region occupied by a human remains relatively static. Therefore, this work extracts static and dynamic physical geometric clues of the pedestrian, including the height from the ground, the occupancy dominance within a pedestrian proposal, and the dynamic occupancy variation with respect to the scale variation of the proposal.

(1) Static Geometrical Clues

Occupancy dominance (OD): The pedestrian proposals are generally represented by bounding boxes. We observe that the 3D points are distributed sparsely and roughly uniformly within a bounding box. The distance of each 3D point in a bounding box is computed as \(r=\sqrt{x^2+y^2+z^2}\), where r represents the distance of a 3D point (x, y, z). Note that, because of the sparsity of the 3D points, some pixels in the color image have no distance information, usually denoted as \((\infty ,\infty ,\infty )\). Besides, a bounding box inevitably contains some background region whose distance is much larger than that of the pedestrian. In addition, the distances of the 3D points on a pedestrian are always similar, and the pedestrian occupies the dominant part of the bounding box. Inspired by this insight, this work puts forward occupancy dominance to eliminate bounding boxes whose content differs substantially from a pedestrian. Specifically, we sort the distances of the 3D points in a bounding box in ascending order and observe that a true pedestrian always yields the widest zone of nearly constant distance; see Fig. 4 for an example. Thus, the main step of occupancy dominance is to extract the largest smooth part of the sorted distance curve. For this purpose, we compute the difference between adjacent points on this distance curve and set the difference to 0 when it is lower than 0.3 m. Then, we segment the curve into several fragments, and the length of the largest fragment is taken as the occupancy of the bounding box. With this clue, we can get rid of proposals without a dominant object region.

Fig. 4.

The illustration of occupancy dominance.
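Below is a minimal sketch of how the occupancy-dominance measure could be computed for one proposal; the 0.3 m gap threshold follows the text, while the function name and the array conventions are our own assumptions.

```python
import numpy as np

def occupancy_dominance(points_in_box, gap=0.3):
    """Length of the largest 'flat' fragment of the sorted distance curve.

    points_in_box: (N, 3) array of 3D LiDAR points (x, y, z) whose image
    projections fall inside one pedestrian proposal; points without range
    data (inf) are ignored.
    """
    r = np.linalg.norm(points_in_box, axis=1)   # r = sqrt(x^2 + y^2 + z^2)
    r = np.sort(r[np.isfinite(r)])              # ascending distance curve
    if r.size == 0:
        return 0
    # Split the curve wherever adjacent distances differ by 0.3 m or more;
    # each piece is one fragment of roughly constant distance.
    cut_points = np.where(np.diff(r) >= gap)[0] + 1
    fragments = np.split(r, cut_points)
    # The occupancy of the box is the length of the largest fragment.
    return max(len(f) for f in fragments)
```

A proposal would then be discarded when its largest fragment does not clearly dominate the points inside the box; the exact rejection threshold is not specified in the text.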

Height-width constraint (HC): In driving circumstances, the height of a walking pedestrian usually falls into a finite range, e.g., from 1.2 m to 2 m. Therefore, given a pedestrian proposal, its height cannot exceed 2.5 m. In this paper, for a bounding box, we specify the height constraint as \(0.8<(h_{max}-h_{min})<1.5\) m. With this constraint, proposals that are too small or too large are removed.
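A corresponding sketch of the height constraint on the 3D points inside a proposal is given below; treating z as the vertical axis is our assumption, and in practice the check would be applied to the dominant-fragment points found by OD.

```python
import numpy as np

def height_constraint_ok(points_in_box, lo=0.8, hi=1.5):
    """HC: keep a proposal only if the vertical extent of its 3D points
    satisfies 0.8 < (h_max - h_min) < 1.5 metres."""
    z = points_in_box[:, 2]          # assumed up axis
    z = z[np.isfinite(z)]
    if z.size == 0:
        return False
    return lo < (z.max() - z.min()) < hi
```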

(2) Dynamic Geometrical Clues

Dynamic occupancy (DO): In addition to the static clues, we also exploit dynamic clues to remove wrongly detected proposals. The reason is that the occupancy (defined before) of the human body in the bounding box remains constant, i.e., the fragment length in Fig. 4 remains almost unchanged when the scale of the bounding box varies. On the contrary, objects such as trees, which are often falsely detected as pedestrians, may have rather different sizes and exhibit a dynamic occupancy (denoted as DO) when the scale of the bounding box varies. Hence, we further assess the quality of a pedestrian proposal by varying the height of its bounding box and examining whether the dominant occupancy varies in direct proportion to the height. If not, the proposal is a pedestrian proposal. Specifically, dynamic occupancy (DO) in this paper is computed by enlarging the height of the bounding box by a factor of 1.3.
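A rough sketch of the dynamic-occupancy check is shown below, reusing occupancy_dominance from the earlier sketch; the 1.3x height enlargement follows the text, while the point-selection helper, the tolerance, and the symmetric growth of the box are our own assumptions.

```python
import numpy as np

def points_inside(uv, xyz, box):
    """Select 3D points whose calibrated image projections uv fall inside box.
    uv: (N, 2) projected pixel coordinates; xyz: (N, 3) LiDAR points;
    box: (x1, y1, x2, y2) in pixels."""
    x1, y1, x2, y2 = box
    m = (uv[:, 0] >= x1) & (uv[:, 0] <= x2) & (uv[:, 1] >= y1) & (uv[:, 1] <= y2)
    return xyz[m]

def dynamic_occupancy_ok(uv, xyz, box, scale=1.3, tol=0.2):
    """DO: enlarge the proposal height by `scale` and recompute occupancy
    dominance. A true pedestrian keeps roughly the same dominant fragment,
    whereas background structures such as trees grow together with the box."""
    x1, y1, x2, y2 = box
    grow = 0.5 * (scale - 1.0) * (y2 - y1)
    enlarged = (x1, y1 - grow, x2, y2 + grow)
    occ = occupancy_dominance(points_inside(uv, xyz, box))
    occ_big = occupancy_dominance(points_inside(uv, xyz, enlarged))
    # Reject the proposal if the dominant occupancy grows with the box height.
    return occ > 0 and occ_big <= (1.0 + tol) * occ
```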

Although the above clues are all quite simple, they are intuitive and, as verified by the following experiments, they significantly boost the performance of the CNN-based detector.
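Putting the clues together, the overall filtering of Faster R-CNN proposals might look like the sketch below, built from the functions sketched above; the minimum-occupancy threshold is hypothetical.

```python
def filter_proposals(proposals, uv, xyz, occ_min=30):
    """Keep only the detector proposals that satisfy all geometric clues.
    proposals: list of boxes (x1, y1, x2, y2); occ_min is a hypothetical
    minimum number of points in the dominant fragment."""
    kept = []
    for box in proposals:
        pts = points_inside(uv, xyz, box)
        if (occupancy_dominance(pts) >= occ_min           # OD
                and height_constraint_ok(pts)             # HC
                and dynamic_occupancy_ok(uv, xyz, box)):  # DO
            kept.append(box)
    return kept
```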

5 Experiments and Discussions

5.1 Dataset Acquisition

We collect the experimental data with an autonomous vehicle named "Kuafu", developed by the Laboratory of Visual Cognitive Computing and Intelligent Vehicles of Xi'an Jiaotong University. In this work, a Velodyne HDL-64E S2 LiDAR sensor with 64 beams and a high-resolution camera system with differential GPS/inertial information are equipped in the acquisition system. The visual camera has a resolution of \(1920\times 1200\) and a frame rate of 25 fps. In addition, the scanning frequency of the 3D LiDAR is 10 Hz. The dataset contains 5000 frames with 5771 pedestrian proposals in the ground truth, manually labeled by ourselves. It is worth noting that this work treats detected proposals with a detection score larger than 0.8 as truly detected pedestrians. Hence the performance of the proposed method cannot be represented by a precision-recall curve.

5.2 Metrics for Evaluation

To evaluate the performance, this paper uses the precision and recall values. The precision value is the ratio of proposals correctly detected as pedestrians to all detected proposals, while the recall value is the percentage of detected pedestrian proposals relative to the number of ground-truth pedestrians. For the performance evaluation, this work adds the constraints, i.e., OD, HC, and DO, gradually, so that the contribution of each clue can be presented. In addition, the occupancy dominance (OD) clue is essential for HC and DO; therefore, we deploy it in all configurations.
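For completeness, a minimal sketch of how the two metrics are computed from the detection counts; the variable names are ours.

```python
def precision_recall(num_detected, num_correct, num_ground_truth):
    """Precision: correct detections over all detected proposals.
    Recall: correct detections over all ground-truth pedestrians."""
    precision = num_correct / num_detected if num_detected else 0.0
    recall = num_correct / num_ground_truth if num_ground_truth else 0.0
    return precision, recall
```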

Table 1. Precision and recall values for different physical geometric clue embeddings. For a clearer comparison, we also report the numbers of detected proposals (DPs) and wrongly detected proposals (WDPs). The best precision and recall values are marked in bold.
Fig. 5.

Some typical snapshots of the results generated by embedding different clues. The first row shows the results of Faster R-CNN [20]. The second row shows the results of RGB+3D-LiDAR with the OD and DO clues embedded. The third row shows the results after embedding OD and HC, and the results with all physical geometric clues are presented in the last row.

5.3 Performance Evaluation

The detection efficiency of the proposed method is 5 fps. Table 1 reports the precision and recall values after embedding different physical geometric clues. From this table, we can observe that the more clues are added, the better the precision and the worse the recall. It may seem that adding more clues prevents the detector from robustly detecting all pedestrians, as in the \(1606^{th}\) and \(1651^{th}\) frames. Actually, by inspecting the visual results, more clues are necessary, as they remove the wrongly detected proposals to a larger extent. Meanwhile, the drop in the recall value when embedding all clues is mainly caused by our removal of pedestrian proposals whose distance from our vehicle is larger than about 50 m, which is totally acceptable in practical situations; the \(1623^{th}\) frame is an example (Fig. 5).

5.4 Discussions

In this work, we only take Faster R-CNN [20] as an attempt. Actually, the specific detector is not the focus, and the approach applies similarly to other CNN-based detectors. In addition, the way the 3D LiDAR is utilized is not restricted to this kind of module. The purpose of this work is to show that the performance of CNN-based detectors can be boosted by fusing some simple and intuitive geometric clues extracted from a 3D LiDAR sensor, and that convincing results can be generated.

6 Conclusion

This paper introduced the 3D LiDAR sensor in a novel way to boost the performance of CNN-based detectors, with Faster R-CNN utilized as an attempt. Facing the heterogeneous and asynchronous properties of the two different sensors, this work first calibrated the RGB and LiDAR data with an online module which can adapt to dynamic scenes more effectively. Then, some physical geometric clues acquired by the 3D LiDAR were exploited to eliminate the erroneous pedestrian proposals. Exhaustive experiments verified the superiority of the proposed method. In the future, richer fusion modules for camera and 3D LiDAR will be our focus.