
1 Introduction

Traffic accidents are the leading preventable cause of death worldwide [8]. An emerging approach to this problem is autonomous driving [3]. In recent years, there has been increasing interest in autonomous vehicles: according to the Scopus database, more than 3,500 papers published in 2017 attest to the interest of the research community.

One of the greatest challenges in autonomous driving is object detection, which is needed to make driving decisions. Autonomous vehicles therefore integrate diverse devices to sense their environment; commonly, an autonomous vehicle carries GPS, encoders, inertial measurement units, cameras, and LiDAR sensors [5]. The last two comprise an artificial vision system used for object detection applications.

Since 2010, autonomous vehicles have incorporated 3D vision systems. Among these, LiDAR-based sensors stand out; they reconstruct the surrounding environment in three dimensions using a point cloud representation. Some studies have examined LiDAR as an option for object detection in autonomous driving environments using a 2D LiDAR device [7, 16]. Due to the low resolution of 2D LiDAR sensors, most of these papers use the device as part of a sensor fusion scheme for object detection in which the main sensor is an RGB camera, a scheme that is limited under harsh environmental lighting conditions. Despite the low resolution of 2D LiDAR, the papers that use only this device achieve good results on vehicle detection, but the information is too limited for detecting small objects. With the inclusion of high-definition LiDAR [14] in autonomous driving platforms, a growing number of researchers have used 3D information exclusively for on-road object detection tasks [1, 6, 17].

Early approaches to object detection on LiDAR data focused on handcrafted features, extracted using a variety of techniques [6]. However, as is evident from object detection on RGB images, Deep Learning has in recent years started to dominate object detection on 3D data as well.

Based on the KITTI 3D object detection benchmark [], a number of studies involving Deep Learning have reported the best performance on the detection of different street objects. Some methods listed in the benchmark still use RGB images, even if only for 3D region proposals [9]. However, there are explorations that use LiDAR data exclusively. One outstanding method is VoxelNet, which voxelizes the entire point cloud in order to use it as an organized representation in a deep network architecture [17]. A more recently presented approach feeds a multilayer bird's-eye-view representation of the 3D point cloud into a convolutional network [1]. Despite their results, such representations may lose information in the transformation process, information that can be essential for object detection tasks.

A novel approach for point cloud object classification and segmentation has been proposed by Qi et al. in two architectures, PointNet [10] and PointNet++ [12], which directly and exclusively consume raw point clouds. The architecture is mainly based on a symmetric function that makes the network invariant to the ordering of the input points. PointNet has shown good results on object classification benchmarks such as ModelNet [15]. However, the architecture is limited in the number of points it can process: it takes a fixed number of input points, which makes it inefficient on data with variable resolution. The data captured by a LiDAR sensor have different resolutions at different distances to the object, which entails the need to adapt the PointNet architecture for autonomous vehicle applications.
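To make the order-invariance idea concrete, the following minimal NumPy sketch (our illustration, not code from [10]) applies a shared per-point linear map followed by max-pooling, the kind of symmetric function PointNet builds on, and checks that the pooled feature is unchanged when the points are shuffled:

```python
import numpy as np

rng = np.random.default_rng(0)
points = rng.normal(size=(128, 3))    # an unordered set of 128 points
w = rng.normal(size=(3, 64))          # shared per-point map (stand-in for the MLP)
feature = np.max(points @ w, axis=0)  # max-pooling: a symmetric aggregation

shuffled = rng.permutation(points)    # reorder the same points
assert np.allclose(feature, np.max(shuffled @ w, axis=0))  # identical feature
```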

In this paper, a PointNet evaluation for on-road object detection using LiDAR data is presented. Our work has three main blocks: first, a 3D object generator, which segments a LiDAR point cloud scene into 3D on-road objects; second, a multi-resolution conditioning stage, which adjusts each object point cloud to a specific resolution; and third, six PointNet models trained at six different resolutions.

This paper is structured as follows: Sect. 2 describes the methodology used to design our method, and Sect. 3 contains the experimental results and conclusions.

2 Framework Overview

Figure 1 represents the global architecture of our approach: block 1 presents the segmentation process, block 2 the resolution conditioning process, and block 3 the multi-resolution approach based on PointNet.

Fig. 1. General scheme of our traffic detection method using a multi-resolution PointNet architecture.

In the following sub-sections, each block of Fig. 1 is explained, together with some tools and techniques needed for the process.

2.1 Segmentation

The elimination of ground points is a necessary step for segmenting non-ground elements. For this purpose, the ground plane is modeled with the Random Sample Consensus (RANSAC) algorithm [13], which finds the plane model containing the highest number of points within a defined distance. The RANSAC search is restricted to planes within 3\(^{\circ }\) of the horizontal plane defined by the Z axis and the height of the sensor position. The result of applying the algorithm is presented in Fig. 2b.
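A minimal NumPy sketch of this step follows; it is our illustration, not the authors' code. The 3\(^{\circ }\) angular constraint comes from the paper, while the 20 cm inlier distance and the iteration count are assumed values:

```python
import numpy as np

def ransac_ground_plane(points, dist_thresh=0.2, max_angle_deg=3.0, n_iters=200):
    """Fit a near-horizontal ground plane with RANSAC.

    points: (N, 3) array of LiDAR returns (x, y, z).
    Returns a boolean mask marking the ground inliers.
    """
    rng = np.random.default_rng(0)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue                      # degenerate (collinear) sample
        normal /= norm
        # Keep only candidate planes within 3 degrees of horizontal.
        if np.degrees(np.arccos(abs(normal[2]))) > max_angle_deg:
            continue
        d = -normal @ sample[0]
        mask = np.abs(points @ normal + d) < dist_thresh
        if mask.sum() > best_mask.sum():  # plane with the most inliers wins
            best_mask = mask
    return best_mask

# Usage on a KITTI Velodyne frame (x, y, z, reflectance):
# cloud = np.fromfile("000000.bin", dtype=np.float32).reshape(-1, 4)[:, :3]
# non_ground = cloud[~ransac_ground_plane(cloud)]
```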

Fig. 2. Segmentation process. (a) Original LiDAR point cloud frame. (b) LiDAR point cloud frame after ground extraction. (c) Occupation grid representation with an object bounding box. (d) 3D bounding box from occupation grid.

The occupation grid is a two-dimensional approach that has been used efficiently for pedestrian detection [6]. Starting from the point cloud without the ground plane points, a mesh of 20 cm \(\times \) 20 cm cells is established over the XY plane, taking 20 cm as a prudent separation for declaring two point clusters to be different objects. All the points in each cell of the mesh are projected onto a single plane, and the number of points in each cell is used as the input of the occupation array. To ignore very-low-resolution elements, cells with fewer than 4 points are removed.

Figure 2c shows an occupation array where the white points represent 20 cm \(\times \) 20 cm cells with more than 3 points. This array is processed by a connected-region algorithm to determine the groups of cells that compose an object [2]. From the connected regions, 2D bounding boxes are extracted and transformed into 3D bounding boxes with an extra margin to avoid losing points, as shown in Fig. 2d. The point cloud inside each 3D bounding box is stored together with its label information in H5 format, which will be consumed by the deep network architecture.
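The following sketch summarizes this grid-and-cluster pipeline, with SciPy's connected-component labeling standing in for the connected-region algorithm of [2]. The 0.3 m box margin is an assumed value, since the paper does not specify the margin:

```python
import numpy as np
from scipy import ndimage

def extract_objects(points, cell=0.2, min_pts=4, margin=0.3):
    """Cluster non-ground points through a 2D occupation grid.

    points: (N, 3) non-ground LiDAR points; cell: 20 cm grid resolution;
    min_pts: cells with fewer points are discarded.
    Returns one point array per detected object.
    """
    ij = np.floor((points[:, :2] - points[:, :2].min(axis=0)) / cell).astype(int)
    counts = np.zeros(ij.max(axis=0) + 1, dtype=int)
    np.add.at(counts, (ij[:, 0], ij[:, 1]), 1)       # points per cell
    labels, n = ndimage.label(counts >= min_pts)     # connected regions
    objects = []
    for k in range(1, n + 1):
        obj = points[labels[ij[:, 0], ij[:, 1]] == k]
        lo, hi = obj.min(axis=0) - margin, obj.max(axis=0) + margin
        box = np.all((points >= lo) & (points <= hi), axis=1)
        objects.append(points[box])                  # 3D box with margin
    return objects

# Each object array could then be written to an H5 file with h5py,
# together with its label, as the paper describes.
```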

2.2 Resolution Conditioning

In order to take advantage of parallel computing, deep network architectures process batches of data simultaneously, which reduces processing time. Such a batch must be made up of items with the same size or shape. By the nature of 3D point cloud representations, each object in a given point cloud has a different number of points, so a group of different objects has a variable shape; for that reason, resolution conditioning is required to unify the batch shape for deep network training.

As our approach is based on multiple resolutions, the training data are separated depending on the number of points composing each object. As explained in Subsect. 2.3, our approach uses six resolutions (16, 32, 64, 128, 256, and 512 points). For objects with more than 512 points, a subset of 512 points is randomly selected; for objects with between 256 and 512 points, 256 points are randomly selected, and likewise for the remaining ranges, except for objects with fewer than 16 points. In that case, points are added to reach the 16-point resolution using the simplest up-sampling procedure: the existing points of the object are repeated until the desired number of points is reached. Since the objective of this paper is to evaluate the performance of the PointNet architecture itself, we deliberately avoid enhancing the data fed into it.
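A minimal sketch of this conditioning rule, as we read it from the description above (the function name and interface are ours):

```python
import numpy as np

RESOLUTIONS = (512, 256, 128, 64, 32, 16)    # descending order

def condition_resolution(obj, seed=0):
    """Map an object's (N, 3) point array to one of the six resolutions."""
    rng = np.random.default_rng(seed)
    n = len(obj)
    for r in RESOLUTIONS:
        if n >= r:                           # down-sample to the largest
            return obj[rng.choice(n, r, replace=False)], r
    idx = np.resize(np.arange(n), 16)        # fewer than 16 points:
    return obj[idx], 16                      # repeat points up to 16
```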

2.3 PointNet Architecture for Multiple Resolution Objects

Because the objects represented by a LiDAR point cloud come at multiple resolutions, a method that can handle a highly variable number of points is required. For this purpose, a parallel architecture of PointNets is established, which processes point clouds in six resolution ranges separately. This architecture learns resolution-specific features, taking advantage of the information available in each object. Moreover, this approach avoids the excessive down-sampling or up-sampling required by a single-resolution PointNet.

The evaluation approach is made up of six PointNets, each trained with a different batch shape. During evaluation, the trained model is selected depending on the number of points representing the object to be classified. This ensures that the evaluation process has no more computational cost than a single-resolution PointNet.
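Dispatching at evaluation time then reduces to a lookup by resolution, as in this hypothetical sketch; `models` and its `predict` interface are assumptions for illustration:

```python
# models: dict mapping resolution -> a trained PointNet classifier that
# exposes predict(batch); one checkpoint per resolution (assumed interface).
def classify(obj, models):
    points, r = condition_resolution(obj)        # sketch from Subsect. 2.2
    return models[r].predict(points[None, ...])  # batch containing one object
```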

3 Experiments

3.1 Dataset

The data used in our approach were obtained from the KITTI vision benchmark suite. The dataset has 22 sequences of LiDAR data, representing 39.2 km of driving in diverse autonomous driving environments. From all the data in the dataset, the 3D object detection benchmark selected 12,000 frames of three-dimensional information acquired by an HD LiDAR sensor [4]. From these 29 GB of LiDAR data, over 40,000 individual objects were extracted and stored in H5 format with their respective labels.

More than 30,000 objects in the dataset are small vehicles, labeled as Car-Van. This makes the database highly unbalanced, which can hinder the training process. To mitigate this issue, new examples were generated from the existing data through a sub-sampling process that simulates different acquisition distances for the same object.

Existing examples with enough points are randomly sub-sampled to generate new objects simulating acquisition distances of up to 70 m. This process is performed on pedestrians, cyclists, and miscellaneous elements, classes with a low number of examples compared to the vehicles in the dataset. At the end of the process, the dataset is made up of 58,566 objects at multiple resolutions, and it is divided into 80% for training and 20% for evaluation.
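One plausible reading of this augmentation is sketched below; the paper does not state how the retained fraction relates to distance, so the inverse-square scaling here is purely our assumption:

```python
import numpy as np

def simulate_distance(obj, d_orig, d_new, seed=0):
    """Sub-sample an object to mimic acquisition from farther away.

    Assumption (not stated in the paper): the number of LiDAR returns
    on an object falls roughly with the square of its distance.
    """
    rng = np.random.default_rng(seed)
    keep = int(len(obj) * (d_orig / d_new) ** 2)
    keep = min(max(keep, 1), len(obj))
    return obj[rng.choice(len(obj), keep, replace=False)]
```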

3.2 Training Process

The model described in Subsect. 2.3 is trained in six stages defined by the resolution groups. In order to establish individually the models that process each point cloud resolution, each PointNet architecture was trained separately. The 46,853 objects in the training dataset were conditioned following the procedure described in Subsect. 2.2. Since some objects appear at extremely low resolution, a threshold of eight points was defined, leaving elements with fewer than eight points out of the training process. To increase the training data, off-the-shelf methods for generating new point clouds from existing data were applied [11]: from each training object, a new object was created through a random rotation about its Z axis, and others were created by adding a random level of noise that displaces points by 1 cm to 4 cm. The added random noise also gives the model some robustness in environments such as mild fog or rainy conditions. As our approach uses six PointNets, the training time increases compared to a single-resolution PointNet; training our proposal took about 45 h, nearly double that of a single-resolution PointNet trained with 512 points, on a basic computer with a single GTX 650 Ti GPU.
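The two augmentations can be sketched as follows; the Z rotation and the 1 cm to 4 cm displacement follow the paper, while everything else (names, uniform sampling of the jitter radius) is our choice:

```python
import numpy as np

def augment(obj, seed=0):
    """Return a Z-rotated copy and a jittered copy of an (N, 3) object."""
    rng = np.random.default_rng(seed)
    theta = rng.uniform(0.0, 2.0 * np.pi)            # random yaw angle
    c, s = np.cos(theta), np.sin(theta)
    rot_z = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    rotated = obj @ rot_z.T                          # rotation about Z
    direction = rng.normal(size=obj.shape)           # random directions
    direction /= np.linalg.norm(direction, axis=1, keepdims=True)
    radius = rng.uniform(0.01, 0.04, size=(len(obj), 1))  # 1-4 cm shift
    return rotated, obj + direction * radius
```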

3.3 Evaluation Process

To assess the performance of the proposed multi-resolution approach, a single-resolution PointNet was tested with the whole testing dataset. In order to preserve the original information of objects whose resolution falls below the one defined by the architecture, an up-sampling process using repeated points was applied. The single-resolution PointNet was evaluated at the six resolutions separately in order to observe the detection behavior at each resolution. The results are presented in the first six rows of Table 1. As can be seen, low-resolution objects negatively affect the performance of the PointNet object detector when it is used with 512 and 256 points. Nevertheless, the performance of this architecture on the data used for training and evaluation is outstanding. Interestingly, the results in Table 1 indicate that the resolution at which each class achieves its best result is not the same for all classes; this means that if a model with one specific resolution were chosen, some classes would see their score decrease. As for the experimental evaluation of the multi-resolution architecture, the last row of Table 1 illustrates the performance of the proposed method.

Taken together, these results show that even though the single-resolution PointNet can obtain slightly better results on certain classes, our multi-resolution adaptation improves the overall performance by 3.5% over the 32-point single-resolution PointNet, which is the best result for the single-resolution architecture.

Table 1. Classification results of PointNet on the KITTI 3D object detection dataset

To illustrate the performance on the KITTI dataset, Fig. 3 shows two frames with the detection results.

Fig. 3. Performance on the KITTI 3D object detection dataset. Green boxes contain the good predictions (Color figure online)

4 Conclusions

The PointNet architecture was evaluated on the detection of on-road 3D objects using multiple resolutions of point cloud data, without using any RGB information. The proposed resolution conditioning of 3D objects allowed us to use the point cloud representations of these objects in multiple resolution-specific networks. This multi-resolution PointNet approach yielded an improvement of 3.5% in overall AP compared with the best-performing single-resolution configuration. On the pedestrian class, the average precision of detection was 99%, although it is worth noting that this class is easily distinguishable from the other objects because of its size characteristics. Future work will entail a labeling process to increase the number of classes in the dataset, for example by including road signs and trees; with this, we aim to obtain a more robust detector over the most common on-road objects. The labeling process should also pursue data balancing in order to improve the training process.