CrossFusion net: Deep 3D object detection based on RGB images and point clouds in autonomous driving
Introduction
Thanks to the rapid development of intelligent vehicles, autonomous driving has become a popular research topic in recent years. The most important requirement for autonomous driving is to understand the surroundings of a vehicle, and one crucial key is 3D detection. By reasoning about the surroundings of the vehicle in 3D, the system can make correct decisions in a wide variety of situations. Nowadays, many intelligent vehicles are equipped with multiple sensors at the same time, such as cameras, LiDAR, and an inertial measurement unit (IMU). This motivates researchers to combine different sensors to conduct 3D object detection. In this work, the authors aim to design an accurate and stable 3D detector based on cameras and LiDARs. An effective fusion of RGB images and LiDAR point clouds should be capable of supplying much richer information.
Although 2D object detection has achieved great success on famous datasets such as ImageNet [6], MS COCO [8], and KITTI [10], 3D object detection remains an open problem because of the additional depth dimension. In most cases, 2D object detectors such as YOLO [12] and Fast R-CNN [14] take only RGB images as inputs. However, the lack of depth can be a fatal flaw that leads to coarse results in 3D object detection. Hence, we propose a fusion-based network that takes advantage of mature 2D object detection methods. With the presence of LiDAR point clouds, the network is able to learn more representative information. Besides, each sensor has its own merits. Specifically, LiDAR provides accurate depth information under various weather conditions but misses distant details, whereas the camera preserves detailed front-view information but suffers under adverse weather. The purpose of this work is 3D object detection that exploits both RGB images and point clouds in on-road scenes. Specifically, the study focuses on combining different sensors so that they benefit each other.
Recently, some novel image-based methods explored the use of monocular [1], [15], [16], [17] or stereo [2], [4] images. Images usually provided detailed and dense measurements of the front view. However, these methods were limited by the loss of depth information. On the other hand, LiDAR-based 3D object detection methods developed rapidly afterwards. LiDAR brought far more accurate depth information, enabling effective localization and shape description. Nevertheless, point clouds are unordered and sparse. To deal with this problem, VoxelNet [7] grouped the points into voxel grids, while Simony et al. [9] and Yu et al. [20] projected point clouds onto a plane such as the bird's-eye view (BEV) or the front view to avoid the high computational cost of 3D convolution. In addition, PointNet [19] directly processed point clouds by exploiting their permutation invariance. However, LiDAR struggled with distant detection because of its natural sparsity at range. As for methods fusing both RGB images and LiDAR point clouds, ContFuse [3] successfully combined two streams of feature maps with different combinations of fusion.
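The BEV projection mentioned above can be sketched as follows. Grid ranges, resolution, and the choice of per-cell features (max height and max intensity) are illustrative assumptions, not settings taken from any of the cited methods.

```python
import numpy as np

def point_cloud_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                       z_range=(-2.0, 1.25), resolution=0.1):
    """Rasterize an (n, 4) LiDAR point cloud [x, y, z, intensity] into a
    bird's-eye-view (BEV) grid holding max height and max intensity per cell."""
    x, y, z, r = points[:, 0], points[:, 1], points[:, 2], points[:, 3]
    # Keep only points inside the cropped 3D region of interest.
    mask = ((x >= x_range[0]) & (x < x_range[1]) &
            (y >= y_range[0]) & (y < y_range[1]) &
            (z >= z_range[0]) & (z < z_range[1]))
    x, y, z, r = x[mask], y[mask], z[mask], r[mask]

    # Discretize the ground-plane coordinates into grid indices.
    xi = ((x - x_range[0]) / resolution).astype(np.int32)
    yi = ((y - y_range[0]) / resolution).astype(np.int32)

    h = int((x_range[1] - x_range[0]) / resolution)
    w = int((y_range[1] - y_range[0]) / resolution)
    bev = np.zeros((h, w, 2), dtype=np.float32)
    # Channel 0: max height above z_min per cell; channel 1: max intensity.
    np.maximum.at(bev[:, :, 0], (xi, yi), z - z_range[0])
    np.maximum.at(bev[:, :, 1], (xi, yi), r)
    return bev
```

With these settings a 70 m x 80 m scene becomes a 700 x 800 pseudo-image, so ordinary 2D convolutions can replace costly 3D ones.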
The proposed CrossFusion Net is a 3D object detection network that takes RGB images and point clouds as inputs to make valid use of both cameras and LiDARs. The presented CrossFusion Net is an end-to-end trainable architecture capable of predicting accurate 3D bounding boxes. In addition, the novel CrossFusion layer enables fusion between the two streams of feature maps from different sensors in a cascading way. By projecting all points and pixels to their absolute coordinates in 3D space, feature maps from one sensor can be passed to the other. Thus, the 3D spatial relationship between the two kinds of feature maps is kept in the CrossFusion layer while computationally costly 3D convolution is avoided. The presented network is evaluated on both the 3D detection and BEV detection benchmarks of the popular KITTI on-road dataset. The remaining parts of this paper are organized as follows. Section II introduces related works on RGB image based, point cloud based, and fusion based methods for the 3D detection task. Section III formulates the target task. Section IV proposes the overall architecture of the method. Section V elaborates the details of the proposed components. Section VI presents the experiments on the KITTI road dataset. Finally, Section VII concludes the presented method.
Section snippets
Related works
3D object detection is a crucial part of intelligent transportation systems, and many works focusing on this topic have proposed their own solutions. After reviewing the existing works on 3D object detection, they can be divided into the following three categories according to their inputs.
Problem formulation
The presented deep learning network simultaneously takes both RGB images and point clouds as inputs. An input RGB image can be represented as a set of integer pixel values V, where V = {vij | 1 ≤ i ≤ h, 1 ≤ j ≤ w}, h denotes the height, and w the width of the image. Each element vij in the image is an integer within the range [0, 255]. On the other hand, a point cloud can be parametrized as a set of points PC, where PC = {Ps | s = 1, 2, …, n} and n represents the
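A minimal sketch of the two input representations defined above. The image resolution and point count are hypothetical KITTI-like values, not figures from the paper:

```python
import numpy as np

# Hypothetical sizes; KITTI camera images are roughly 375 x 1242 pixels.
h, w = 375, 1242

# Image V: integer pixel values v_ij in [0, 255], three RGB channels.
V = np.random.randint(0, 256, size=(h, w, 3), dtype=np.uint8)

# Point cloud PC: n points P_s; a KITTI LiDAR sweep gives roughly 100k
# points, each (x, y, z, reflectance).
n = 120_000
PC = np.random.uniform(-1.0, 1.0, size=(n, 4)).astype(np.float32)

# Sanity checks matching the formulation: pixel range and point-set shape.
assert V.min() >= 0 and V.max() <= 255
assert PC.shape == (n, 4)
```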
CrossFusion net
As more and more intelligent vehicles are equipped with both cameras and LiDARs, the CrossFusion Net is proposed to exploit the strengths of these two different sensors. As shown in Fig. 1, the CrossFusion Net takes an RGB image and a point cloud to conduct 3D object detection. Recently, Mono3D [1], Stereo R-CNN [2], Pseudo-LiDAR [4] and SECOND [5] have achieved impressive results on object detection based on RGB image feature maps. In contrast, Simony et al. [9] and Li et al. [24] achieved
Elaboration of CrossFusion layer
In order to fully exploit the potential of the BEV and RGB image features and make them benefit each other, the proposed CrossFusion layer transforms features from one view to the other on the basis of their spatial relationship. The details of the CrossFusion layer are specified in the following subsections.
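One common way to realize such a spatial correspondence between LiDAR and image features is to project each 3D point into the image plane using the camera calibration. The sketch below assumes KITTI-style calibration matrices and is an illustration of the general technique, not the paper's actual implementation:

```python
import numpy as np

def lidar_to_image(points_xyz, P2, R0_rect, Tr_velo_to_cam):
    """Project LiDAR points (n, 3) into pixel coordinates using KITTI-style
    calibration: P2 (3x4 camera projection), R0_rect (3x3 rectification),
    Tr_velo_to_cam (3x4 LiDAR-to-camera transform).

    Returns (n, 2) pixel coordinates and a boolean mask of points in front
    of the camera. Sampling image features at these pixels (and scattering
    them back to the points' BEV cells) is one way a cross-modal fusion
    layer can pass features between the two views."""
    n = points_xyz.shape[0]
    pts_h = np.hstack([points_xyz, np.ones((n, 1))])   # homogeneous (n, 4)
    cam = R0_rect @ (Tr_velo_to_cam @ pts_h.T)         # rectified camera frame (3, n)
    front = cam[2] > 0.1                               # keep points ahead of the camera
    cam_h = np.vstack([cam, np.ones((1, n))])          # (4, n)
    uv = P2 @ cam_h                                    # (3, n)
    uv = uv[:2] / uv[2]                                # perspective divide
    return uv.T, front
```

Because the projection is purely geometric, the 3D spatial relationship between the two feature maps is preserved without any 3D convolution.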
Experimental results of the crossfusion net
The presented network is trained and tested on a personal computer with a single NVIDIA GTX 1080 Ti GPU. The experiments are divided into three parts. First, experiments are conducted on the challenging KITTI dataset. Second, an ablation study evaluates the contribution of each proposed method. Finally, quantitative and qualitative visualization results are demonstrated by projecting the 3D bounding boxes onto 2D images. Moreover, the power and the limitations
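Projecting a predicted 3D bounding box onto the image, as in the visualizations described above, amounts to computing the box's eight corners and then applying the camera projection. A sketch of the corner computation follows; the axis conventions (yaw about the vertical z axis, box given by center, dimensions, heading) are assumptions for illustration, not taken from the paper:

```python
import numpy as np

def box3d_corners(center, dims, yaw):
    """Return the 8 corners (8, 3) of a 3D box from its center (x, y, z),
    dimensions (l, w, h), and heading angle yaw about the vertical axis."""
    l, w, h = dims
    # Corner offsets in the box's own frame, centered at the origin.
    x_c = np.array([1,  1, -1, -1,  1,  1, -1, -1]) * (l / 2)
    y_c = np.array([1, -1, -1,  1,  1, -1, -1,  1]) * (w / 2)
    z_c = np.array([1,  1,  1,  1, -1, -1, -1, -1]) * (h / 2)
    corners = np.stack([x_c, y_c, z_c], axis=1)        # (8, 3)
    # Rotate about the vertical (z) axis, then translate to the box center.
    c, s = np.cos(yaw), np.sin(yaw)
    R = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
    return corners @ R.T + np.asarray(center)
```

Each corner can then be pushed through the camera projection matrix and the corners joined by line segments to draw the box on the image.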
Conclusions
A novel end-to-end trainable fusion-based 3D object detection network, CrossFusion Network, is presented to take both RGB images and point clouds as inputs. Most existing fusion-based methods for 3D object detection do not fully take advantage of the spatial relationship between RGB images and point clouds. In this paper, the developed fusion method, the CrossFusion layer, acts as a bridge between the RGB image feature maps and the BEV feature maps according to their absolute coordinates in
Acknowledgement
This work was partially sponsored by the Ministry of Science and Technology (MOST), Taiwan ROC, under Project 108-2634-F-002-016, 108-2634-F-002-017, 105-2221-E-390-024-MY3 and 108-2221-E-390-019-MY3. This research was also supported in part by the Center for AI & Advanced Robotics, National Taiwan University and the Joint Research Center for AI Technology and All Vista Healthcare under MOST.
Author contributions
Dza-Shiang Hong: Formal analysis, Software, Visualization, Writing.
Hung-Hao Chen: Visualization, Writing, Revising, Review & Editing.
Pei-Yung Hsiao: Investigation, Methodology, Supervision, Project Administration, Funding Acquisition, Visualization, Revising, Review & Editing.
Li-Chen Fu: Conceptualization, Funding Acquisition, Investigation, Methodology, Resources, Supervision, Project Administration, Visualization.
Siang-Min Siao: Formal analysis, Software, Visualization, Revising.
References (31)
- J. Ku et al., Joint 3D proposal generation and object detection from view aggregation
- X. Chen et al., Monocular 3D object detection for autonomous driving
- P. Li et al., Stereo R-CNN based 3D object detection for autonomous driving
- M. Liang et al., Deep continuous fusion for multi-sensor 3D object detection
- Y. Wang et al., Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving
- Y. Yan et al., SECOND: sparsely embedded convolutional detection (2018)
- J. Deng et al., ImageNet: a large-scale hierarchical image database
- Y. Zhou et al., VoxelNet: end-to-end learning for point cloud based 3D object detection
- T.-Y. Lin et al., Microsoft COCO: common objects in context
- M. Simon et al., Complex-YOLO: an Euler-region-proposal for real-time 3D object detection on point clouds
- A. Geiger et al., Are we ready for autonomous driving? The KITTI vision benchmark suite
- J. Redmon et al., You only look once: unified, real-time object detection
- X. Chen et al., Multi-view 3D object detection network for autonomous driving
- R. Girshick, Fast R-CNN
- F. Chabot et al., Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image
Dza-Shiang Hong received the B.S. degree in civil engineering and the M.S. degree in Computer Science and Information Engineering from National Taiwan University, Taipei, Taiwan, in 2015 and 2018, respectively. His research interests include deep learning and computer vision.
Hung-Hao Chen received the B.S. degree from the Department of Computer Science and Engineering, National Cheng Kung University, Tainan, Taiwan, in 2018. He is currently pursuing the M.S. degree with the Department of Computer Science and Engineering, National Taiwan University, Taipei, Taiwan. His research interests include deep learning and computer vision.
Pei-Yung Hsiao (M'90) received the B.S. degree in chemical engineering from Tung Hai University, in 1980 and the M.S. and Ph.D. degrees in electrical engineering from the National Taiwan University, in 1987 and 1990, respectively. In 1990, he was an Associate Professor in the Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan. In 1998, he was the CEO of Aetex Biometric Corporation. He is currently a Professor in the Department of Electrical Engineering, National Univ. of Kaohsiung. His research interests and industrial experiences include VLSI/CAD, image processing, fingerprint recognition, visual detection, embedded systems, and FPGA rapid prototyping.
Li-Chen Fu (M'84-SM'94-F'04) received the B.S. degree from National Taiwan University in 1981, and the M.S. and Ph.D. degrees from the University of California, Berkeley, in 1985 and 1987, respectively. Since 1987, he has been on the faculty of and currently is a professor in both the Department of Electrical Engineering and Department of Computer Science & Information Engineering of National Taiwan University. He is now a senior member of both the Robotics and Automation Society and Automatic Control Society of IEEE, and he became an IEEE Fellow (F) in 2004. His areas of research interest include robotics, FMS scheduling, shop floor control, home automation, visual detection and tracking, E-commerce, and control theory & applications.
Siang-Min Siao received the M.S. and Ph.D. degrees in Electronic Engineering from National Yunlin University of Science & Technology, Yunlin, Taiwan, in 2011 and 2017 respectively. He is presently an engineer in the automotive research & testing center, Taiwan. His research interests include VLSI/CAD, digital circuit design, digital signal process, and algorithm analysis.