CrossFusion net: Deep 3D object detection based on RGB images and point clouds in autonomous driving

https://doi.org/10.1016/j.imavis.2020.103955

Highlights

  • The proposed CrossFusion Net performs 3D object detection from two sensors.

  • The presented attention mechanism generates adaptive weights for two streams of feature maps.

  • The CrossFusion Net outperforms prior work by 1%, 8%, and 3% AP in the easy, moderate, and hard cases, respectively.

  • The inference time of 100 ms is much lower than the 170-360 ms of comparable methods.

Abstract

In recent years, accurate 3D detection has played an important role in many applications, autonomous driving being a typical representative. This paper aims to design an accurate 3D detector that takes both LiDAR point clouds and RGB images as inputs, motivated by the fact that LiDAR and camera each have their own merits. A novel deep end-to-end trainable two-stream architecture, CrossFusion Net, is designed to exploit features from both LiDAR point clouds and RGB images through a hierarchical fusion structure. Specifically, CrossFusion Net uses a bird's eye view (BEV) of the point cloud obtained through projection. The feature maps of the two streams are then fused through the newly introduced CrossFusion (CF) layer. The proposed CF layer transforms the feature maps of one stream to the other based on the spatial relationship between the BEV and the RGB image. Additionally, an attention mechanism is applied to the transformed feature map and the original one to automatically decide the importance of the feature maps from the two sensors. Experiments on the challenging KITTI car 3D detection benchmark and BEV detection benchmark show that the presented approach outperforms other state-of-the-art methods in average precision (AP); in particular, it outperforms UberATG-ContFuse [3] by 8% AP in moderate 3D car detection. Furthermore, the proposed network learns an effective representation of the surroundings via the RGB and BEV feature maps.

Introduction

Thanks to the rapid development of intelligent vehicles, autonomous driving has become a popular research topic in recent years. The most important issue for autonomous driving is to understand the surroundings of the vehicle, and one crucial key is 3D detection. By reasoning about the surroundings of the vehicle in 3D, the system can make correct decisions under various kinds of situations. Nowadays, many intelligent vehicles are equipped with multiple sensors at the same time, such as cameras, LiDARs, and inertial measurement units (IMUs). This motivates researchers to combine different sensors to conduct 3D object detection. In this work, the authors aim to design an accurate and stable 3D detector based on cameras and LiDARs. An effective fusion of RGB images and LiDAR point clouds should be capable of supplying much richer information.

Although 2D object detection has achieved great success on famous datasets, such as ImageNet [6], MS COCO [8], and KITTI [10], 3D object detection remains an open problem because of the additional depth dimension. In most cases, 2D object detectors, such as YOLO [12] and Fast R-CNN [14], take only RGB images as inputs. However, the lack of depth can be a fatal flaw that leads to coarse results in 3D object detection. Hence, we propose a fusion-based network that takes advantage of mature 2D object detection methods. With the presence of LiDAR point clouds, the network is able to learn more representative information. Besides, each sensor has its own merits. Specifically, LiDAR is effective at providing depth information under various weather conditions but struggles with distant details. On the other hand, the camera preserves detailed front-view information but suffers under various weather conditions. The purpose of this work is 3D object detection that exploits both RGB images and point clouds in on-road scenes. Specifically, the study focuses on combining different sensors so that they benefit each other.

Recently, some novel image-based methods explored the use of monocular [1], [15], [16], [17] or stereo [2], [4] images. Images usually provide detailed and dense measurements of the front view. However, these methods are limited by the loss of depth information. On the other hand, LiDAR-based 3D object detection methods developed rapidly afterwards. LiDAR provides far more accurate depth information, enabling effective localization and shape description. Nevertheless, point clouds are unordered and sparse. To deal with this problem, VoxelNet [7] grouped the points into voxel grids, while Simony et al. [9] and Yu et al. [20] projected point clouds onto a plane such as the bird's eye view (BEV) or the front view to avoid the high computational cost of 3D convolution. In addition, PointNet [19] directly processed point clouds by exploiting their permutation invariance. However, LiDAR suffers in distant detection due to its inherent limitations. As for methods fusing RGB images and LiDAR point clouds, ContFuse [3] successfully combined the two streams of feature maps in different combinations of fusion.
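To make this projection concrete, below is a minimal sketch of rasterizing a LiDAR point cloud into a BEV grid, in the spirit of the projection-based methods above. The crop range, cell resolution, and channel layout (max height, max intensity, point density) are illustrative assumptions, not the exact configuration used by CrossFusion Net.

    # Hypothetical BEV rasterization of a LiDAR point cloud (illustrative parameters).
    import numpy as np

    def points_to_bev(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0),
                      z_range=(-2.5, 1.0), resolution=0.1):
        """points: (n, 4) array of (x, y, z, intensity) in the LiDAR frame."""
        x, y, z, intensity = points.T

        # Keep only points inside the crop region.
        mask = ((x >= x_range[0]) & (x < x_range[1]) &
                (y >= y_range[0]) & (y < y_range[1]) &
                (z >= z_range[0]) & (z < z_range[1]))
        x, y, z, intensity = x[mask], y[mask], z[mask], intensity[mask]

        # Discretize (x, y) into grid cells.
        cols = ((x - x_range[0]) / resolution).astype(np.int32)
        rows = ((y - y_range[0]) / resolution).astype(np.int32)
        h = int((y_range[1] - y_range[0]) / resolution)
        w = int((x_range[1] - x_range[0]) / resolution)

        # Three example channels: max height above the crop floor, max intensity, point density.
        bev = np.zeros((h, w, 3), dtype=np.float32)
        np.maximum.at(bev[:, :, 0], (rows, cols), z - z_range[0])
        np.maximum.at(bev[:, :, 1], (rows, cols), intensity)
        np.add.at(bev[:, :, 2], (rows, cols), 1.0)
        bev[:, :, 2] = np.log1p(bev[:, :, 2])  # compress the density channel
        return bev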

The proposed CrossFusion Net is a 3D object detection network that takes RGB images and point clouds as inputs to make effective use of both cameras and LiDARs. The presented CrossFusion Net is an end-to-end trainable architecture capable of predicting accurate 3D bounding boxes. In addition, the novel CrossFusion layer enables the fusion between two streams of feature maps from different sensors in a cascading way. By projecting all points and pixels to their absolute coordinates in 3D space, features from one sensor's stream can be passed to the other's. Thus, the 3D spatial relationship between the two kinds of feature maps is preserved in the CrossFusion layer while computationally expensive 3D convolutions are avoided. The presented network is evaluated on both the 3D detection and the BEV detection benchmarks of the popular KITTI on-road dataset. The remainder of this paper is organized as follows. Section II introduces related works on RGB image based, point cloud based, and fusion based methods for the 3D detection task. Section III formulates the target task. Section IV presents the overall architecture of the method. Section V elaborates the details of the proposed components. Section VI presents the experiments on the KITTI road dataset. Finally, Section VII concludes the presented method.
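As a rough illustration of the attention-based weighting mentioned in the abstract, the sketch below fuses a stream's own feature map with the feature map transformed from the other stream using per-location weights predicted by a small gating branch. The layer sizes and the softmax gate are assumptions made for illustration, not the paper's exact design.

    # Hypothetical attention-style fusion of two aligned feature maps (PyTorch).
    import torch
    import torch.nn as nn

    class AttentionFusion(nn.Module):
        def __init__(self, channels):
            super().__init__()
            # Gate conditioned on both streams; predicts one weight map per stream.
            self.gate = nn.Sequential(
                nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1),
                nn.ReLU(inplace=True),
                nn.Conv2d(channels, 2, kernel_size=1),
                nn.Softmax(dim=1),
            )

        def forward(self, own_feat, transformed_feat):
            # own_feat, transformed_feat: (B, C, H, W), already in the same view.
            w = self.gate(torch.cat([own_feat, transformed_feat], dim=1))
            return w[:, 0:1] * own_feat + w[:, 1:2] * transformed_feat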

Section snippets

Related works

3D object detection is a crucial part of intelligent transportation systems, and many works focusing on this topic have come up with their own solutions. After reviewing the existing works on 3D object detection, we basically divide them into the following three categories according to their inputs.

Problem formulation

The presented deep learning network simultaneously takes both RGB images and point clouds as inputs. An input RGB image can be represented as a set of integer pixel values V, where V = {vij | 1 ≤ i ≤ h, 1 ≤ j ≤ w}, and h and w denote the height and width of the image, respectively. Each element vij in the image is an integer within the range [0, 255]. On the other hand, a point cloud can be parametrized as a set of points PC, where PC = {Ps | s = 1, 2, …, n} and n represents the
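For concreteness, a minimal sketch of these two inputs as arrays, assuming the usual KITTI data layout (sizes and values are illustrative):

    import numpy as np

    h, w = 375, 1242                         # a typical KITTI image size (illustrative)
    V = np.zeros((h, w, 3), dtype=np.uint8)  # RGB image; every entry vij is an integer in [0, 255]

    n = 120000                               # the number of LiDAR returns varies per scan
    PC = np.zeros((n, 4), dtype=np.float32)  # each point Ps = (x, y, z, intensity)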

CrossFusion net

As more and more intelligent vehicles are equipped with both cameras and LiDARs, the CrossFusion Net is proposed to exploit the strengths of these two different sensors. As shown in Fig. 1, the CrossFusion Net takes an RGB image and a point cloud to conduct 3D object detection. Recently, Mono3D [1], Stereo R-CNN [2], Pseudo-LiDAR [4] and SECOND [5] have achieved impressive results on 3D object detection based on RGB image feature maps. In contrast, Simony et al. [9] and Li et al. [24] achieved

Elaboration of CrossFusion layer

In order to fully exploit the potential of the BEV and RGB image features and make them benefit each other, the proposed CrossFusion layer transforms the features from one view to the other on the basis of their spatial relationship. In the following subsections, the details of the CrossFusion layer are specified.
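A minimal sketch of the spatial correspondence the CrossFusion layer relies on, assuming KITTI-style calibration matrices: each BEV cell center is lifted to a 3D point in the LiDAR frame, projected into the image plane, and used to bilinearly sample the RGB feature map so that image features can be brought into the BEV stream (the reverse direction follows the same geometry). The grid parameters, the fixed sampling height, and the assumption that the feature map is at full image resolution are illustrative, not the paper's exact setup.

    # Hypothetical BEV-to-image feature transfer via calibration (PyTorch).
    import torch
    import torch.nn.functional as F

    def bev_to_image_features(rgb_feat, P2, Tr_velo_to_cam, bev_h, bev_w,
                              x_range=(0.0, 70.0), y_range=(-40.0, 40.0), z=0.0):
        """rgb_feat: (1, C, Hi, Wi) image feature map (assumed at image resolution);
        P2, Tr_velo_to_cam: (3, 4) float tensors from the KITTI calibration file."""
        device = rgb_feat.device
        # 3D centers of all BEV cells at a fixed height z, in the LiDAR frame.
        xs = torch.linspace(x_range[0], x_range[1], bev_w, device=device)
        ys = torch.linspace(y_range[0], y_range[1], bev_h, device=device)
        yy, xx = torch.meshgrid(ys, xs, indexing="ij")
        pts = torch.stack([xx, yy, torch.full_like(xx, z), torch.ones_like(xx)], dim=-1)

        # LiDAR -> camera -> image projection in homogeneous coordinates.
        cam = pts.reshape(-1, 4) @ Tr_velo_to_cam.T                 # (N, 3)
        cam = torch.cat([cam, torch.ones_like(cam[:, :1])], dim=1)  # (N, 4)
        img = cam @ P2.T                                            # (N, 3)
        u = img[:, 0] / img[:, 2].clamp(min=1e-6)  # divide u, v by the stride if rgb_feat is downsampled
        v = img[:, 1] / img[:, 2].clamp(min=1e-6)

        # Normalize pixel coordinates to [-1, 1] and bilinearly sample the RGB feature map.
        _, _, Hi, Wi = rgb_feat.shape
        grid = torch.stack([2 * u / (Wi - 1) - 1, 2 * v / (Hi - 1) - 1], dim=-1)
        grid = grid.reshape(1, bev_h, bev_w, 2)
        return F.grid_sample(rgb_feat, grid, align_corners=True)    # (1, C, bev_h, bev_w)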

Experimental results of the crossfusion net

The presented network is trained and tested on a personal computer with a single NVIDIA GTX 1080 Ti GPU. The experiments are divided into three parts. First, experiments are conducted on the challenging KITTI dataset. Second, an ablation study is given to evaluate the contribution of each proposed method. Finally, quantitative and qualitative visualization results are demonstrated by projecting the 3D bounding boxes onto 2D images. Moreover, the power and the limitations

Conclusions

A novel end-to-end trainable fusion-based 3D object detection network, the CrossFusion Net, is presented to take both RGB images and point clouds as inputs. Most of the existing fusion-based methods for 3D object detection do not fully take advantage of the spatial relationship between RGB images and point clouds. In this paper, the developed fusion method, the CrossFusion layer, acts as a bridge between the RGB image feature maps and the BEV feature maps according to their absolute coordinates in

Acknowledgement

This work was partially sponsored by the Ministry of Science and Technology (MOST), Taiwan, R.O.C., under Projects 108-2634-F-002-016, 108-2634-F-002-017, 105-2221-E-390-024-MY3, and 108-2221-E-390-019-MY3. This research was also supported in part by the Center for AI & Advanced Robotics, National Taiwan University, and the Joint Research Center for AI Technology and All Vista Healthcare under MOST.

Author contributions

Dza-Shiang Hong: Formal analysis, Software, Visualization, Writing.

Hung-Hao Chen: Visualization, Writing, Revising, Review & Editing.

Pei-Yung Hsiao: Investigation, Methodology, Supervision, Project Administration, Funding Acquisition, Visualization, Revising, Review & Editing.

Li-Chen Fu: Conceptualization, Funding Acquisition, Investigation, Methodology, Resources, Supervision, Project Administration, Visualization.

Siang-Min Siao: Formal analysis, Software, Visualization, Revising.

References (31)

  • J. Ku et al.

    Joint 3D proposal generation and object detection from view aggregation

  • X. Chen et al.

    Monocular 3D object detection for autonomous driving

  • P. Li et al.

    Stereo R-CNN based 3D object detection for autonomous driving

  • M. Liang et al.

    Deep continuous fusion for multi-sensor 3D object detection

  • Y. Wang et al.

    Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving

  • Y. Yan et al.

    SECOND: sparsely embedded convolutional detection

    (2018)

  • J. Deng et al.

    ImageNet: a large-scale hierarchical image database

  • Y. Zhou et al.

    VoxelNet: end-to-end learning for point cloud based 3D object detection

  • T.-Y. Lin et al.

    Microsoft COCO: common objects in context

  • M. Simony et al.

    Complex-YOLO: an Euler-region-proposal for real-time 3D object detection on point clouds

  • A. Geiger et al.

    Are we ready for autonomous driving? The KITTI vision benchmark suite

  • J. Redmon et al.

    You only look once: unified, real-time object detection

  • X. Chen et al.

    Multi-view 3D object detection network for autonomous driving

  • R. Girshick

    Fast R-CNN

  • F. Chabot et al.

    Deep MANTA: a coarse-to-fine many-task network for joint 2D and 3D vehicle analysis from monocular image

    Dza-Shiang Hong received the B.S. degree in civil engineering and the M.S. degree in Computer Science and Information Engineering from National Taiwan University, Taipei, Taiwan, in 2015 and 2018, respectively. His research interests include deep learning and computer vision.

    Hung-Hao Chen received the B.S. degree from the Department of Computer Science and Engineering, National Cheng Kung University, Tainan, Taiwan, in 2018. He is currently pursuing the M.S. degree with the Department of Computer Science and Engineering, National Taiwan University, Taipei, Taiwan. His research interests include deep learning and computer vision.

    Pei-Yung Hsiao (M'90) received the B.S. degree in chemical engineering from Tung Hai University in 1980 and the M.S. and Ph.D. degrees in electrical engineering from National Taiwan University in 1987 and 1990, respectively. In 1990, he was an Associate Professor in the Department of Computer and Information Science, National Chiao Tung University, Hsinchu, Taiwan. In 1998, he was the CEO of Aetex Biometric Corporation. He is currently a Professor in the Department of Electrical Engineering, National University of Kaohsiung. His research interests and industrial experience include VLSI/CAD, image processing, fingerprint recognition, visual detection, embedded systems, and FPGA rapid prototyping.

    Li-Chen Fu (M'84-SM'94-F'04) received the B.S. degree from National Taiwan University in 1981, and the M.S. and Ph.D. degrees from the University of California, Berkeley, in 1985 and 1987, respectively. Since 1987, he has been on the faculty of National Taiwan University, where he is currently a professor in both the Department of Electrical Engineering and the Department of Computer Science & Information Engineering. He is a senior member of both the Robotics and Automation Society and the Automatic Control Society of IEEE, and he became an IEEE Fellow in 2004. His areas of research interest include robotics, FMS scheduling, shop floor control, home automation, visual detection and tracking, e-commerce, and control theory & applications.

    Siang-Min Siao received the M.S. and Ph.D. degrees in Electronic Engineering from National Yunlin University of Science & Technology, Yunlin, Taiwan, in 2011 and 2017, respectively. He is presently an engineer at the Automotive Research & Testing Center, Taiwan. His research interests include VLSI/CAD, digital circuit design, digital signal processing, and algorithm analysis.
