Abstract
3D object detection is an essential task in autonomous driving and virtual reality. Existing approaches largely rely on expensive LiDAR sensors for accurate depth information to achieve high performance. Although much lower-cost stereo cameras have been introduced as a promising alternative, a notable performance gap remains. In this paper, we explore leveraging sparse LiDAR and stereo images obtained by low-cost sensors for 3D object detection. We propose a novel multi-modal attention fusion framework, trained end to end, that effectively integrates the complementary strengths of sparse LiDAR and stereo images. Instead of directly fusing the LiDAR and stereo modalities, we introduce a deep attention feature fusion module, which enables interactions between intermediate layers of the LiDAR and stereo image paths by exploring the interdependencies of channel features. The fused features connect upsampled higher-layer features with lower-layer features from the stereo image pathway and the sparse LiDAR pathway. Hence, the fused features carry high-level semantics at higher resolution, which benefits the subsequent object detection network. We provide detailed experiments on the KITTI benchmark and achieve state-of-the-art performance among methods based on low-cost sensors.
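The fusion idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the module's layers, so the sketch assumes a squeeze-and-excitation style channel gate, nearest-neighbour upsampling, and randomly initialised weights. All function names, shapes, and parameters (`upsample_nearest`, `channel_attention`, `fuse`, `w1`, `w2`) are hypothetical.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def upsample_nearest(x, factor):
    """Nearest-neighbour upsampling of a (C, H, W) feature map."""
    return x.repeat(factor, axis=1).repeat(factor, axis=2)


def channel_attention(x, w1, w2):
    """Re-weight channels of a (C, H, W) map with a gate in (0, 1).

    Assumed squeeze-and-excitation style gate: global average pooling,
    a bottleneck layer with ReLU, then a sigmoid per channel.
    """
    squeezed = x.mean(axis=(1, 2))            # (C,)
    hidden = np.maximum(w1 @ squeezed, 0.0)   # (C_reduced,)
    weights = sigmoid(w2 @ hidden)            # (C,)
    return x * weights[:, None, None]


def fuse(lidar_low, stereo_low, high, w1, w2, factor=2):
    """Concatenate low-level LiDAR and stereo features with upsampled
    high-level features, then apply channel attention to the stack."""
    high_up = upsample_nearest(high, factor)
    stacked = np.concatenate([lidar_low, stereo_low, high_up], axis=0)
    return channel_attention(stacked, w1, w2)


# Toy inputs: shapes are illustrative assumptions only.
rng = np.random.default_rng(0)
lidar_low = rng.standard_normal((4, 8, 8))
stereo_low = rng.standard_normal((4, 8, 8))
high = rng.standard_normal((4, 4, 4))
w1 = rng.standard_normal((3, 12))   # bottleneck: 12 channels -> 3
w2 = rng.standard_normal((12, 3))   # expand: 3 -> 12 channels

fused = fuse(lidar_low, stereo_low, high, w1, w2)
print(fused.shape)
```

In this sketch the fused map keeps the spatial resolution of the lower-layer features while each channel, including those taken from the upsampled higher layer, is scaled by a learned-style gate computed from all channels jointly.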
This work was supported by the National Natural Science Foundation of China under Grants 61801414 and 62072391, and by the Natural Science Foundation of Shandong Province under Grant ZR2020QF108.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Yan, W., Su, K., Ren, J., Cong, R., Li, S., Wang, S. (2022). Sparse LiDAR and Binocular Stereo Fusion Network for 3D Object Detection. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_4
DOI: https://doi.org/10.1007/978-3-031-18913-5_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-18912-8
Online ISBN: 978-3-031-18913-5
eBook Packages: Computer Science, Computer Science (R0)