Abstract
Visual-based 3D detection is drawing a lot of attention recently. Despite the best efforts from the computer vision researchers visual-based 3D detection remains a largely unsolved problem. This is primarily due to the lack of accurate depth perception provided by LiDAR sensors. Previous works struggle to fuse 3D spatial information and the RGB image effectively. In this paper, we propose a novel monocular 3D detection framework to address this problem. Specifically, we propose to primary contributions: (i) We design an Adaptive Depth-guided Instance Normalization layer to leverage depth features to guide RGB features for high quality estimation of 3D properties. (ii) We introduce a Dynamic Depth Transformation module to better recover accurate depth according to semantic context learning and thus facilitate the removal of depth ambiguities that exist in the RGB image. Experiments show that our approach achieves state-of-the-art on KITTI 3D detection benchmark among current monocular 3D detection works.
E. Ouyang, L. Zhang—Authors contributed equally to this paper.
This work was supported in part by NSFC Projects (U62076067), Science and Technology Commission of Shanghai Municipality Projects (19511120700, 19ZR1471800).
This is a preview of subscription content, log in via an institution.
Buying options
Tax calculation will be finalised at checkout
Purchases are for personal use only
Learn about institutional subscriptionsReferences
Qi, C.R., Su, H., Mo, K., Guibas, L.J.: PointNet: deep learning on point sets for 3D classification and segmentation. In: CVPR (2017)
Qi, C.R., Yi, L., Su, H., Guibas, L.J.: PointNet++: deep hierarchical feature learning on point sets in a metric space. In: NeurIPS (2017)
Zhou, Y., Tuzel, O.: VoxelNet: end-to-end learning for point cloud based 3D object detection. In: CVPR (2018)
Qi, C.R., Liu, W., Wu, C., Su, H., Guibas, L.J.: Frustum PointNets for 3D object detection from RGB-D data. In: CVPR (2018)
Shi, S., Wang, X., Li, H.: PointRCNN: 3D object proposal generation and detection from point cloud. In: CVPR (2019)
Chen, X., Ma, H., Wan, J., Li, B., Xia, T.: Multi-view 3D object detection network for autonomous driving. In: CVPR (2017)
Liang, M., Yang, B., Wang, S., Urtasun, R.: Deep continuous fusion for multi-sensor 3D object detection. In: ECCV (2018)
Yang, B., Luo, W., Urtasun, R.: Pixor: real-time 3D object detection from point clouds. In: CVPR (2018)
Chen, X., Kundu, K., Zhang, Z., Ma, H., Fidler, S., Urtasun, R.: Monocular 3D object detection for autonomous driving. In: CVPR (2016)
Mousavian, A., Anguelov, D., Flynn, J., Kosecka, J.: 3D bounding box estimation using deep learning and geometry. In: CVPR (2017)
Xu, B., Chen, Z.: Multi-level fusion based 3D object detection from monocular images. In: CVPR (2018)
Zhang, L., Li, X., Arnab, A., Yang, K., Tong, Y., Torr, P.H.: Dual graph convolutional network for semantic segmentation. In: BMVC (2019)
Zhang, L., Xu, D., Arnab, A., Torr, P.H.: Dynamic graph message passing networks. In: CVPR (2020)
Hou, Q., Zhang, L., Cheng, M.M., Feng, J.: Strip pooling: rethinking spatial pooling for scene parsing. In: CVPR (2020)
Li, X., Zhang, L., You, A., Yang, M., Yang, K., Tong, Y.: Global aggregation then local distribution in fully convolutional networks. In: BMVC (2019)
Li, X., et al.: Improving semantic segmentation via decoupled body and edge supervision. In: ECCV (2020)
Wang, Q., Zhang, L., Bertinetto, L., Hu, W., Torr, P.H.: Fast online object tracking and segmentation: a unifying approach. In: CVPR (2019)
Zhu, F., Zhang, L., Fu, Y., Guo, G., Xie, W.: Self-supervised video object segmentation. arXiv preprint (2020)
Wang, Y., Chao, W.L., Garg, D., Hariharan, B., Campbell, M., Weinberger, K.: Pseudo-LiDAR from visual depth estimation: bridging the gap in 3D object detection for autonomous driving. In: CVPR (2019)
Ma, X., Wang, Z., Li, H., Zhang, P., Ouyang, W., Fan, X.: Accurate monocular 3D object detection via color-embedded 3D reconstruction for autonomous driving. In: ICCV (2019)
Chen, X., et al.: 3D object proposals for accurate object class detection. In: NeurIPS (2015)
Li, P., Chen, X., Shen, S.: Stereo R-CNN based 3D object detection for autonomous driving. In: CVPR (2019)
Fu, H., Gong, M., Wang, C., Batmanghelich, K., Tao, D.: Deep ordinal regression network for monocular depth estimation. In: CVPR (2018)
Geiger, A., Lenz, P., Urtasun, R.: Are we ready for autonomous driving? The kitti vision benchmark suite. In: CVPR (2012)
Li, B., Ouyang, W., Sheng, L., Zeng, X., Wang, X.: GS3D: an efficient 3D object detection framework for autonomous driving. In: CVPR (2019)
Brazil, G., Liu, X.: M3D-RPN: monocular 3D region proposal network for object detection. In: ICCV (2019)
Manhardt, F., Kehl, W., Gaidon, A.: ROI-10D: monocular lifting of 2D detection to 6D pose and metric shape. In: CVPR (2019)
Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: CVPR (2017)
Park, T., Liu, M.Y., Wang, T.C., Zhu, J.Y.: Semantic image synthesis with spatially-adaptive normalization. In: CVPR (2019)
Dai, J., et al.: Deformable convolutional networks. In: CVPR (2017)
Newell, A., Yang, K., Deng, J.: Stacked hourglass networks for human pose estimation. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 483–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_29
Wei, S.E., Ramakrishna, V., Kanade, T., Sheikh, Y.: Convolutional pose machines. In: CVPR (2016)
Zhou, X., Wang, D., Krähenbühl, P.: Objects as points. In: arXiv preprint (2019)
Lin, T.Y., Goyal, P., Girshick, R., He, K., Dollár, P.: Focal loss for dense object detection. In: CVPR (2017)
Yu, F., Wang, D., Shelhamer, E., Darrell, T.: Deep layer aggregation. In: CVPR (2018)
Qin, Z., Wang, J., Lu, Y.: MonoGRNet: a geometric reasoning network for monocular 3D object localization. In: AAAI (2019)
Simonelli, A., Bulo, S.R., Porzi, L., López-Antequera, M., Kontschieder, P.: Disentangling monocular 3D object detection. In: CVPR (2019)
Brazil, G., Pons-Moll, G., Liu, X., Schiele, B.: Kinematic 3D object detection in monocular video. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12368, pp. 135–152. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58592-1_9
Liu, L., Lu, J., Xu, C., Tian, Q., Zhou, J.: Deep fitting degree scoring network for monocular 3D object detection. In: CVPR (2019)
Ku, J., Pon, A.D., Waslander, S.L.: Monocular 3D object detection leveraging accurate proposals and shape reconstruction. In: CVPR (2019)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Ouyang, E., Zhang, L., Chen, M., Arnab, A., Fu, Y. (2021). Dynamic Depth Fusion and Transformation for Monocular 3D Object Detection. In: Ishikawa, H., Liu, CL., Pajdla, T., Shi, J. (eds) Computer Vision – ACCV 2020. ACCV 2020. Lecture Notes in Computer Science(), vol 12622. Springer, Cham. https://doi.org/10.1007/978-3-030-69525-5_21
Download citation
DOI: https://doi.org/10.1007/978-3-030-69525-5_21
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-69524-8
Online ISBN: 978-3-030-69525-5
eBook Packages: Computer ScienceComputer Science (R0)