Leveraging front and side cues for occlusion handling in monocular 3D object detection

Song, Yuying; Li, Zecheng; Wu, Jingxuan; Song, Chunyi; Xu, Zhiwei

doi:10.1007/s00371-023-02884-0

Original article
Published: 21 June 2023

Volume 40, pages 1757–1773, (2024)

The Visual Computer

Yuying Song¹ (equal contribution),
Zecheng Li¹ (equal contribution),
Jingxuan Wu¹,
Chunyi Song¹,²,³ (ORCID: orcid.org/0000-0002-3274-6806) &
Zhiwei Xu¹,²,³

Abstract

3D object detection is an essential part of perception and plays a principal role in autonomous driving systems. Cost-competitive monocular 3D object detection has drawn increasing attention recently. However, it still suffers from inferior accuracy, especially for occluded objects, because of the limited camera view. Inspired by compositional models, in which an object is represented as a combination of multiple components, this paper proposes a new monocular 3D object detection method that reduces the impact of occlusion by using an object's front and side cues. To do this, features are extracted from a decoupled front and side representation and then fused by an attention-based module to obtain a more consistent feature distribution. An uncertainty-guided depth ensemble based on geometry is further applied to refine the depth prediction. Experimental results demonstrate that, compared with conventional methods, the proposed method significantly improves detection performance for occluded objects while still satisfying real-time efficiency. On the KITTI benchmark, the Average Precision at 40 recall positions (AP40) increases by 10.23% for partly occluded objects and by 12.22% for mostly occluded objects. The code is released at https://github.com/kagurua/Front-Side-Det
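
The abstract does not state how the depth candidates are combined. As an illustration only, the sketch below fuses several depth estimates with inverse-variance weighting, a standard way to let low-uncertainty estimates dominate. The function name `fuse_depths`, its inputs, and the weighting rule are assumptions made for this example and are not taken from the paper or its released code.

```python
from typing import Sequence, Tuple


def fuse_depths(depths: Sequence[float],
                sigmas: Sequence[float]) -> Tuple[float, float]:
    """Fuse depth candidates by inverse-variance weighting (illustrative).

    depths: candidate depths in metres, e.g. one directly regressed value
            and several geometry-derived values (hypothetical inputs).
    sigmas: predicted standard deviation of each candidate, same order.
    Returns the fused depth and the standard deviation of the fused
    estimate, which holds only if the candidate errors are independent.
    """
    if len(depths) != len(sigmas) or len(depths) == 0:
        raise ValueError("depths and sigmas must be non-empty and equally long")
    if any(s <= 0 for s in sigmas):
        raise ValueError("every sigma must be positive")

    # Weight each candidate by 1 / sigma^2: the more certain, the heavier.
    weights = [1.0 / (s * s) for s in sigmas]
    total = sum(weights)

    fused = sum(w * d for w, d in zip(weights, depths)) / total
    fused_sigma = (1.0 / total) ** 0.5
    return fused, fused_sigma


if __name__ == "__main__":
    # A confident estimate (sigma = 1 m) and an uncertain one (sigma = 3 m).
    depth, sigma = fuse_depths([10.0, 20.0], [1.0, 3.0])
    print(f"fused depth = {depth:.2f} m, sigma = {sigma:.2f} m")
```

In this example the fused depth stays close to the confident candidate. Other weighting rules, such as weights proportional to 1/sigma, follow the same pattern and differ only in the line that computes `weights`.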

Availability of data and materials

The KITTI dataset is available online. The self-built dataset will be supplied upon reasonable request.

Code availability

The code is available at https://github.com/kagurua/Front-Side-Det

Acknowledgements

The authors are grateful for the financial assistance provided by the project of the Donghai Laboratory under Grant DH-2022ZY0002.

Funding

This work is supported in part by the project of the Donghai Laboratory under Grant DH-2022ZY0002.

Author information

Yuying Song and Zecheng Li have contributed equally to this work.

Authors and Affiliations

The Institute of Marine Electronic and Intelligent System, Ocean College, Zhejiang University, Dinghai District, Zhoushan, 316036, Zhejiang, China
Yuying Song, Zecheng Li, Jingxuan Wu, Chunyi Song & Zhiwei Xu
The Engineering Research Center of Oceanic Sensing Technology and Equipment, Ministry of Education, Dinghai District, Zhoushan, 316036, Zhejiang, China
Chunyi Song & Zhiwei Xu
The Donghai Laboratory, Dinghai District, Zhoushan, 316036, Zhejiang, China
Chunyi Song & Zhiwei Xu

Contributions

YS and ZL conceived and designed the study and performed the research; JW contributed to refining the ideas and carrying out additional analyses; YS was the major contributor in writing the manuscript; CS and ZX discussed the results and revised the manuscript. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Chunyi Song.

Ethics declarations

Conflict of interest

The authors declare that they have no competing interests.

Ethics approval

Not applicable.

Informed consent

Not applicable.

Research involving human participants and/or animals

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Below is the link to the electronic supplementary material.

Supplementary file1 (AVI 7630 kb)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Song, Y., Li, Z., Wu, J. et al. Leveraging front and side cues for occlusion handling in monocular 3D object detection. Vis Comput 40, 1757–1773 (2024). https://doi.org/10.1007/s00371-023-02884-0

Accepted: 19 April 2023
Published: 21 June 2023
Issue Date: March 2024
DOI: https://doi.org/10.1007/s00371-023-02884-0
