
Based on real and virtual datasets adaptive joint training in multi-modal networks with applications in monocular 3D target detection

  • Original Article, published in The Visual Computer

Abstract

Semantic segmentation is important for accurate target detection, but semantic labels are difficult to obtain for real driving scenes, whereas they are easy to generate in virtual datasets. This paper therefore presents an adaptive joint training strategy based on real and virtual datasets: (1) multi-modal fusion networks are built from image, depth, and semantic information; (2) real and virtual datasets are trained jointly with shared semantic information, and an adaptive optimizer is provided. A monocular detection network trained with this strategy is substantially more effective than the conventional network.
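The paper's implementation is not reproduced on this page. As a rough illustration only, the sketch below shows one way an adaptive joint training schedule over real and virtual batches could look: each optimizer step sees a batch from both domains, and the learning rate is decayed when the joint loss plateaus. All names (`adaptive_joint_training`, `real_batches`, `virtual_batches`, the decay rule) are hypothetical and not taken from the paper.

```python
def adaptive_joint_training(real_batches, virtual_batches, loss_fn,
                            lr=1e-3, decay=0.5, patience=2):
    """Hypothetical sketch: alternate real/virtual batches and decay the
    learning rate when the running joint loss stops improving."""
    best, stale, history = float("inf"), 0, []
    # Interleave the two sources so every step sees both domains,
    # mimicking shared semantic supervision across datasets.
    for real_b, virt_b in zip(real_batches, virtual_batches):
        loss = loss_fn(real_b) + loss_fn(virt_b)
        history.append((loss, lr))
        if loss < best - 1e-6:
            best, stale = loss, 0
        else:
            stale += 1
            if stale >= patience:  # plateau detected: adapt the step size
                lr *= decay
                stale = 0
    return history
```

In a real system the plateau-based decay would typically be replaced by the framework's own scheduler (e.g. a reduce-on-plateau policy), but the control flow above captures the general idea of adapting the optimizer during joint training.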


Data availability

The data that support the findings of this study are available from the corresponding author upon reasonable request.


Funding

This work was supported by the Fundamental Research Funds for the Central Universities (PA2021KCPY0041), the Innovation Project of New Energy Vehicle and Intelligent Connected Vehicle of Anhui Province, and the University Synergy Innovation Program of Anhui Province (GXXT-2020-076). The authors thank the anonymous reviewers for their instructive comments.

Author information

Correspondence to Teng Cheng.

Ethics declarations

Conflict of interest

The authors certify that there is no conflict of interest with any individual/organization for the present work.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Cheng, T., Sun, L., Zhang, J. et al. Based on real and virtual datasets adaptive joint training in multi-modal networks with applications in monocular 3D target detection. Vis Comput 39, 6367–6377 (2023). https://doi.org/10.1007/s00371-022-02734-5
