Unsupervised Pre-training for 3D Object Detection with Transformer

  • Conference paper
Pattern Recognition and Computer Vision (PRCV 2022)

Part of the book series: Lecture Notes in Computer Science ((LNCS,volume 13536))

Abstract

Transformers improve the performance of 3D object detection with few hyperparameters. Inspired by the recent success of Transformer pre-training in 2D object detection and natural language processing, we propose a pretext task named random block detection to pre-train 3DETR without supervision (UP3DETR). Specifically, we sample random blocks from the original point clouds and feed them into the Transformer decoder. The whole Transformer is then trained to detect the locations of these blocks. This pretext task pre-trains the Transformer-based 3D object detector without any manual annotations. In our experiments, UP3DETR performs 6.2% better than the 3DETR baseline on the challenging ScanNetV2 dataset and converges faster on object detection tasks.
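The data side of the random block detection pretext task can be illustrated with a minimal sketch. The abstract does not say how blocks are shaped or sized, so this example assumes axis-aligned cubic crops around randomly chosen seed points; the function name `sample_random_blocks` and its parameters are hypothetical and are not taken from the paper. Each sampled block is paired with the tight bounding box of its points, which would serve as the annotation-free localization target.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float, float]
Box = Tuple[Point, Point]  # (center, size) of an axis-aligned box


def sample_random_blocks(
    points: List[Point],
    num_blocks: int,
    block_size: float,
    rng: random.Random,
) -> List[Tuple[List[Point], Box]]:
    """Crop cubic blocks around random seed points (an assumed sampling scheme).

    Returns one (block_points, box) pair per block, where box is the tight
    axis-aligned bounding box of the cropped points. The box is derived from
    the data itself, so no manual labels are involved.
    """
    half = block_size / 2.0
    blocks = []
    for _ in range(num_blocks):
        seed = rng.choice(points)
        # Keep every point that lies inside the cube centered on the seed.
        inside = [
            p for p in points
            if all(abs(p[i] - seed[i]) <= half for i in range(3))
        ]
        lo = tuple(min(p[i] for p in inside) for i in range(3))
        hi = tuple(max(p[i] for p in inside) for i in range(3))
        center = tuple((lo[i] + hi[i]) / 2.0 for i in range(3))
        size = tuple(hi[i] - lo[i] for i in range(3))
        blocks.append((inside, (center, size)))
    return blocks


# Usage on a synthetic point cloud (a stand-in for a real scan).
rng = random.Random(0)
cloud = [
    (rng.uniform(0.0, 4.0), rng.uniform(0.0, 4.0), rng.uniform(0.0, 2.0))
    for _ in range(2000)
]
blocks = sample_random_blocks(cloud, num_blocks=4, block_size=1.0, rng=rng)
for block_points, (center, size) in blocks:
    print(len(block_points), center, size)
```

In a full pipeline the cropped points would be encoded and passed to the decoder as queries, and the predicted boxes would be compared with these targets; that part depends on details of 3DETR not given in the abstract and is omitted here.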

Acknowledgments

This work was supported by the pre-research project of the Equipment Development Department of the Central Military Commission (No. 31514020205).

Author information

Correspondence to Yazhou Yao.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Sun, M., Huang, X., Sun, Z., Wang, Q., Yao, Y. (2022). Unsupervised Pre-training for 3D Object Detection with Transformer. In: Yu, S., et al. Pattern Recognition and Computer Vision. PRCV 2022. Lecture Notes in Computer Science, vol 13536. Springer, Cham. https://doi.org/10.1007/978-3-031-18913-5_7

  • DOI: https://doi.org/10.1007/978-3-031-18913-5_7

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-18912-8

  • Online ISBN: 978-3-031-18913-5

  • eBook Packages: Computer Science, Computer Science (R0)
