
Improving Zero-Shot Template-Based 6D Pose Estimation with Geometric Features

  • Conference paper
Advances in Visual Computing (ISVC 2024)

Abstract

6D object pose estimation is a fundamental problem in robotics and augmented reality. Most state-of-the-art approaches rely on deep learning and require large sets of training images depicting the target objects. A growing number of algorithms instead try to generalize from a set of objects available at training time to unseen objects at test time. Among these, GigaPose is a template-based approach that renders the target object during an onboarding phase shortly before inference and matches learned latent codes of these renderings against those of the observed objects. While learned representations prove powerful across a wide range of tasks, we propose integrating additional purely geometric features, which can be extracted essentially for free from the available 3D meshes during the onboarding phase. This representation then serves as an additional input for template matching and 2D-2D correspondence matching in our approach. We consider multiple relevant features and, implementing one of them, demonstrate improved performance on the core datasets of the BOP Challenge. Our results suggest that utilizing additional geometric features can indeed improve the relevant metrics at little additional cost.
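The abstract does not specify which geometric feature is implemented, but the general idea of extracting per-vertex geometry "for free" from a mesh during onboarding can be sketched with one common candidate: the discrete angle deficit, a standard proxy for Gaussian curvature. The function name and the toy tetrahedron below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def vertex_angle_deficit(vertices, faces):
    """Discrete Gaussian-curvature proxy: 2*pi minus the sum of interior
    angles of all triangles incident to each vertex. Purely geometric,
    computable once per mesh during an onboarding phase."""
    deficit = np.full(len(vertices), 2.0 * np.pi)
    for tri in faces:
        for i in range(3):
            # Interior angle at vertex tri[i] inside this triangle.
            p = vertices[tri[i]]
            a = vertices[tri[(i + 1) % 3]] - p
            b = vertices[tri[(i + 2) % 3]] - p
            cos_t = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
            deficit[tri[i]] -= np.arccos(np.clip(cos_t, -1.0, 1.0))
    return deficit

# Regular tetrahedron: three 60-degree angles meet at each vertex,
# so every vertex gets a deficit of pi (total 4*pi, per Gauss-Bonnet).
verts = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], float)
tris = np.array([[0, 1, 2], [0, 3, 1], [0, 2, 3], [1, 3, 2]])

deficit = vertex_angle_deficit(verts, tris)
print(deficit)  # -> [pi, pi, pi, pi] (approx. 3.1416 each)
```

In a pipeline like the one described, such a per-vertex quantity could be baked into the template renderings as an extra channel and matched alongside the learned latent codes; the paper itself evaluates one concrete feature choice on the BOP core datasets.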



Acknowledgement

This research has received funding from the European Union’s Horizon Europe programme in the course of the ZDZW project under grant agreement No 101057404. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.

Author information

Corresponding author

Correspondence to Thomas Pöllabauer.

Ethics declarations

Disclosure of Interests

The authors have no competing interests to declare that are relevant to the content of this article.


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Pöllabauer, T., Weyel, J., Knauthe, V., Berkei, S., Kuijper, A. (2025). Improving Zero-Shot Template-Based 6D Pose Estimation with Geometric Features. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2024. Lecture Notes in Computer Science, vol 15046. Springer, Cham. https://doi.org/10.1007/978-3-031-77392-1_4

  • DOI: https://doi.org/10.1007/978-3-031-77392-1_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-77391-4

  • Online ISBN: 978-3-031-77392-1

  • eBook Packages: Computer Science (R0)
