Abstract
6D Object Pose Estimation is a fundamental problem in robotics and augmented reality. Most of today’s state-of-the-art approaches rely on deep learning and require large sets of training images depicting the target objects. A growing number of algorithms try to generalize from a set of known objects, available for training, to unseen objects at test time. Among those, GigaPose is a template-based approach that renders the target object in an onboarding phase shortly before inference and uses learned latent codes of these renderings and of observed objects for feature matching. While learned representations prove powerful in a wide range of tasks, we propose the integration of additional purely geometric features, which can be extracted essentially for free from the available 3D meshes during the onboarding phase. This representation is then used as an additional input for template matching and 2D-2D correspondence matching in our approach. We consider multiple relevant features and, implementing one of them, demonstrate improved performance on the core datasets of the BOP Challenge. Our results suggest that utilizing additional geometric features can indeed improve the relevant metrics without much additional cost.
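To make the idea of "geometric features extracted essentially for free from the mesh" concrete, here is a minimal, hypothetical sketch of one such per-vertex feature. The paper does not specify its feature computation here, so this illustrative example (the function name `vertex_geometric_features` and the particular curvature-like "bending" score are our assumptions, not the authors' method) simply derives vertex normals and a local surface-bending proxy from the mesh geometry alone, with no learning involved:

```python
import numpy as np

def vertex_geometric_features(vertices, faces):
    """Per-vertex normals and a simple curvature proxy for a triangle mesh.

    vertices: (V, 3) float array; faces: (F, 3) int array of vertex indices.
    Returns (V, 3) unit vertex normals and a (V,) "bending" score:
    ~0 on flat regions, larger where incident face normals disagree.
    """
    v0, v1, v2 = (vertices[faces[:, i]] for i in range(3))
    face_n = np.cross(v1 - v0, v2 - v0)  # area-weighted face normals
    face_n /= np.linalg.norm(face_n, axis=1, keepdims=True) + 1e-12

    # Scatter-add each face normal onto its three vertices.
    vert_n = np.zeros_like(vertices)
    counts = np.zeros(len(vertices))
    for k in range(3):
        np.add.at(vert_n, faces[:, k], face_n)
        np.add.at(counts, faces[:, k], 1.0)
    vert_n /= np.linalg.norm(vert_n, axis=1, keepdims=True) + 1e-12

    # Bending proxy: mean (1 - cos) between a vertex normal and the
    # normals of its incident faces; zero when the surface is flat.
    bend = np.zeros(len(vertices))
    for k in range(3):
        cos = np.einsum('ij,ij->i', vert_n[faces[:, k]], face_n)
        np.add.at(bend, faces[:, k], 1.0 - cos)
    return vert_n, bend / np.maximum(counts, 1.0)
```

Because such features depend only on the mesh, they can be computed once per object during the onboarding phase and then rendered alongside the templates, which is what makes them nearly free compared to additional learned descriptors.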
Acknowledgement
This research has received funding from the European Union’s Horizon Europe programme in the course of the ZDZW project under grant agreement No 101057404. Views and opinions expressed are, however, those of the author(s) only and do not necessarily reflect those of the European Union. Neither the European Union nor the granting authority can be held responsible for them.
Ethics declarations
Disclosure of Interests
The authors have no competing interests to declare that are relevant to the content of this article.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Pöllabauer, T., Weyel, J., Knauthe, V., Berkei, S., Kuijper, A. (2025). Improving Zero-Shot Template-Based 6D Pose Estimation with Geometric Features. In: Bebis, G., et al. Advances in Visual Computing. ISVC 2024. Lecture Notes in Computer Science, vol 15046. Springer, Cham. https://doi.org/10.1007/978-3-031-77392-1_4
DOI: https://doi.org/10.1007/978-3-031-77392-1_4
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-77391-4
Online ISBN: 978-3-031-77392-1