
OP-Align: Object-Level and Part-Level Alignment for Self-supervised Category-Level Articulated Object Pose Estimation

Conference paper — Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

Category-level articulated object pose estimation focuses on the pose estimation of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to solve this task. Our model consistently generates a reconstruction with a canonical pose and joint state for the entire input object, and it estimates object-level poses that reduce overall pose variance and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach significantly outperforms previous self-supervised methods and is comparable to state-of-the-art supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a new real-world articulated object benchmark dataset (code and dataset are released at https://github.com/YC-Che/OP-Align).
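To give a concrete sense of the part-level alignment objective sketched in the abstract, here is a minimal NumPy illustration. This is not the authors' implementation: the symmetric Chamfer distance, the per-part rigid transforms, and the function names (`chamfer`, `part_alignment_loss`) are all assumptions made for illustration. The idea it captures is that each input part, after applying its estimated rotation and translation, should lie close to the corresponding part of the canonical reconstruction.

```python
import numpy as np

def chamfer(a, b):
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3)."""
    # Pairwise distances via broadcasting: (N,1,3) - (1,M,3) -> (N,M)
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def part_alignment_loss(parts_in, parts_rec, rotations, translations):
    """Sum of Chamfer distances after applying each part's estimated
    rigid transform (R, t) to the corresponding input part."""
    total = 0.0
    for p_in, p_rec, R, t in zip(parts_in, parts_rec, rotations, translations):
        aligned = p_in @ R.T + t  # rigid transform of the input part
        total += chamfer(aligned, p_rec)
    return total
```

Under this sketch, a perfectly estimated transform drives the loss to zero, e.g. `part_alignment_loss([p], [p + 1.0], [np.eye(3)], [np.ones(3)])` is (numerically) zero, since translating `p` by the unit vector reproduces the target part exactly.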




Acknowledgements

We thank Ryutaro Yamauchi and Tatsushi Matsubayashi from ALBERT Inc. (now Accenture Japan Ltd.) for their insightful suggestions and support. This work was supported by JST FOREST Program, Grant Number JPMJFR206H.

Author information

Correspondence to Yuchen Che.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 2812 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Che, Y., Furukawa, R., Kanezaki, A. (2025). OP-Align: Object-Level and Part-Level Alignment for Self-supervised Category-Level Articulated Object Pose Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15133. Springer, Cham. https://doi.org/10.1007/978-3-031-73226-3_5


  • DOI: https://doi.org/10.1007/978-3-031-73226-3_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73225-6

  • Online ISBN: 978-3-031-73226-3

  • eBook Packages: Computer Science; Computer Science (R0)
