Abstract
Category-level articulated object pose estimation focuses on estimating the pose of unknown articulated objects within known categories. Despite its significance, this task remains challenging due to the varying shapes and poses of objects, expensive dataset annotation costs, and complex real-world environments. In this paper, we propose a novel self-supervised approach that leverages a single-frame point cloud to solve this task. Our model consistently generates a reconstruction with a canonical pose and joint state for the entire input object; it estimates object-level poses that reduce overall pose variance, and part-level poses that align each part of the input with its corresponding part of the reconstruction. Experimental results demonstrate that our approach significantly outperforms previous self-supervised methods and is comparable to state-of-the-art supervised methods. To assess the performance of our model in real-world scenarios, we also introduce a new real-world articulated object benchmark dataset. Code and dataset are released at https://github.com/YC-Che/OP-Align.
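The part-level alignment idea above can be illustrated with a minimal sketch: estimate a rigid transform for each input part and score how well the transformed part overlaps its reconstructed counterpart. This is not the authors' implementation; all function names are hypothetical, and for simplicity it assumes known per-part point correspondences (solved in closed form via the Kabsch algorithm) and uses a symmetric Chamfer distance as the residual.

```python
import numpy as np

def chamfer(a, b):
    # Symmetric Chamfer distance between point sets a (N,3) and b (M,3).
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def kabsch(src, dst):
    # Least-squares rigid transform (R, t) mapping src onto dst,
    # assuming src[i] corresponds to dst[i].
    cs, cd = src.mean(0), dst.mean(0)
    H = (src - cs).T @ (dst - cd)
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))  # guard against reflections
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T
    t = cd - R @ cs
    return R, t

def part_alignment_loss(input_parts, recon_parts):
    # Align each observed part to its reconstructed counterpart and
    # accumulate the residual Chamfer distance.
    loss = 0.0
    for src, dst in zip(input_parts, recon_parts):
        R, t = kabsch(src, dst)
        loss += chamfer(src @ R.T + t, dst)
    return loss
```

In the self-supervised setting the correspondences and segmentation would themselves be predicted by the network, but the loss structure (rigid per-part transforms plus a point-set distance to the canonical reconstruction) is the same in spirit.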
Acknowledgements
We thank Ryutaro Yamauchi and Tatsushi Matsubayashi from ALBERT Inc. (now Accenture Japan Ltd.) for their insightful suggestions and support. This work was supported by JST FOREST Program, Grant Number JPMJFR206H.
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Che, Y., Furukawa, R., Kanezaki, A. (2025). OP-Align: Object-Level and Part-Level Alignment for Self-supervised Category-Level Articulated Object Pose Estimation. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15133. Springer, Cham. https://doi.org/10.1007/978-3-031-73226-3_5
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-73225-6
Online ISBN: 978-3-031-73226-3