Abstract
3D object part segmentation is essential for many computer vision applications. While substantial progress has been made on 2D object part segmentation, its 3D counterpart has received less attention, in part due to the scarcity of annotated 3D datasets, which are expensive to collect. In this work, we propose to leverage a few annotated 3D shapes or richly annotated 2D datasets to perform 3D object part segmentation. We present a novel approach, termed 3-By-2, that achieves state-of-the-art performance on benchmarks spanning multiple levels of part granularity. By using features from pretrained foundation models and exploiting semantic and geometric correspondences, we overcome the challenge of limited 3D annotations. Our approach leverages available 2D labels, enabling effective 3D object part segmentation. 3-By-2 accommodates various part taxonomies and granularities, and demonstrates part label transfer across different object categories. Project website: https://ngailapdi.github.io/projects/3by2/.
Work done as an intern at Meta AI (FAIR).
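The core idea described in the abstract, transferring 2D part labels to 3D points by matching pretrained foundation-model features, can be illustrated with a minimal sketch. The snippet below is a toy nearest-neighbor label transfer in feature space and is not the authors' actual pipeline (which involves multi-view rendering and geometric aggregation); the feature dimensions, part names, and `transfer_labels` helper are all illustrative assumptions.

```python
import numpy as np

def transfer_labels(src_feats, src_labels, tgt_feats):
    """Assign each target point the label of its most similar source feature.

    Toy stand-in for 2D-to-3D label transfer: source features play the role
    of labeled 2D pixels, target features the role of 3D points rendered into
    the same pretrained feature space.
    """
    # L2-normalize so the dot product equals cosine similarity.
    src = src_feats / np.linalg.norm(src_feats, axis=1, keepdims=True)
    tgt = tgt_feats / np.linalg.norm(tgt_feats, axis=1, keepdims=True)
    sim = tgt @ src.T                      # (n_tgt, n_src) similarity matrix
    return src_labels[sim.argmax(axis=1)]  # best-matching source label per target

# Toy example: two well-separated part clusters in a 2D feature space.
rng = np.random.default_rng(0)
seat = rng.normal(loc=[1.0, 0.0], scale=0.05, size=(10, 2))
leg = rng.normal(loc=[0.0, 1.0], scale=0.05, size=(10, 2))
src_feats = np.vstack([seat, leg])
src_labels = np.array([0] * 10 + [1] * 10)       # 0 = "seat", 1 = "leg"
tgt_feats = np.array([[0.9, 0.1], [0.1, 0.9]])   # query "3D point" features
print(transfer_labels(src_feats, src_labels, tgt_feats))  # → [0 1]
```

In practice the source features would come from a model such as DINO applied to labeled 2D images, and the target features from renderings of the 3D shape; the nearest-neighbor step here stands in for the richer semantic and geometric correspondence machinery the paper describes.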
Acknowledgement
This work was partly supported by NIH R01HD104624-01A1.
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Thai, A., Wang, W., Tang, H., Stojanov, S., Rehg, J.M., Feiszli, M. (2025). \(3\times 2\): 3D Object Part Segmentation by 2D Semantic Correspondences. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15096. Springer, Cham. https://doi.org/10.1007/978-3-031-72920-1_9
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-72919-5
Online ISBN: 978-3-031-72920-1
eBook Packages: Computer Science; Computer Science (R0)