Abstract
In recent years, deep learning models have achieved remarkable success in computer vision tasks, but their ability to process and reason about multi-modal data has remained limited. The emergence of models that use a contrastive loss to learn a joint embedding space for images and text has sparked research in unsupervised multi-modal alignment. This paper proposes a contrastive model for the multi-modal alignment of images and 3D representations. In particular, we study the alignment of images and raw point clouds in a learned latent space. The effectiveness of the proposed model is demonstrated through several experiments, including 3D shape retrieval from a single image, testing on out-of-distribution data, and latent space analysis.
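To make the idea of contrastive alignment concrete, the following is a minimal sketch, not the authors' implementation. It assumes a CLIP-style symmetric cross-entropy over cosine similarities, which the abstract does not specify; the function names, the temperature value of 0.07, and the toy embeddings are all hypothetical. It takes already-computed image and point cloud embeddings (the encoders are out of scope here) and also shows how shape retrieval from a single image could reduce to a nearest-neighbour lookup in the joint space.

```python
import math


def l2_normalize(v):
    """Scale a non-zero vector to unit length so dot products are cosines."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def similarity_matrix(image_embs, cloud_embs):
    """Cosine similarity between every image and every point cloud embedding."""
    imgs = [l2_normalize(v) for v in image_embs]
    clouds = [l2_normalize(v) for v in cloud_embs]
    return [[sum(a * b for a, b in zip(i, c)) for c in clouds] for i in imgs]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class of row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(image_embs, cloud_embs, temperature=0.07):
    """Assumed CLIP-style loss: pair i (image i, cloud i) is the only positive.

    The temperature is a hypothetical default, not a value from the paper.
    """
    sims = similarity_matrix(image_embs, cloud_embs)
    logits = [[s / temperature for s in row] for row in sims]
    logits_t = [list(col) for col in zip(*logits)]  # cloud-to-image direction
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


def retrieve_shape(image_emb, cloud_embs):
    """Index of the point cloud embedding most similar to one image embedding."""
    sims = similarity_matrix([image_emb], cloud_embs)[0]
    return max(range(len(sims)), key=sims.__getitem__)


# Toy 2-D embeddings standing in for encoder outputs (illustrative only).
images = [[1.0, 0.0], [0.0, 1.0]]
clouds_aligned = [[1.0, 0.0], [0.0, 1.0]]
clouds_shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = symmetric_contrastive_loss(images, clouds_aligned)
loss_shuffled = symmetric_contrastive_loss(images, clouds_shuffled)
best_match = retrieve_shape([0.9, 0.1], clouds_aligned)
```

In this toy setting the loss is close to zero when each image embedding coincides with its paired point cloud embedding and large when the pairs are swapped, which is the behaviour a contrastive objective of this kind is meant to encourage.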
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Sbrolli, C., Cudrano, P., Matteucci, M. (2023). CISPc: Embedding Images and Point Clouds in a Joint Concept Space by Contrastive Learning. In: Foresti, G.L., Fusiello, A., Hancock, E. (eds) Image Analysis and Processing – ICIAP 2023. ICIAP 2023. Lecture Notes in Computer Science, vol 14234. Springer, Cham. https://doi.org/10.1007/978-3-031-43153-1_39
DOI: https://doi.org/10.1007/978-3-031-43153-1_39
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-43152-4
Online ISBN: 978-3-031-43153-1
eBook Packages: Computer Science, Computer Science (R0)