Abstract
This paper considers the problem of unsupervised 3D object reconstruction from in-the-wild single-view images. Due to ambiguity and intrinsic ill-posedness, this problem is inherently difficult to solve and therefore requires strong regularization to achieve disentanglement of different latent factors. Unlike existing works that introduce explicit regularizations into objective functions, we look into a different space for implicit regularization – the structure of latent space. Specifically, we restrict the structure of latent space to capture a topological causal ordering of latent factors (i.e., representing causal dependency as a directed acyclic graph). We first show that different causal orderings matter for 3D reconstruction, and then explore several approaches to find a task-dependent causal factor ordering. Our experiments demonstrate that the latent space structure indeed serves as an implicit regularization and introduces an inductive bias beneficial for reconstruction.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Agrawal, P., Carreira, J., Malik, J.: Learning to see by moving. In: ICCV (2015)
Albiero, V., Chen, X., Yin, X., Pang, G., Hassner, T.: img2pose: face alignment and detection via 6DoF, face pose estimation. In: CVPR (2021)
Besserve, M., Sun, R., Schölkopf, B.: Intrinsic disentanglement: an invariance view for deep generative models. In: ICML 2018 Workshop on Theoretical Foundations and Applications of Deep Generative Models (2018)
Blanz, V., Vetter, T.: A morphable model for the synthesis of 3D faces. In: Proceedings of the 26th Annual Conference on Computer Graphics and Interactive Techniques (1999)
Chang, A.X., et al.: ShapeNet: an information-rich 3D model repository. arXiv preprint arXiv:1512.03012 (2015)
Chen, C.H., Tyagi, A., Agrawal, A., Drover, D., Stojanov, S., Rehg, J.M.: Unsupervised 3D pose estimation with geometric self-supervision. In: CVPR (2019)
Chen, W., et al.: Learning to predict 3D objects with an interpolation-based differentiable renderer. In: NeurIPS (2019)
Choutas, V., Pavlakos, G., Bolkart, T., Tzionas, D., Black, M.J.: Monocular expressive body regression through body-driven attention. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12355, pp. 20–40. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58607-2_2
Choy, C.B., Xu, D., Gwak, J.Y., Chen, K., Savarese, S.: 3D-R2N2: a unified approach for single and multi-view 3D object reconstruction. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 628–644. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_38
Eigen, D., Puhrsch, C., Fergus, R.: Depth map prediction from a single image using a multi-scale deep network. In: NeurIPS (2014)
Elsken, T., Metzen, J.H., Hutter, F.: Neural architecture search: a survey. J. Mach. Learn. Res. 20(1), 1997–2017 (2019)
Fahim, G., Amin, K., Zarif, S.: Single-view 3D reconstruction: a survey of deep learning methods. Comput. Graph. 94, 164–190 (2021)
Fan, H., Su, H., Guibas, L.J.: A point set generation network for 3D object reconstruction from a single image. In: CVPR (2017)
Feng, Y., Choutas, V., Bolkart, T., Tzionas, D., Black, M.J.: Collaborative regression of expressive bodies using moderation. In: 3DV (2021)
Feng, Y., Feng, H., Black, M.J., Bolkart, T.: Learning an animatable detailed 3D face model from in-the-wild images. ACM Trans. Graph. (TOG) 40, 1–13 (2021)
François, A.R., Medioni, G.G., Waupotitsch, R.: Mirror symmetry \({=}{>}\) 2-view stereo geometry. Image Vis. Comput. 21(2), 137–143 (2003)
Frazier, P.I.: A tutorial on Bayesian optimization. arXiv preprint arXiv:1807.02811 (2018)
Fuentes-Pacheco, J., Ruiz-Ascencio, J., Rendón-Mancha, J.M.: Visual simultaneous localization and mapping: a survey. Artif. Intell. Rev. 43(1), 55–81 (2015)
Gecer, B., Ploumpis, S., Kotsia, I., Zafeiriou, S.: GANFit: generative adversarial network fitting for high fidelity 3D face reconstruction. In: CVPR (2019)
Gerig, T., et al.: Morphable face models-an open framework. In: 2018 13th IEEE International Conference on Automatic Face & Gesture Recognition (FG 2018) (2018)
Girdhar, R., Fouhey, D.F., Rodriguez, M., Gupta, A.: Learning a predictable and generative vector representation for objects. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9910, pp. 484–499. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46466-4_29
Green, R.: Spherical harmonic lighting: the gritty details. In: Archives of the Game Developers Conference, vol. 56, p. 4 (2003)
Gwak, J., Choy, C.B., Chandraker, M., Garg, A., Savarese, S.: Weakly supervised 3D reconstruction with adversarial constraint. In: 3DV (2017)
He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
Henderson, P., Ferrari, V.: Learning to generate and reconstruct 3D meshes with only 2D supervision. arXiv preprint arXiv:1807.09259 (2018)
Henderson, P., Ferrari, V.: Learning single-image 3D reconstruction by generative modelling of shape, pose and shading. IJCV 128(4), 835–854 (2020)
Ho, L.N., Tran, A.T., Phung, Q., Hoai, M.: Toward realistic single-view 3D object reconstruction with unsupervised learning from multiple images. In: ICCV (2021)
Horn, B.K., Brooks, M.J.: Shape from Shading. MIT Press, Cambridge (1989)
Hu, T., Wang, L., Xu, X., Liu, S., Jia, J.: Self-supervised 3D mesh reconstruction from single images. In: CVPR (2021)
Kanazawa, A., Black, M.J., Jacobs, D.W., Malik, J.: End-to-end recovery of human shape and pose. In: CVPR (2018)
Kanazawa, A., Tulsiani, S., Efros, A.A., Malik, J.: Learning category-specific mesh reconstruction from image collections. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11219, pp. 386–402. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01267-0_23
Kar, A., Häne, C., Malik, J.: Learning a multi-view stereo machine. In: NIPS (2017)
Kato, H., Harada, T.: Learning view priors for single-view 3D reconstruction. In: CVPR (2019)
Kato, H., Ushiku, Y., Harada, T.: Neural 3D mesh renderer. In: CVPR (2018)
Kilbertus, N., Parascandolo, G., Schölkopf, B.: Generalization in anti-causal learning. arXiv preprint arXiv:1812.00524 (2018)
Koenderink, J.J.: What does the occluding contour tell us about solid shape? Perception 13(3), 321–330 (1984)
Leeb, F., Lanzillotta, G., Annadani, Y., Besserve, M., Bauer, S., Schölkopf, B.: Structure by architecture: disentangled representations without regularization. arXiv preprint arXiv:2006.07796 (2020)
Li, T., Bolkart, T., Black, M.J., Li, H., Romero, J.: Learning a model of facial shape and expression from 4D scans. ACM Trans. Graph. 36(6), 1–17 (2017)
Li, X., et al.: Self-supervised single-view 3D reconstruction via semantic consistency. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12359, pp. 677–693. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58568-6_40
Liu, H., Simonyan, K., Yang, Y.: Darts: differentiable architecture search. In: ICLR (2019)
Liu, S., Li, T., Chen, W., Li, H.: Soft rasterizer: a differentiable renderer for image-based 3D reasoning. In: ICCV (2019)
Liu, W., Wen, Y., Raj, B., Singh, R., Weller, A.: Sphereface revived: unifying hyperspherical face recognition. TPAMI (2022)
Liu, W., Wen, Y., Yu, Z., Li, M., Raj, B., Song, L.: Sphereface: deep hypersphere embedding for face recognition. In: CVPR (2017)
Liu, Z., Luo, P., Wang, X., Tang, X.: Deep learning face attributes in the wild. In: ICCV (2015)
Loper, M., Mahmood, N., Romero, J., Pons-Moll, G., Black, M.J.: SMPL: a skinned multi-person linear model. ACM Trans. Graph. 34(6), 1–16 (2015)
Mukherjee, D.P., Zisserman, A.P., Brady, M., Smith, F.: Shape from symmetry: detecting and exploiting symmetry in affine images. Philos. Trans. R. Soc. Lond. Series A: Phys. Eng. Sci. 351(1695), 77–106 (1995)
Murphy, K.P.: Dynamic Bayesian Networks: Representation, Inference and Learning. University of California, Berkeley (2002)
Novotny, D., Larlus, D., Vedaldi, A.: Learning 3D object categories by looking around them. In: ICCV (2017)
Ozyesil, O., Voroninski, V., Basri, R., Singer, A.: A survey of structure from motion. arXiv preprint arXiv:1701.08493 (2017)
Pan, J., Han, X., Chen, W., Tang, J., Jia, K.: Deep mesh reconstruction from single RGB images via topology modification networks. In: ICCV (2019)
Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: CVPR (2012)
Paysan, P., Knothe, R., Amberg, B., Romdhani, S., Vetter, T.: A 3D face model for pose and illumination invariant face recognition. In: 2009 Sixth IEEE International Conference on Advanced Video and Signal Based Surveillance, pp. 296–301. IEEE (2009)
Pearl, J.: Causality. Cambridge University Press, Cambridge (2009)
Phong, B.T.: Illumination for computer generated pictures. Commun. ACM 18(6), 311–317 (1975)
Schölkopf, B., Janzing, D., Peters, J., Sgouritsa, E., Zhang, K., Mooij, J.: On causal and anticausal learning. In: Langford, J., Pineau, J. (eds.) Proceedings of the 29th International Conference on Machine Learning (ICML), pp. 1255–1262. Omnipress, New York (2012). http://icml.cc/2012/papers/625.pdf
Schölkopf, B., et al.: Toward causal representation learning. Proc. IEEE 109(5), 612–634 (2021)
Schroff, F., Kalenichenko, D., Philbin, J.: FaceNet: a unified embedding for face recognition and clustering. In: CVPR (2015)
Shen, X., Liu, F., Dong, H., Lian, Q., Chen, Z., Zhang, T.: Disentangled generative causal representation learning. arXiv preprint arXiv:2010.02637 (2020)
Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014)
Sinha, S.N., Ramnath, K., Szeliski, R.: Detecting and reconstructing 3D mirror symmetric objects. In: Fitzgibbon, A., Lazebnik, S., Perona, P., Sato, Y., Schmid, C. (eds.) ECCV 2012. LNCS, vol. 7573, pp. 586–600. Springer, Heidelberg (2012). https://doi.org/10.1007/978-3-642-33709-3_42
Suwajanakorn, S., Snavely, N., Tompson, J.J., Norouzi, M.: Discovery of latent 3D keypoints via end-to-end geometric reasoning. In: NeurIPS (2018)
Tewari, A., et al.: MoFA: model-based deep convolutional face autoencoder for unsupervised monocular reconstruction. In: ICCV (2017)
Thrun, S., Wegbreit, B.: Shape from symmetry. In: ICCV (2005)
Tulsiani, S., Zhou, T., Efros, A.A., Malik, J.: Multi-view supervision for single-view reconstruction via differentiable ray consistency. In: CVPR (2017)
Vowels, M.J., Camgoz, N.C., Bowden, R.: D’ya like DAGs? A survey on structure learning and causal discovery. arXiv preprint arXiv:2103.02582 (2021)
Wang, C., Buenaposada, J.M., Zhu, R., Lucey, S.: Learning depth from monocular videos using direct methods. In: CVPR (2018)
Wang, M., Shu, Z., Cheng, S., Panagakis, Y., Samaras, D., Zafeiriou, S.: An adversarial neuro-tensorial approach for learning disentangled representations. IJCV 127(6), 743–762 (2019)
Wang, N., Zhang, Y., Li, Z., Fu, Y., Liu, W., Jiang, Y.-G.: Pixel2Mesh: generating 3D mesh models from single RGB images. In: Ferrari, V., Hebert, M., Sminchisescu, C., Weiss, Y. (eds.) ECCV 2018. LNCS, vol. 11215, pp. 55–71. Springer, Cham (2018). https://doi.org/10.1007/978-3-030-01252-6_4
Wang, Q., et al.: Exponential convergence of the deep neural network approximation for analytic functions. arXiv preprint arXiv:1807.00297 (2018)
Weichwald, S., Schölkopf, B., Ball, T., Grosse-Wentrup, M.: Causal and anti-causal learning in pattern recognition for neuroimaging. In: International Workshop on Pattern Recognition in Neuroimaging (2014)
Wen, C., Zhang, Y., Li, Z., Fu, Y.: Pixel2Mesh++: multi-view 3D mesh generation via deformation. In: ICCV (2019)
Wen, Y., Liu, W., Raj, B., Singh, R.: Self-supervised 3d face reconstruction via conditional estimation. In: ICCV (2021)
Wiles, O., Zisserman, A.: SilNet: single-and multi-view reconstruction by learning from silhouettes. In: BMVC (2017)
Witkin, A.P.: Recovering surface shape and orientation from texture. Artif. Intell. 17(1–3), 17–45 (1981)
Wu, S., Rupprecht, C., Vedaldi, A.: Unsupervised learning of probably symmetric deformable 3D objects from images in the wild. In: CVPR (2020)
Xiang, Yu., et al.: ObjectNet3D: a large scale database for 3D object recognition. In: Leibe, B., Matas, J., Sebe, N., Welling, M. (eds.) ECCV 2016. LNCS, vol. 9912, pp. 160–176. Springer, Cham (2016). https://doi.org/10.1007/978-3-319-46484-8_10
Xie, H., Yao, H., Sun, X., Zhou, S., Zhang, S.: Pix2Vox: context-aware 3D reconstruction from single and multi-view images. In: ICCV (2019)
Yan, X., Yang, J., Yumer, E., Guo, Y., Lee, H.: Perspective transformer nets: learning single-view 3D object reconstruction without 3D supervision. In: NIPS (2016)
Yang, M., Liu, F., Chen, Z., Shen, X., Hao, J., Wang, J.: CausalVAE: disentangled representation learning via neural structural causal models. In: CVPR (2021)
Yi, H., et al.: MMFace: a multi-metric regression network for unconstrained face reconstruction. In: CVPR (2019)
Yu, Y., Chen, J., Gao, T., Yu, M.: DAG-GNN: DAG structure learning with graph neural networks. In: ICML (2019)
Zhang, R., Tsai, P.S., Cryer, J.E., Shah, M.: Shape-from-shading: a survey. TPAMI 21(8), 690–706 (1999)
Zhang, W., Sun, J., Tang, X.: Cat head detection - how to effectively exploit shape and texture features. In: Forsyth, D., Torr, P., Zisserman, A. (eds.) ECCV 2008. LNCS, vol. 5305, pp. 802–816. Springer, Heidelberg (2008). https://doi.org/10.1007/978-3-540-88693-8_59
Zheng, X., Aragam, B., Ravikumar, P.K., Xing, E.P.: DAGs with no tears: continuous optimization for structure learning. In: NeurIPS (2018)
Zhou, T., Brown, M., Snavely, N., Lowe, D.G.: Unsupervised learning of depth and ego-motion from video. In: CVPR (2017)
Zhu, R., Kiani Galoogahi, H., Wang, C., Lucey, S.: Rethinking reprojection: closing the loop for pose-aware shape reconstruction from a single image. In: ICCV (2017)
Zoph, B., Le, Q.V.: Neural architecture search with reinforcement learning. In: ICLR (2017)
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
1 Electronic supplementary material
Below is the link to the electronic supplementary material.
Rights and permissions
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Liu, W., Liu, Z., Paull, L., Weller, A., Schölkopf, B. (2022). Structural Causal 3D Reconstruction. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_9
Download citation
DOI: https://doi.org/10.1007/978-3-031-19769-7_9
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-19768-0
Online ISBN: 978-3-031-19769-7
eBook Packages: Computer ScienceComputer Science (R0)