
Towards High-Fidelity Single-View Holistic Reconstruction of Indoor Scenes

  • Conference paper
Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13661)


Abstract

We present a new framework to reconstruct holistic 3D indoor scenes, including both the room background and indoor objects, from single-view images. Existing methods can only produce 3D shapes of indoor objects with limited geometry quality because of the heavy occlusion in indoor scenes. To address this, we propose an instance-aligned implicit function (InstPIFu) for detailed object reconstruction. Combined with an instance-aligned attention module, our method is able to decouple mixed local features and assign them to the occluded instances. Additionally, unlike previous methods that simply represent the room background as a 3D bounding box, a depth map, or a set of planes, we recover the fine geometry of the background via an implicit representation. Extensive experiments on the SUN RGB-D, Pix3D, 3D-FUTURE, and 3D-FRONT datasets demonstrate that our method outperforms existing approaches in both background and foreground object reconstruction. Our code and model will be made publicly available.
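As a rough illustration of the idea outlined above, the sketch below shows a PIFu-style occupancy predictor whose pixel-aligned local features are gated by a per-instance embedding before decoding, so that features belonging to the queried (possibly occluded) object can be separated from those of other instances. This is only a minimal PyTorch sketch under assumed names and dimensions (the classes InstanceAlignedAttention and InstPIFuSketch, the toy convolutional encoder, and the sigmoid gate standing in for the attention are all illustrative assumptions); it is not the authors' InstPIFu implementation.

```python
# Minimal sketch of a PIFu-style instance-aligned implicit function.
# NOT the authors' released code; module names, feature sizes and the
# attention design are illustrative assumptions based on the abstract.
import torch
import torch.nn as nn
import torch.nn.functional as F


class InstanceAlignedAttention(nn.Module):
    """Re-weights pixel-aligned local features with an instance embedding,
    so features of the queried (possibly occluded) object are emphasised."""

    def __init__(self, feat_dim: int, inst_dim: int):
        super().__init__()
        self.to_gate = nn.Sequential(
            nn.Linear(feat_dim + inst_dim, feat_dim),
            nn.Sigmoid(),  # per-channel gate in [0, 1]
        )

    def forward(self, local_feat: torch.Tensor, inst_embed: torch.Tensor) -> torch.Tensor:
        # local_feat: (B, N, C) features sampled at N query projections
        # inst_embed: (B, D), one embedding per target instance
        inst = inst_embed.unsqueeze(1).expand(-1, local_feat.shape[1], -1)
        gate = self.to_gate(torch.cat([local_feat, inst], dim=-1))
        return local_feat * gate


class InstPIFuSketch(nn.Module):
    """Toy instance-aligned implicit function: image encoder -> pixel-aligned
    feature sampling -> instance-aware gating -> occupancy MLP."""

    def __init__(self, feat_dim: int = 64, inst_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(  # stand-in for a real backbone
            nn.Conv2d(3, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
        )
        self.attention = InstanceAlignedAttention(feat_dim, inst_dim)
        self.decoder = nn.Sequential(  # (feature + query depth) -> occupancy
            nn.Linear(feat_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, image, points_uv, points_z, inst_embed):
        # image: (B, 3, H, W); points_uv: (B, N, 2) in [-1, 1]; points_z: (B, N, 1)
        feat_map = self.encoder(image)
        # Pixel-aligned sampling: look up features at each query's 2D projection.
        sampled = F.grid_sample(feat_map, points_uv.unsqueeze(2), align_corners=True)
        local_feat = sampled.squeeze(-1).permute(0, 2, 1)     # (B, N, C)
        local_feat = self.attention(local_feat, inst_embed)   # decouple instances
        occ = self.decoder(torch.cat([local_feat, points_z], dim=-1))
        return torch.sigmoid(occ)                             # (B, N, 1) occupancy


if __name__ == "__main__":
    model = InstPIFuSketch()
    img = torch.randn(2, 3, 64, 64)
    uv = torch.rand(2, 100, 2) * 2 - 1
    z = torch.randn(2, 100, 1)
    inst = torch.randn(2, 16)
    print(model(img, uv, z, inst).shape)  # torch.Size([2, 100, 1])
```

The full method also reconstructs the room background with an implicit representation, as stated in the abstract; that branch is omitted from this sketch for brevity.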

H. Liu and Y. Zheng contributed equally to this paper.



Acknowledgement

The work was supported in part by the National Key R&D Program of China with grant No. 2018YFB1800800, the Basic Research Project No. HZQB-KCZYZ-2021067 of Hetao Shenzhen-HK S&T Cooperation Zone, by Shenzhen Outstanding Talents Training Fund 202002, by Guangdong Research Projects No. 2017ZT07X152 and No. 2019CX01X104, and by the Guangdong Provincial Key Laboratory of Future Networks of Intelligence (Grant No. 2022B1212010001). It was also supported by NSFC-62172348, NSFC-61902334 and Shenzhen General Project (No. JCYJ20190814112007258). Thanks to the ITSO in CUHKSZ for their High-Performance Computing Services.

Author information


Corresponding author

Correspondence to Xiaoguang Han.


Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 6335 KB)


Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liu, H., Zheng, Y., Chen, G., Cui, S., Han, X. (2022). Towards High-Fidelity Single-View Holistic Reconstruction of Indoor Scenes. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13661. Springer, Cham. https://doi.org/10.1007/978-3-031-19769-7_25

  • DOI: https://doi.org/10.1007/978-3-031-19769-7_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19768-0

  • Online ISBN: 978-3-031-19769-7

  • eBook Packages: Computer Science, Computer Science (R0)
