Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering

  • Conference paper in Computer Vision – ECCV 2024 (ECCV 2024)
  • Part of the book series: Lecture Notes in Computer Science (LNCS, volume 15119)

Abstract

The correct insertion of virtual objects in images of real-world scenes requires a deep understanding of the scene’s lighting, geometry, and materials, as well as the image formation process. While recent large-scale diffusion models have shown strong generative and inpainting capabilities, we find that current models do not sufficiently “understand” the scene shown in a single picture to generate consistent lighting effects (shadows, bright reflections, etc.) while preserving the identity and details of the composited object. We propose using a personalized large diffusion model as guidance for a physically based inverse rendering process. Our method recovers scene lighting and tone-mapping parameters, allowing the photorealistic composition of arbitrary virtual objects in single frames or videos of indoor or outdoor scenes. Our physically based pipeline further enables automatic material and tone-mapping refinement.
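
The abstract describes the method only at a high level, but the optimization it implies has a recognizable shape: render the composited scene with a differentiable, physically based renderer, ask a (personalized) diffusion model how plausible the result looks, and backpropagate that signal into lighting and tone-mapping parameters. The sketch below is a minimal, self-contained illustration of that loop in PyTorch, assuming a DreamFusion-style score-distillation update. The toy renderer, the placeholder `guidance_direction` function, and all names in it are hypothetical stand-ins, not the authors' pipeline, which would pair a real physically based renderer with an actual diffusion model.

```python
# Hypothetical sketch of diffusion-guided inverse rendering of lighting and
# tone-mapping parameters. Only `torch` is required; the renderer and the
# guidance term below are deliberately toy stand-ins.
import torch
import torch.nn.functional as F


class ToyRenderer(torch.nn.Module):
    """Stand-in for a physically based differentiable renderer: Lambertian
    shading under an optimizable HDR environment map, followed by a
    differentiable tone-mapping curve (exposure + gamma)."""

    def __init__(self, envmap_res=(16, 32)):
        super().__init__()
        # The quantities the method recovers: a log-space HDR environment
        # map, a scalar exposure, and an inverse-gamma tone-mapping exponent.
        self.log_envmap = torch.nn.Parameter(torch.zeros(3, *envmap_res))
        self.log_exposure = torch.nn.Parameter(torch.zeros(()))
        self.inv_gamma = torch.nn.Parameter(torch.tensor(1.0 / 2.2))

    def forward(self, normals):  # normals: (H, W, 3) unit normals
        envmap = self.log_envmap.exp()            # HDR radiance, (3, h, w)
        ambient = envmap.mean(dim=(1, 2))         # crude irradiance proxy, (3,)
        light_dir = F.normalize(torch.tensor([0.3, 0.8, 0.5]), dim=0)
        ndotl = (normals @ light_dir).clamp(min=0.0)[..., None]
        hdr = ndotl * ambient + 0.1 * ambient     # linear HDR image, (H, W, 3)
        # Differentiable tone mapping: exposure, then gamma.
        ldr = (hdr * self.log_exposure.exp()).clamp(min=1e-6) ** self.inv_gamma
        return ldr.clamp(0.0, 1.0)


def guidance_direction(image):
    """Placeholder for the diffusion guidance. A real implementation would
    noise the render, denoise it with a personalized diffusion model, and
    return the residual (score distillation). Here we just nudge the image
    toward a blurred copy of itself so the loop runs end to end."""
    with torch.no_grad():
        chw = image.permute(2, 0, 1)[None]        # (1, 3, H, W)
        blurred = F.avg_pool2d(chw, 5, 1, 2)[0].permute(1, 2, 0)
    return image - blurred


renderer = ToyRenderer()
opt = torch.optim.Adam(renderer.parameters(), lr=1e-2)
normals = F.normalize(torch.randn(64, 64, 3), dim=-1)  # placeholder geometry

for step in range(200):
    opt.zero_grad()
    render = renderer(normals)
    # SDS-style update: treat the (detached) guidance direction as the
    # gradient of an implicit loss and backpropagate it into the render.
    g = guidance_direction(render)
    (render * g.detach()).sum().backward()
    opt.step()
```

In the paper's setting, the rendered image would be the virtual object composited into the real photograph, and the guidance would come from a diffusion model personalized to that object, so that the object's identity is preserved while shadows, reflections, and tone mapping are optimized.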

Acknowledgements

The authors are grateful for the feedback received from Nicholas Sharp and Huan Ling during the project. We thank the original artists of the 3D assets used in this work: inciprocal, peyman.khaleghi, Kuutti Siitonen, TurboSquid and their artists Hum3D and Amaranthus.

Author information

Corresponding author: Zian Wang.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (PDF 7793 KB)

Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Liang, R. et al. (2025). Photorealistic Object Insertion with Diffusion-Guided Inverse Rendering. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15119. Springer, Cham. https://doi.org/10.1007/978-3-031-73030-6_25

  • DOI: https://doi.org/10.1007/978-3-031-73030-6_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73029-0

  • Online ISBN: 978-3-031-73030-6

  • eBook Packages: Computer Science (R0)
