PolyOculus: Simultaneous Multi-view Image-Based Novel View Synthesis

  • Conference paper
  • First Online:
Computer Vision – ECCV 2024 (ECCV 2024)

Abstract

This paper considers the problem of generative novel view synthesis (GNVS), generating novel, plausible views of a scene given a limited number of known views. Here, we propose a set-based generative model that can simultaneously generate multiple, self-consistent new views, conditioned on any number of views. Our approach is not limited to generating a single image at a time and can condition on a variable number of views. As a result, when generating a large number of views, our method is not restricted to a low-order autoregressive generation approach and is better able to maintain generated image quality over large sets of images. We evaluate our model on standard NVS datasets and show that it outperforms the state-of-the-art image-based GNVS baselines. Further, we show that the model is capable of generating sets of views that have no natural sequential ordering, like loops and binocular trajectories, and significantly outperforms other methods on such tasks. Our project page is available at: https://yorkucvil.github.io/PolyOculus-NVS/.
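The contrast the abstract draws between low-order autoregressive generation and set-based generation can be made concrete with a small scheduling sketch. The functions below are purely illustrative (the names `autoregressive_order` and `set_based_order`, the window and group sizes, and the step format are all assumptions, not the authors' implementation): each step is a pair of (conditioning view indices, view indices generated at that step). In the autoregressive case every view is generated alone, conditioned only on a short recent window, so errors can accumulate; in the set-based case whole groups of views are generated jointly, conditioned on everything produced so far, which is why orderless targets like loops pose no difficulty.

```python
def autoregressive_order(num_views: int, window: int = 1) -> list:
    """Low-order autoregressive schedule: each view is generated alone,
    conditioned only on the `window` most recently generated views."""
    steps = []
    for i in range(num_views):
        cond = list(range(max(0, i - window), i))
        steps.append((cond, [i]))
    return steps


def set_based_order(num_views: int, group: int = 4) -> list:
    """Set-based schedule in the spirit of the paper: views are generated
    jointly in groups, each group conditioned on ALL views produced so far."""
    steps = []
    done = []
    for start in range(0, num_views, group):
        batch = list(range(start, min(start + group, num_views)))
        steps.append((list(done), batch))
        done.extend(batch)
    return steps


# For 8 target views: 8 sequential single-view steps vs. 2 joint group steps.
print(autoregressive_order(8))
print(set_based_order(8, group=4))
```

Note that `set_based_order` imposes no ordering within a group, so a closed loop of camera poses can be generated in one step without choosing an artificial traversal direction.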



Acknowledgements

This work was completed with support from the Vector Institute, and was funded in part by the Canada First Research Excellence Fund (CFREF) for the Vision: Science to Applications (VISTA) program (M.A.B., K.G.D., T.T.A.A.), the NSERC Discovery Grant program (M.A.B., K.G.D.), and the NSERC Canada Graduate Scholarship Doctoral program (J.J.Y.).

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Jason J. Yu.

Editor information

Editors and Affiliations

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (zip 83188 KB)


Copyright information

© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Yu, J.J., Aumentado-Armstrong, T., Forghani, F., Derpanis, K.G., Brubaker, M.A. (2025). PolyOculus: Simultaneous Multi-view Image-Based Novel View Synthesis. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds) Computer Vision – ECCV 2024. ECCV 2024. Lecture Notes in Computer Science, vol 15121. Springer, Cham. https://doi.org/10.1007/978-3-031-73036-8_25

  • DOI: https://doi.org/10.1007/978-3-031-73036-8_25

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-73035-1

  • Online ISBN: 978-3-031-73036-8

  • eBook Packages: Computer Science, Computer Science (R0)
