Improving Dynamic 3D Gaussian Splatting from Monocular Videos with Object Motion Information

  • Conference paper
  • First Online:
Advanced Intelligent Computing Technology and Applications (ICIC 2024)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14872)

Abstract

Despite the significant advancements achieved by recent 3D-Gaussian-based approaches to dynamic scene reconstruction, their efficacy is markedly diminished in monocular settings, particularly under rapid object motion. This issue arises from the inherent one-to-many mapping between a monocular video and the dynamic scene: precise object motion states are difficult to discern from a monocular video, while different motion states may correspond to distinct scenes. To alleviate this issue, we first explicitly extract object motion state information from the monocular video with a pretrained video tracking model, TAM, and then separate the 3D Gaussians into static and dynamic subsets based on this motion information. Second, we present a three-stage training strategy to optimize the 3D Gaussians across distinct motion states. Moreover, we introduce an innovative augmentation technique that provides augmented views for supervising the 3D Gaussians, thereby enriching the model with the multi-view information that is pivotal for accurately interpreting motion states. Our empirical evaluations on Nvidia and iPhone, two of the most challenging monocular datasets, demonstrate our method's superior reconstruction capability over other dynamic Gaussian models.
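
The static/dynamic separation mentioned in the abstract can be illustrated with a minimal sketch. The Python snippet below is a hypothetical illustration, not the authors' implementation: the function names `project` and `split_static_dynamic`, the visibility voting, and the `vote_ratio` threshold are assumptions. It shows one plausible way to label 3D Gaussian centers as static or dynamic by projecting them into each frame and checking whether they land inside the moving-object masks produced by a video tracking/segmentation model such as TAM.

```python
import numpy as np

def project(points_w, K, w2c):
    """Project (N, 3) world-space points with a 3x4 world-to-camera matrix and 3x3 intrinsics."""
    pts_h = np.concatenate([points_w, np.ones((points_w.shape[0], 1))], axis=1)  # (N, 4) homogeneous
    cam = pts_h @ w2c.T                      # (N, 3) camera-space points
    z = np.clip(cam[:, 2:3], 1e-6, None)     # guard against points at or behind the camera plane
    uv = (cam / z) @ K.T                     # (N, 3); pixel coordinates in the first two columns
    return uv[:, :2], cam[:, 2]

def split_static_dynamic(means_w, frames, vote_ratio=0.3):
    """
    means_w : (N, 3) array of Gaussian centers in world space.
    frames  : list of (mask, K, w2c) tuples; mask is an (H, W) boolean moving-object mask.
    Returns a boolean array that is True where a Gaussian is treated as dynamic.
    """
    n = means_w.shape[0]
    votes = np.zeros(n, dtype=np.int64)   # frames in which the center lands inside the motion mask
    seen = np.zeros(n, dtype=np.int64)    # frames in which the center is visible at all
    for mask, K, w2c in frames:
        uv, depth = project(means_w, K, w2c)
        h, w = mask.shape
        u = np.round(uv[:, 0]).astype(int)
        v = np.round(uv[:, 1]).astype(int)
        visible = (depth > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        seen += visible
        hits = visible.copy()
        hits[visible] = mask[v[visible], u[visible]]
        votes += hits
    # Dynamic if the center falls inside the motion mask in enough of the frames where it is visible.
    return (seen > 0) & (votes >= vote_ratio * np.maximum(seen, 1))
```

In such a scheme, the resulting boolean split could then drive separate optimization of the static and dynamic Gaussian subsets; the voting threshold here is purely illustrative.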

References

  1. Agarwal, A., Arora, C.: Attention attention everywhere: monocular depth prediction with skip attention. In: Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pp. 5861–5870 (2023)

  2. Bhat, S.F., Alhashim, I., Wonka, P.: AdaBins: depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018 (2021)

  3. Bhat, S.F., Alhashim, I., Wonka, P.: LocalBins: improving depth estimation by learning local distributions. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision, pp. 480–496. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19769-7_28

  4. Bhat, S.F., Birkl, R., Wofk, D., Wonka, P., Müller, M.: ZoeDepth: zero-shot transfer by combining relative and metric depth. arXiv preprint arXiv:2302.12288 (2023)

  5. Cao, A., Johnson, J.: HexPlane: a fast representation for dynamic scenes. In: CVPR (2023)

  6. Chen, A., Xu, Z., Geiger, A., Yu, J., Su, H.: TensoRF: tensorial radiance fields. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds.) European Conference on Computer Vision (ECCV), vol. 13692. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-19824-3_20

  7. Fang, J., et al.: Fast dynamic radiance fields with time-aware neural voxels. In: SIGGRAPH Asia 2022 Conference Papers (2022)

  8. Gao, H., Li, R., Tulsiani, S., Russell, B., Kanazawa, A.: Monocular dynamic view synthesis: a reality check. In: Advances in Neural Information Processing Systems (2022)

  9. Katsumata, K., Vo, D.M., Nakayama, H.: An efficient 3D Gaussian representation for monocular/multi-view dynamic scenes. arXiv preprint arXiv:2311.12897 (2023)

  10. Kerbl, B., Kopanas, G., Leimkühler, T., Drettakis, G.: 3D Gaussian splatting for real-time radiance field rendering. ACM Trans. Graph. 42(4), 1–14 (2023). https://repo-sam.inria.fr/fungraph/3d-gaussian-splatting/

  11. Kirillov, A., et al.: Segment anything. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4015–4026 (2023)

  12. Kopf, J., Rong, X., Huang, J.B.: Robust consistent video depth estimation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1611–1621 (2021)

  13. Li, Z., Niklaus, S., Snavely, N., Wang, O.: Neural scene flow fields for space-time view synthesis of dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6498–6508 (2021)

  14. Li, Z., Wang, Q., Cole, F., Tucker, R., Snavely, N.: DynIBaR: neural dynamic image-based rendering. arXiv preprint arXiv:2211.11082 (2022)

  15. Li, Z., Wang, X., Liu, X., Jiang, J.: BinsFormer: revisiting adaptive bins for monocular depth estimation. arXiv preprint arXiv:2204.00987 (2022)

  16. Luiten, J., Kopanas, G., Leibe, B., Ramanan, D.: Dynamic 3D Gaussians: tracking by persistent dynamic view synthesis. Preprint (2023)

  17. Luo, X., Huang, J.B., Szeliski, R., Matzen, K., Kopf, J.: Consistent video depth estimation. ACM Trans. Graph. (ToG) 39(4), 71–81 (2020)

  18. Mildenhall, B., Srinivasan, P.P., Tancik, M., Barron, J.T., Ramamoorthi, R., Ng, R.: NeRF: representing scenes as neural radiance fields for view synthesis. Commun. ACM 65(1), 99–106 (2021)

  19. Müller, T., Evans, A., Schied, C., Keller, A.: Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph. (TOG) 41(4), 1–15 (2022)

  20. Park, K., et al.: Nerfies: deformable neural radiance fields. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 5865–5874 (2021)

  21. Park, K., et al.: HyperNeRF: a higher-dimensional representation for topologically varying neural radiance fields. ACM Trans. Graph. 40(6), 1–12 (2021)

  22. Pumarola, A., Corona, E., Pons-Moll, G., Moreno-Noguer, F.: D-NeRF: neural radiance fields for dynamic scenes. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10318–10327 (2021)

  23. Ranftl, R., Lasinger, K., Hafner, D., Schindler, K., Koltun, V.: Towards robust monocular depth estimation: mixing datasets for zero-shot cross-dataset transfer. IEEE Trans. Pattern Anal. Mach. Intell. 44(3), 1623–1637 (2020)

  24. Fridovich-Keil, S., Meanti, G., Warburg, F.R., Recht, B., Kanazawa, A.: K-planes: explicit radiance fields in space, time, and appearance. In: CVPR (2023)

  25. Shao, R., Zheng, Z., Tu, H., Liu, B., Zhang, H., Liu, Y.: Tensor4D: efficient neural 4D decomposition for high-fidelity dynamic reconstruction and rendering. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2023)

  26. Sun, C., Sun, M., Chen, H.T.: Direct voxel grid optimization: super-fast convergence for radiance fields reconstruction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5459–5469 (2022)

  27. Wu, G., et al.: 4D Gaussian splatting for real-time dynamic scene rendering. arXiv preprint arXiv:2310.08528 (2023)

  28. Xie, T., et al.: PhysGaussian: physics-integrated 3D Gaussians for generative dynamics. arXiv preprint arXiv:2311.12198 (2023)

  29. Yang, J., Gao, M., Li, Z., Gao, S., Wang, F., Zheng, F.: Track anything: segment anything meets videos. arXiv preprint arXiv:2304.11968 (2023)

  30. Yang, Z., Gao, X., Zhou, W., Jiao, S., Zhang, Y., Jin, X.: Deformable 3D Gaussians for high-fidelity monocular dynamic scene reconstruction. arXiv preprint arXiv:2309.13101 (2023)

  31. Yin, W., et al.: Metric3D: towards zero-shot metric 3D prediction from a single image. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9043–9053 (2023)

  32. Zhang, Z., Cole, F., Tucker, R., Freeman, W.T., Dekel, T.: Consistent depth of moving objects in video. ACM Trans. Graph. (TOG) 40(4), 1–12 (2021)

  33. Zhou, K., et al.: DynPoint: dynamic neural point for view synthesis. In: Advances in Neural Information Processing Systems, vol. 36 (2024)

Acknowledgments

This work was supported in part by the “Pioneer” and “Leading Goose” R&D Program of Zhejiang (No. 2023C01143), the Anhui Provincial Major Science and Technology Project (No. 202203a05020016), and the National Key R&D Program of China (No. 2022YFB3303400).

Author information

Corresponding author

Correspondence to Zhangjin Huang.

Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Singapore Pte Ltd.

About this paper

Cite this paper

Luo, Y., Huang, Z., Huang, X. (2024). Improving Dynamic 3D Gaussian Splatting from Monocular Videos with Object Motion Information. In: Huang, DS., Pan, Y., Zhang, Q. (eds) Advanced Intelligent Computing Technology and Applications. ICIC 2024. Lecture Notes in Computer Science, vol 14872. Springer, Singapore. https://doi.org/10.1007/978-981-97-5612-4_8

  • DOI: https://doi.org/10.1007/978-981-97-5612-4_8

  • Publisher Name: Springer, Singapore

  • Print ISBN: 978-981-97-5611-7

  • Online ISBN: 978-981-97-5612-4

  • eBook Packages: Computer Science, Computer Science (R0)
