Instance-Aware Scene Layout Forecasting

Published in: International Journal of Computer Vision

A Correction to this article was published on 07 February 2022

Abstract

Forecasting scene layout is of vital importance in many vision applications, e.g., enabling autonomous vehicles to plan actions early. It is a challenging problem, as it requires understanding past scene layouts and the diverse object interactions in the scene, and then forecasting what the scene will look like at a future time. Prior works learn a direct mapping from past pixels to future pixel-wise labels and ignore the underlying object interactions in the scene, resulting in temporally incoherent and averaged predictions. In this paper, we propose a learning framework that forecasts semantic scene layouts (represented by instance maps) from an instance-aware perspective. Specifically, our framework explicitly models the dynamics of individual instances and captures their interactions in a scene. Under this formulation, we are able to enforce instance-level constraints when forecasting scene layouts by effectively reasoning about the spatial and semantic relations among instances. Experimental results show that our model predicts sharper and more accurate future instance maps than baselines and prior methods, yielding state-of-the-art performance on short-term, mid-term, and long-term scene layout forecasting.
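To make the instance-aware formulation concrete, here is a minimal sketch (in PyTorch) of how per-instance dynamics and cross-instance interactions could be modeled. It is illustrative only, not the paper's implementation: the representation of each instance as a per-frame feature vector, and all module names, dimensions, and architectural choices below (a GRU temporal encoder, attention-based interaction, a recurrent decoder) are assumptions.

# Illustrative sketch of instance-aware layout forecasting; NOT the
# authors' implementation. It assumes each instance is summarized by a
# per-frame feature vector (e.g., box coordinates + class embedding).
# All names and dimensions here are hypothetical.
import torch
import torch.nn as nn


class InstanceLayoutForecaster(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128, num_heads=4):
        super().__init__()
        # Per-instance temporal encoder over the observed frames.
        self.temporal = nn.GRU(feat_dim, hidden_dim, batch_first=True)
        # Cross-instance interaction: each instance attends to the others.
        self.interaction = nn.MultiheadAttention(hidden_dim, num_heads,
                                                 batch_first=True)
        # Recurrent decoder producing one future per-instance state per step.
        self.decoder = nn.GRUCell(hidden_dim, hidden_dim)
        self.head = nn.Linear(hidden_dim, feat_dim)

    def forward(self, past, steps=3):
        # past: (num_instances, T_obs, feat_dim) features for one scene.
        _, h = self.temporal(past)          # h: (1, N, hidden_dim)
        state = h.squeeze(0)                # (N, hidden_dim)
        preds = []
        for _ in range(steps):
            # Re-model instance interactions at every forecast step.
            ctx, _ = self.interaction(state.unsqueeze(0),
                                      state.unsqueeze(0),
                                      state.unsqueeze(0))
            state = self.decoder(ctx.squeeze(0), state)
            preds.append(self.head(state))  # per-instance future features
        return torch.stack(preds, dim=1)    # (N, steps, feat_dim)


# Usage: 5 instances observed over 4 frames, forecast 3 future steps.
model = InstanceLayoutForecaster()
future = model(torch.randn(5, 4, 64), steps=3)
print(future.shape)  # torch.Size([5, 3, 64])

In this usage example, five instances observed over four frames yield a (5, 3, 64) tensor: one forecast feature vector per instance per future step, which a downstream head could then rasterize into a future instance map.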





Acknowledgements

This work was supported by the Research Grants Council of Hong Kong (Grant No. 11205620).

Author information

Corresponding author

Correspondence to Rynson W. H. Lau.

Additional information

Communicated by Boxin Shi.


About this article

Cite this article

Qiao, X., Zheng, Q., Cao, Y. et al. Instance-Aware Scene Layout Forecasting. Int J Comput Vis 130, 504–516 (2022). https://doi.org/10.1007/s11263-021-01560-x
