Abstract
Forecasting scene layout is of vital importance in many vision applications, e.g., enabling autonomous vehicles to plan actions early. It is a challenging problem, as it requires understanding past scene layouts and the diverse object interactions in the scene, and then forecasting what the scene will look like at a future time. Prior works learn a direct mapping from past pixels to future pixel-wise labels and ignore the underlying object interactions in the scene, resulting in temporally incoherent and averaged predictions. In this paper, we propose a learning framework that forecasts semantic scene layouts (represented by instance maps) from an instance-aware perspective. Specifically, our framework explicitly models the dynamics of individual instances and captures their interactions in a scene. Under this formulation, we are able to enforce instance-level constraints when forecasting scene layouts by effectively reasoning about the instances' spatial and semantic relations. Experimental results show that our model predicts sharper and more accurate future instance maps than the baselines and prior methods, yielding state-of-the-art performance on short-term, mid-term, and long-term scene layout forecasting.
Change history
07 February 2022
A Correction to this paper has been published: https://doi.org/10.1007/s11263-022-01577-w
Acknowledgements
This work was supported by the Research Grants Council of Hong Kong (Grant No. 11205620).
Additional information
Communicated by Boxin Shi.
Cite this article
Qiao, X., Zheng, Q., Cao, Y. et al. Instance-Aware Scene Layout Forecasting. Int J Comput Vis 130, 504–516 (2022). https://doi.org/10.1007/s11263-021-01560-x