Abstract
Forecasting scene layout is of vital importance in many vision applications, e.g., enabling autonomous vehicles to plan actions early. It is a challenging problem, as it requires understanding past scene layouts and the diverse object interactions in the scene, and then forecasting what the scene will look like at a future time. Prior works learn a direct mapping from past pixels to future pixel-wise labels and ignore the underlying object interactions in the scene, resulting in temporally incoherent and averaged predictions. In this paper, we propose a learning framework that forecasts semantic scene layouts (represented by instance maps) from an instance-aware perspective. Specifically, our framework explicitly models the dynamics of individual instances and captures their interactions in a scene. Under this formulation, we are able to enforce instance-level constraints when forecasting scene layouts by effectively reasoning about the instances' spatial and semantic relations. Experimental results show that our model predicts sharper and more accurate future instance maps than the baselines and prior methods, yielding state-of-the-art performance on short-term, mid-term, and long-term scene layout forecasting.
Change history
07 February 2022
A Correction to this paper has been published: https://doi.org/10.1007/s11263-022-01577-w
Acknowledgements
This work was supported by the Research Grants Council of Hong Kong (Grant No. 11205620).
Additional information
Communicated by Boxin Shi.
Cite this article
Qiao, X., Zheng, Q., Cao, Y. et al. Instance-Aware Scene Layout Forecasting. Int J Comput Vis 130, 504–516 (2022). https://doi.org/10.1007/s11263-021-01560-x