Abstract
The success of deep learning-based techniques in solving various computer vision problems has motivated researchers to apply deep learning to predicting the optical flow of a video's next frame. However, predicting the motion of an object over the next few frames remains a less explored, unsolved problem. Given a sequence of frames, predicting the motion in the next few frames becomes difficult when the displacement of the optical flow vectors across frames is large. Traditional CNNs often fail to learn the dynamics of objects across frames when the objects undergo large displacements between consecutive frames. In this paper, we present an efficient CNN based on the feature-pyramid concept for extracting spatial features from a few consecutive frames. The spatial features extracted from consecutive frames by a modified PWC-Net architecture are fed into a bidirectional LSTM to obtain temporal features. The proposed spatiotemporal feature pyramid is able to capture the abrupt motion of moving objects in video, especially when the displacement of an object across consecutive frames is large. Furthermore, the proposed spatiotemporal pyramidal features can effectively predict the optical flow in the next few frames, rather than only the next frame. The proposed method of predicting optical flow outperforms the state of the art on challenging datasets such as “MPI Sintel Final Pass,” “Monkaa,” and “Flying Chairs,” where abrupt and large displacements of moving objects across consecutive frames are the main challenge.
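The pipeline described in the abstract — per-frame pyramid features, a bidirectional recurrence over time, then a head that regresses flow for several future frames — can be sketched in a toy form. The snippet below is an illustrative NumPy mock-up, not the paper's implementation: average pooling stands in for the learned PWC-Net-style pyramid encoder, a plain tanh bidirectional RNN stands in for the bidirectional LSTM, and all weights are random rather than trained.

```python
import numpy as np

rng = np.random.default_rng(0)

def pyramid_features(frame, levels=3):
    """Build a simple feature pyramid by repeated 2x average pooling
    (a stand-in for the learned pyramid of a PWC-Net-style encoder)."""
    feats, f = [], frame
    for _ in range(levels):
        h, w = f.shape
        f = f[: h // 2 * 2, : w // 2 * 2].reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))
        feats.append(f.ravel())
    return np.concatenate(feats)

def bidirectional_rnn(xs, hidden=16):
    """Minimal bidirectional tanh RNN over per-frame feature vectors
    (a stand-in for the bidirectional LSTM used for temporal features)."""
    d = xs.shape[1]
    Wf, Uf = rng.normal(0, 0.1, (hidden, d)), rng.normal(0, 0.1, (hidden, hidden))
    Wb, Ub = rng.normal(0, 0.1, (hidden, d)), rng.normal(0, 0.1, (hidden, hidden))
    hf, hb = np.zeros(hidden), np.zeros(hidden)
    for x in xs:            # forward pass over time
        hf = np.tanh(Wf @ x + Uf @ hf)
    for x in xs[::-1]:      # backward pass over time
        hb = np.tanh(Wb @ x + Ub @ hb)
    return np.concatenate([hf, hb])   # summary of both directions

# Toy pipeline: 4 consecutive 32x32 frames -> pyramid features ->
# bidirectional temporal encoding -> dense (u, v) flow for the next K frames.
frames = rng.normal(size=(4, 32, 32))
feats = np.stack([pyramid_features(f) for f in frames])    # (4, D)
temporal = bidirectional_rnn(feats)                        # (2 * hidden,)
K, H, W = 2, 32, 32
head = rng.normal(0, 0.01, (K * 2 * H * W, temporal.size)) # untrained linear head
future_flow = (head @ temporal).reshape(K, 2, H, W)        # K frames of (u, v) flow
print(future_flow.shape)
```

The point of the sketch is the data flow: spatial pyramid features are computed independently per frame, while the bidirectional recurrence aggregates them across time so that a single temporal code can condition the flow prediction for several future frames at once.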
Acknowledgements
The authors wish to thank Nvidia for providing the TITAN Xp GPU that was used to perform the experiments in this study. Funding for the GPU was provided by Nvidia (Grant No. 0322218017192).
Wadhwa, L., Mukherjee, S. Learnable spatiotemporal feature pyramid for prediction of future optical flow in videos. Machine Vision and Applications 32, 18 (2021). https://doi.org/10.1007/s00138-020-01145-7