Video prediction: a step-by-step improvement of a video synthesis network

Applied Intelligence

Abstract

Although research on video generation has made progress in network performance and computational efficiency, there is still considerable room for improvement in the number of frames that can be predicted and in their clarity. In this paper, a deep learning model is proposed to predict future video frames. The model can predict video streams with complex pixel distributions for up to 32 frames. Our framework consists mainly of two modules: a fusion image prediction generator and an image-video translator. The fusion image prediction generator is a U-Net built from 3D convolutions, and the image-video translator is a conditional generative adversarial network built from 2D convolutions. Given a set of fusion images and labels, the fusion image prediction generator learns to fit the pixel distribution of the label images from the fusion images. The image-video translator then translates the output of the fusion image prediction generator into future video frames. In addition, this paper proposes an accompanying convolution model and a corresponding algorithm for improving image sharpness. Our experimental results demonstrate the effectiveness of this framework.
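The two-stage data flow described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `predict_video`, `repeat_last_frame`, and `scale_to_unit` are hypothetical names, and the two stand-in functions are trivial placeholders for the 3D U-Net and the 2D conditional GAN. The only point shown is that the first stage consumes a whole clip at once, while the second stage translates each predicted frame independently.

```python
from typing import Callable, List

Frame = List[List[float]]  # one H x W image as nested lists
Clip = List[Frame]         # a sequence of T frames


def predict_video(
    fusion_clip: Clip,
    predictor: Callable[[Clip], Clip],
    translator: Callable[[Frame], Frame],
) -> Clip:
    """Compose the two stages: clip-level prediction, then per-frame translation."""
    # Stage 1: the predictor sees the whole clip (3D convolutions in the paper).
    predicted_fusion = predictor(fusion_clip)
    # Stage 2: the translator maps each predicted frame on its own (2D network).
    return [translator(frame) for frame in predicted_fusion]


def repeat_last_frame(clip: Clip) -> Clip:
    """Hypothetical stand-in predictor: repeats the last input frame."""
    return [[row[:] for row in clip[-1]] for _ in clip]


def scale_to_unit(frame: Frame) -> Frame:
    """Hypothetical stand-in translator: rescales 0..255 intensities to 0..1."""
    return [[value / 255.0 for value in row] for row in frame]


clip = [[[0.0, 51.0]], [[102.0, 255.0]]]  # two 1 x 2 frames
video = predict_video(clip, repeat_last_frame, scale_to_unit)
```

Passing the two stages in as callables keeps the composition separate from the networks themselves, so either stage could be swapped for a trained model without changing `predict_video`.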

Abbreviations

z: Gaussian noise

x: Condition input (in this paper, the edge detection map)

X: Collection of edge detection maps

y: Labels (in this paper, the real samples)

s: Semantic segmentation map

S: Collection of semantic segmentation maps

m: Input fusion image

M: Collection of input fusion images

\( \hat{m} \): Predicted fusion image

\( \hat{M} \): Collection of predicted fusion images

G: Generator function

D: Discriminator function

F(X, Y): Structural similarity function

ϕ: Variance

μ: Mean

w: Weight matrix

h: Height

c: Number of channels

C: Constants

V: Objective function

L: Loss function

n: Batch size

N: Number of samples in the loss function

η: Learning rate

\( {\left\Vert \bullet \right\Vert}_2^2 \): Squared L2 norm

Tr[·]: Trace of a matrix
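The structural similarity function F(X, Y) listed above is built from means, variances, and stabilizing constants. The sketch below uses the standard single-window SSIM definition as an assumption; the exact form used in the paper is not given in this excerpt. The function name `global_ssim` and the defaults `data_range=255.0`, `k1=0.01`, and `k2=0.03` are illustrative choices rather than values from the source.

```python
def global_ssim(x, y, data_range=255.0, k1=0.01, k2=0.03):
    """Single-window SSIM between two equally sized grayscale images.

    x and y are flat sequences of pixel intensities. The statistics are
    computed over the whole image rather than over sliding local windows.
    """
    n = len(x)
    if n == 0 or n != len(y):
        raise ValueError("images must be non-empty and of equal size")

    # Stabilizing constants (the role played by C in the notation list).
    c1 = (k1 * data_range) ** 2
    c2 = (k2 * data_range) ** 2

    # Means, variances, and covariance of the two images.
    mu_x = sum(x) / n
    mu_y = sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov_xy = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n

    numerator = (2 * mu_x * mu_y + c1) * (2 * cov_xy + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


image = [0.0, 255.0, 0.0, 255.0]
inverted = [255.0, 0.0, 255.0, 0.0]
same_score = global_ssim(image, image)
inverted_score = global_ssim(image, inverted)
```

An image compared with itself scores 1, while the inverted image has negative covariance with the original and therefore scores below zero. Practical evaluations usually average SSIM over local windows instead of using one global window.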

Funding

This work is partially supported by the National Natural Science Foundation of China (61461053) and Yunnan University of the China Postgraduate Science Foundation under Grant (2020306).

Author information

Corresponding author

Correspondence to Hongwei Ding.

Cite this article

Jing, B., Ding, H., Yang, Z. et al. Video prediction: a step-by-step improvement of a video synthesis network. Appl Intell 52, 3640–3652 (2022). https://doi.org/10.1007/s10489-021-02500-5
