Image-to-Video Translation Using a VAE-GAN with Refinement Network

Wang, Shengli; Xieshi, Mulin; Zhou, Zhangpeng; Zhang, Xiang; Liu, Xujie; Tang, Zeyi; Xiahou, Jianbing; Lin, Pingyuan; Xu, Xuexin; Dai, Yuxing

doi:10.1007/978-3-031-13870-6_42

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13393)

Included in the following conference series:

International Conference on Intelligent Computing

Abstract

With advances in deep learning, many image-processing techniques have emerged in computer vision in recent years, and they perform well in a wide variety of application scenarios.

The video prediction task takes multiple consecutive frames before and after a gap as input and predicts the missing frames in between. The image-to-video generation task proposed in this paper differs in that it does not require multiple consecutive frames: the only inputs are the first frame and an embedding vector of motion features, and the network generates image content that follows the specified motion. To address problems in the videos produced by existing methods, such as incoherent motion, dropped frames, and blurring, we propose a new network architecture.
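
To make this task definition concrete, the sketch below shows one way the conditioning could be assembled. It assumes the first frame has already been encoded (the encoder is not shown) into the mean and log-variance of a Gaussian latent. A code z is drawn with the standard VAE reparameterization trick, and z is then concatenated with the motion embedding and a normalized time step to form one generator input per output frame. The function names, the one-hot motion vector, and the time-step feature are illustrative assumptions and do not describe the authors' actual architecture.

```python
import math
import random
from typing import List, Sequence


def reparameterize(mu: Sequence[float], logvar: Sequence[float],
                   rng: random.Random) -> List[float]:
    """Standard VAE reparameterization trick: z = mu + sigma * eps, eps ~ N(0, 1).

    sigma is recovered from the log-variance as exp(0.5 * logvar).
    """
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, logvar)]


def build_frame_conditions(z: Sequence[float], motion: Sequence[float],
                           num_frames: int) -> List[List[float]]:
    """Hypothetical per-frame generator input: [z ; motion ; normalized time step].

    All frames share the appearance code z (from the first frame) and the
    motion embedding. Only the time feature in [0, 1] changes, so no other
    ground-truth frames are needed at inference time.
    """
    if num_frames < 2:
        raise ValueError("num_frames must be at least 2")
    return [list(z) + list(motion) + [t / (num_frames - 1)]
            for t in range(num_frames)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed encoder output for the first frame (the encoder itself is not shown).
    z = reparameterize(mu=[0.0, 1.0], logvar=[0.0, -2.0], rng=rng)
    # Assumed one-hot motion embedding, e.g. action class 0 of 3.
    conditions = build_frame_conditions(z, motion=[1.0, 0.0, 0.0], num_frames=4)
    print(len(conditions), len(conditions[0]))  # 4 6  (each frame: 2 + 3 + 1 values)
```

Sharing a single latent code across all frames is one simple way to keep appearance consistent over time. The time feature is what lets the generator produce a different frame at each step without seeing any frames other than the first.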

For multiple image-to-video translation tasks, we propose VAE-RGAN, a VAE-GAN network extended with a refinement network. The refinement network, together with a new identity matching loss and a connected feature matching loss, is designed to overcome the respective shortcomings of the VAE and the GAN and to improve the visual quality of the generated videos. We conducted a wide range of qualitative and quantitative experiments on the Weizmann dataset and draw the following conclusions from this empirical study: (1) compared with state-of-the-art approaches, VAE-RGAN shows significant improvements in generative capability; (2) the VAE-RGAN structure achieves better results, and the refinement network significantly reduces blur in the generated frames.
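
The abstract does not define the individual losses. The following is therefore only a sketch of how the terms of a VAE-GAN objective are commonly combined, written in plain Python over small feature vectors. The KL term and the least-squares adversarial term are standard formulations. `feature_matching_loss` uses the common mean-feature L2 form, not the paper's connected feature matching loss. `identity_matching_loss` is a hypothetical stand-in that compares identity features of the input frame with those of each generated frame. The weights in `DEFAULT_WEIGHTS` are invented for illustration.

```python
import math
from typing import Dict, List, Sequence

Vector = Sequence[float]

# Hypothetical loss weights, chosen only for illustration.
DEFAULT_WEIGHTS: Dict[str, float] = {
    "adv": 1.0, "recon": 10.0, "kl": 0.01, "fm": 1.0, "id": 1.0,
}


def kl_divergence(mu: Vector, logvar: Vector) -> float:
    """KL(N(mu, diag(exp(logvar))) || N(0, I)), the usual VAE prior term."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def l1(a: Vector, b: Vector) -> float:
    """Mean absolute difference; used here as the reconstruction term."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def _mean(batch: Sequence[Vector]) -> List[float]:
    """Element-wise mean over a batch of equal-length feature vectors."""
    return [sum(col) / len(batch) for col in zip(*batch)]


def feature_matching_loss(real_feats: Sequence[Vector],
                          fake_feats: Sequence[Vector]) -> float:
    """Generic feature matching: squared L2 between mean discriminator features."""
    return sum((r - f) ** 2 for r, f in zip(_mean(real_feats), _mean(fake_feats)))


def identity_matching_loss(first_id: Vector, generated_ids: Sequence[Vector]) -> float:
    """Hypothetical stand-in: mean L1 between the input frame's identity
    features and those of every generated frame."""
    return sum(l1(first_id, g) for g in generated_ids) / len(generated_ids)


def lsgan_generator_loss(fake_scores: Vector) -> float:
    """Least-squares GAN generator term: mean of (D(G(x)) - 1)^2."""
    return sum((s - 1.0) ** 2 for s in fake_scores) / len(fake_scores)


def total_generator_loss(terms: Dict[str, float],
                         weights: Dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted sum of the individual loss terms, keyed by name."""
    return sum(weights[name] * value for name, value in terms.items())


if __name__ == "__main__":
    terms = {
        "adv": lsgan_generator_loss([1.0, 0.0]),                              # 0.5
        "recon": l1([0.0, 0.0], [0.1, 0.1]),                                  # 0.1
        "kl": kl_divergence([1.0], [0.0]),                                    # 0.5
        "fm": feature_matching_loss([[1.0, 2.0], [3.0, 4.0]],
                                    [[2.0, 3.0], [2.0, 3.0]]),                # 0.0
        "id": identity_matching_loss([1.0, 1.0], [[1.0, 1.0], [0.0, 0.0]]),  # 0.5
    }
    print(round(total_generator_loss(terms), 3))  # 2.005
```

In a real model these scalars would be tensors produced by the encoder, generator, discriminator, and an identity feature extractor. The weighted sum would then be backpropagated through the generator and the refinement network.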

Author information

Authors and Affiliations

Maintenance Company of State Grid Power Company in Gansu Province, Lanzhou, Gansu, 730000, China
Shengli Wang & Zhangpeng Zhou
State Grid Info-Telecom Great Power Science and Technology Co., Ltd., Fuzhou, 350000, China
Mulin Xieshi, Xiang Zhang, Xujie Liu & Zeyi Tang
School of Informatics, Xiamen University, Xiamen, 361005, China
Jianbing Xiahou, Pingyuan Lin, Xuexin Xu & Yuxing Dai

Corresponding author

Correspondence to Yuxing Dai.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
De-Shuang Huang
University of Ulsan, Ulsan, Korea (Republic of)
Kang-Hyun Jo
Xi'an Polytechnic University, Xi'an, China
Junfeng Jing
The University of Wollongong, North Wollongong, NSW, Australia
Prashan Premaratne
Polytechnic of Bari, Bari, Italy
Vitoantonio Bevilacqua
Liverpool John Moores University, Liverpool, UK
Abir Hussain

About this paper

Cite this paper

Wang, S. et al. (2022). Image-to-Video Translation Using a VAE-GAN with Refinement Network. In: Huang, DS., Jo, KH., Jing, J., Premaratne, P., Bevilacqua, V., Hussain, A. (eds) Intelligent Computing Theories and Application. ICIC 2022. Lecture Notes in Computer Science, vol 13393. Springer, Cham. https://doi.org/10.1007/978-3-031-13870-6_42

DOI: https://doi.org/10.1007/978-3-031-13870-6_42
Published: 15 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13869-0
Online ISBN: 978-3-031-13870-6
eBook Packages: Computer Science, Computer Science (R0)