Two-Channel VAE-GAN Based Image-To-Video Translation

Wang, Shengli; Xieshi, Mulin; Zhou, Zhangpeng; Zhang, Xiang; Liu, Xujie; Tang, Zeyi; Dai, Yuxing; Xu, Xuexin; Lin, Pingyuan

doi:10.1007/978-3-031-13870-6_36

Conference paper
First Online: 15 August 2022

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13393)

Abstract

We propose a VAE-GAN network with a two-channel decoder for multiple image-to-video translation tasks, i.e., generating videos of several different categories with a single model. We treat image-to-video translation as a video generation task rather than a video prediction task, which requires multiple frames as input. After training, the model needs only the first frame of a video and its corresponding attribute to generate the required video. Combining the Variational Autoencoder (VAE) and the Generative Adversarial Network (GAN) avoids the shortcomings of each: the blur that VAE components can introduce and the unstable gradients caused by the GAN. Extensive qualitative and quantitative experiments are conducted on the MUG [1] dataset. From this empirical study we conclude that, compared with state-of-the-art approaches, our approach (VAE-GAN) exhibits significant improvements in generative capability.
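
As an illustration of how VAE and GAN terms are commonly combined in a single generator objective, the following is a minimal NumPy sketch. It is not the paper's implementation: the L1 reconstruction term, the non-saturating adversarial term, the loss weights, and all function names are assumptions chosen for the example, and the abstract does not specify any of them.

```python
import numpy as np


def reparameterize(mu, logvar, rng):
    """Sample z = mu + sigma * eps with eps ~ N(0, I) (the VAE reparameterization trick)."""
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps


def kl_to_standard_normal(mu, logvar):
    """KL(N(mu, sigma^2) || N(0, I)), summed over latent dims, averaged over the batch."""
    per_sample = -0.5 * np.sum(1.0 + logvar - mu**2 - np.exp(logvar), axis=1)
    return float(np.mean(per_sample))


def l1_reconstruction(video, target):
    """Mean absolute error between generated and ground-truth frames (assumed choice)."""
    return float(np.mean(np.abs(video - target)))


def generator_adversarial_loss(d_fake_prob, eps=1e-8):
    """Non-saturating GAN generator loss, -log D(G(.)) (assumed choice)."""
    return float(-np.mean(np.log(d_fake_prob + eps)))


def vae_gan_generator_loss(video, target, mu, logvar, d_fake_prob,
                           w_rec=1.0, w_kl=0.1, w_adv=0.1):
    """Weighted sum of the three terms; the weights are hypothetical."""
    return (w_rec * l1_reconstruction(video, target)
            + w_kl * kl_to_standard_normal(mu, logvar)
            + w_adv * generator_adversarial_loss(d_fake_prob))
```

In a sketch like this, the reconstruction and KL terms supply a stable training signal, while the adversarial term pushes generated frames toward sharper, more realistic output; this is the usual motivation for pairing the two models.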

References

1. Aifanti, N., Papachristou, C., Delopoulos, A.: The MUG facial expression database. In: 11th International Workshop on Image Analysis for Multimedia Interactive Services WIAMIS 10, pp. 1–4 (2010)
2. Arjovsky, M., Bottou, L.: Towards principled methods for training generative adversarial networks. In: 5th International Conference on Learning Representations, ICLR, Toulon, France, 24–26 April 2017 (2017)
3. Babaeizadeh, M., Finn, C., Erhan, D., Campbell, R.H., Levine, S.: Stochastic variational video prediction. In: 6th International Conference on Learning Representations, ICLR 2018 (2018)
4. Baltrusaitis, T., Robinson, P., Morency, L.: Openface: an open source facial behavior analysis toolkit. In: 2016 IEEE Winter Conference on Applications of Computer Vision, WACV, Lake Placid, NY, USA, 7–10 March 2016, pp. 1–10 (2016)
5. Carreira, J., Zisserman, A.: Quo vadis, action recognition? A new model and the kinetics dataset. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017, pp. 4724–4733 (2017)
6. Fan, L., Huang, W., Gan, C., Huang, J., Gong, B.: Controllable image-to-video translation: a case study on facial expression generation. In: The Thirty-Third AAAI Conference on Artificial Intelligence, AAAI 2019, pp. 3510–3517 (2019)
7. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27, pp. 2672–2680. Curran Associates, Inc. (2014)
8. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, vol. 30, pp. 6626–6637 (2017)
9. Johnson, J., Alahi, A., Fei-Fei, L.: Perceptual losses for real-time style transfer and super-resolution. Computer Vision – ECCV 2016, pp. 694–711 (2016). https://doi.org/10.1007/978-3-319-46475-6_43
10. Kingma, D.P., Ba, J.: Adam: a method for stochastic optimization. In: 3rd International Conference on Learning Representations, ICLR, San Diego, CA, USA, 7–9 May 2015 (2015)
11. Lee, A.X., Zhang, R., Ebert, F., Abbeel, P., Finn, C., Levine, S.: Stochastic adversarial video prediction. CoRR (2018)
12. Li, Y., Fang, C., Yang, J., Wang, Z., Lu, X., Yang, M.: Flow-grounded spatial-temporal video prediction from still images. In: Computer Vision - ECCV 2018 - 15th European Conference, pp. 609–625 (2018). https://doi.org/10.1007/978-3-030-01240-3_37
13. Li, Y., Min, M.R., Shen, D., Carlson, D.E., Carin, L.: Video generation from text. In: McIlraith, S.A., Weinberger, K.Q. (eds.) Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18), pp. 7065–7072 (2018)
14. Mao, X., Li, Q., Xie, H., Lau, R.Y.K., Wang, Z., Smolley, S.P.: Least squares generative adversarial networks. In: IEEE International Conference on Computer Vision, ICCV 2017, pp. 2813–2821 (2017)
15. Nam, S., Ma, C., Chai, M., Brendel, W., Xu, N., Kim, S.J.: End-to-end time-lapse video synthesis from a single outdoor image. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Long Beach, CA, USA, 16–20 June 2019, pp. 1409–1418 (2019)
16. Pan, J., et al.: Video generation from single semantic label map. In: IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, pp. 3733–3742 (2019)
17. Ronneberger, O., Fischer, P., Brox, T.: U-net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention (MICCAI), pp. 234–241 (2015)
18. Saito, M., Matsumoto, E., Saito, S.: Temporal generative adversarial nets with singular value clipping. In: IEEE International Conference on Computer Vision, ICCV, Venice, Italy, 22–29 October 2017, pp. 2849–2858 (2017)
19. Salimans, T., et al.: Improved techniques for training GANs. In: Advances in Neural Information Processing Systems, vol. 29, pp. 2234–2242 (2016)
20. Shen, G., et al.: Facial image-to-video translation by a hidden affine transformation. In: Proceedings of the 27th ACM International Conference on Multimedia, pp. 2505–2513 (2019)
21. Shi, X., Chen, Z., Wang, H., Yeung, D.Y., Wong, W.K., Woo, W.C.: Convolutional LSTM network: a machine learning approach for precipitation nowcasting. In: Advances in Neural Information Processing Systems, vol. 28, pp. 802–810 (2015)
22. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition, pp. 1–14. Computational and Biological Learning Society (2015)
23. Tulyakov, S., Liu, M., Yang, X., Kautz, J.: Mocogan: decomposing motion and content for video generation. In: 2018 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Salt Lake City, UT, USA, 18–22 June 2018, pp. 1526–1535 (2018)
24. Vondrick, C., Pirsiavash, H., Torralba, A.: Generating videos with scene dynamics. In: Advances in Neural Information Processing Systems, vol. 29, pp. 613–621. Curran Associates, Inc. (2016)
25. Walker, J., Doersch, C., Gupta, A., Hebert, M.: An uncertain future: forecasting from static images using variational autoencoders. In: Computer Vision – ECCV 2016 - 14th European Conference, pp. 835–851 (2016). https://doi.org/10.1007/978-3-319-46478-7_51
26. Wang, T.C., et al.: Video-to-video synthesis. In: Advances in Neural Information Processing Systems, vol. 31, pp. 1144–1156. Curran Associates, Inc. (2018)
27. Wang, T., Cheng, Y., Lin, C.H., Chen, H., Sun, M.: Point-to-point video generation. In: 2019 IEEE/CVF International Conference on Computer Vision, ICCV, Seoul, Korea (South), 27 October–2 November 2019, pp. 10490–10499 (2019)
28. Xie, S., Girshick, R.B., Dollár, P., Tu, Z., He, K.: Aggregated residual transformations for deep neural networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR, Honolulu, HI, USA, 21–26 July 2017, pp. 5987–5995 (2017)
29. Xue, T., Wu, J., Bouman, K., Freeman, B.: Visual dynamics: probabilistic future frame synthesis via cross convolutional networks. In: Advances in Neural Information Processing Systems, vol. 29, pp. 91–99. Curran Associates, Inc. (2016)
30. Zhang, C., Peng, Y.: Stacking VAE and GAN for context-aware text-to-image generation. In: Fourth IEEE International Conference on Multimedia Big Data, BigMM, Xi’an, China, 13–16 September 2018, pp. 1–5 (2018)

Author information

Authors and Affiliations

Maintenance Company of State Grid Power Company in Gansu Province, Lanzhou, 730000, Gansu, China
Shengli Wang, Zhangpeng Zhou & Xujie Liu
State Grid Info-Telecom Great Power Science and Technology Co., LTD., Fuzhou, 350000, China
Mulin Xieshi, Xiang Zhang & Zeyi Tang
School of Informatics, Xiamen University, Xiamen, 361005, China
Yuxing Dai, Xuexin Xu & Pingyuan Lin

Corresponding author

Correspondence to Pingyuan Lin.

Editor information

Editors and Affiliations

Tongji University, Shanghai, China
De-Shuang Huang
University of Ulsan, Ulsan, Korea (Republic of)
Kang-Hyun Jo
Xi'an Polytechnic University, Xi'an, China
Junfeng Jing
The University of Wollongong, North Wollongong, NSW, Australia
Prashan Premaratne
Polytechnic of Bari, Bari, Italy
Vitoantonio Bevilacqua
Liverpool John Moores University, Liverpool, UK
Abir Hussain

About this paper

Cite this paper

Wang, S. et al. (2022). Two-Channel VAE-GAN Based Image-To-Video Translation. In: Huang, DS., Jo, KH., Jing, J., Premaratne, P., Bevilacqua, V., Hussain, A. (eds) Intelligent Computing Theories and Application. ICIC 2022. Lecture Notes in Computer Science, vol 13393. Springer, Cham. https://doi.org/10.1007/978-3-031-13870-6_36

DOI: https://doi.org/10.1007/978-3-031-13870-6_36
Published: 15 August 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-13869-0
Online ISBN: 978-3-031-13870-6
eBook Packages: Computer Science, Computer Science (R0)
