Towards Image-to-Video Translation: A Structure-Aware Approach via Multi-stage Generative Adversarial Networks

Published in: International Journal of Computer Vision

Abstract

In this paper, we consider the problem of image-to-video translation, where one input image or a set of input images is translated into an output video containing the motion of a single object. In particular, we focus on predicting motions conditioned on high-level structures, such as facial expression and human pose. Recent approaches are either condition-driven or temporal-based. Condition-driven approaches typically train transformation networks to generate future frames conditioned on a predicted structural sequence. Temporal-based approaches, on the other hand, have shown that short high-quality motions can be generated by 3D convolutional networks with temporal knowledge learned from massive training data. In this work, we combine the benefits of both approaches and propose a two-stage generative framework in which videos are first forecast from the structural sequence and then refined by temporal signals. To model motions more efficiently in the forecasting stage, we train densely connected networks to learn residual motions between the current and future frames, which avoids learning motion-irrelevant details. To ensure temporal consistency in the refining stage, we adopt a ranking loss for adversarial training. We conduct extensive experiments on two image-to-video translation tasks: facial expression retargeting and human pose forecasting. Superior results over the state of the art on both tasks demonstrate the effectiveness of our approach.
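To make the two-stage idea described above concrete, the sketch below (not the authors' released code) shows how a forecasting network can predict only a residual motion on top of the current frame, conditioned on a structure map, and how a margin-based ranking term could be used during adversarial refinement. All layer sizes, the use of pose heatmaps as the structural condition, and the exact form of the ranking term are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ResidualForecaster(nn.Module):
    """Stage 1 (sketch): forecast a future frame by predicting only the residual
    motion on top of the current frame, conditioned on a target structure map."""

    def __init__(self, img_ch=3, struct_ch=16, feat=64):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(img_ch + struct_ch, feat, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(feat, feat, 3, padding=1), nn.ReLU(inplace=True),
        )
        self.to_residual = nn.Conv2d(feat, img_ch, 3, padding=1)

    def forward(self, frame, structure):
        # The skip connection (frame + residual) lets the network ignore
        # motion-irrelevant appearance details and model the motion only.
        residual = self.to_residual(self.encoder(torch.cat([frame, structure], dim=1)))
        return torch.tanh(frame + residual)


def ranking_loss(d_real, d_refined, d_coarse, margin=1.0):
    """Stage 2 (sketch): a margin-based ranking term asking a temporal
    discriminator to score real clips above refined clips, and refined clips
    above the coarse stage-1 forecasts. This is one hedged reading of the
    ranking loss mentioned in the abstract, not the paper's exact formulation."""
    return (F.relu(margin - (d_real - d_refined)).mean()
            + F.relu(margin - (d_refined - d_coarse)).mean())


if __name__ == "__main__":
    frame = torch.randn(2, 3, 64, 64)        # current RGB frame, values in [-1, 1]
    structure = torch.randn(2, 16, 64, 64)   # e.g. pose/landmark heatmaps for a future step
    print(ResidualForecaster()(frame, structure).shape)  # torch.Size([2, 3, 64, 64])
```

The paper's forecasting networks also use dense connections between layers; the sketch omits them for brevity.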



Acknowledgements

This work is partly supported by NSF Awards 1763523, 1747778, 1733843, and 1703883. It was also funded in part by grant BAAAFOSR-2013-0001 to Dimitris N. Metaxas. Mubbasir Kapadia has been funded in part by NSF IIS-1703883, NSF S&AS-1723869, and DARPA SocialSim-W911NF-17-C-0098.

Author information

Corresponding author

Correspondence to Long Zhao.

Additional information

Communicated by Jun-Yan Zhu, Hongsheng Li, Eli Shechtman, Ming-Yu Liu, Jan Kautz, Antonio Torralba.



About this article

Cite this article

Zhao, L., Peng, X., Tian, Y. et al. Towards Image-to-Video Translation: A Structure-Aware Approach via Multi-stage Generative Adversarial Networks. Int J Comput Vis 128, 2514–2533 (2020). https://doi.org/10.1007/s11263-020-01328-9

