
Long-Text-to-Video-GAN

  • Conference paper
Computer, Communication, and Signal Processing (ICCCSP 2022)

Part of the book series: IFIP Advances in Information and Communication Technology (IFIPAICT, volume 651)


Abstract

While there have been several works on video generation from short text [1,2,3], these tend to focus on the continuity of the generated frames; far less attention has been paid to story visualization [4], which attempts to generate the dynamic scenes and characters described in detail by a multi-paragraph input text. We propose a novel approach to this task that compiles these dynamic scenes into a longer video while also improving on the scores of the current state-of-the-art models in story visualization and video generation. To do so, we use semantic disentangling connections [5] between our generators to maintain global consistency across consecutive images and to enforce similarity between the video re-description and the input text, leading to higher image quality. Once these images are generated, we apply a depth-aware video frame interpolation framework [6] to synthesize the missing intermediate frames between them. We evaluate our model on the CLEVR-SV and Pororo-SV datasets for the story visualization task, and on the UCF-101 dataset to measure the quality of the generated video, aiming to outperform existing state-of-the-art models by a significant margin.
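To make the two-stage pipeline concrete, below is a minimal, hypothetical sketch in PyTorch: a recurrent keyframe generator carries story context from sentence to sentence (standing in for the semantic-disentangling connections between consecutive generators [5]), and an interpolation step then fills in the frames between keyframes. All module names, shapes, and the linear-blend interpolation (a placeholder for the depth-aware interpolation network DAIN [6]) are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn


class KeyframeGenerator(nn.Module):
    """Generates one 64x64 keyframe per sentence of the story.

    A GRU carries story context across sentences, standing in for the
    semantic-disentangling connections between consecutive generators [5].
    Everything here is an illustrative placeholder, not the paper's model.
    """

    def __init__(self, text_dim=128, z_dim=100, hidden_dim=256):
        super().__init__()
        self.z_dim = z_dim
        self.context = nn.GRU(text_dim + z_dim, hidden_dim, batch_first=True)
        self.decode = nn.Sequential(
            nn.Linear(hidden_dim, 512 * 4 * 4),
            nn.ReLU(),
            nn.Unflatten(1, (512, 4, 4)),
            nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),  # 4x4 -> 8x8
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),  # 8x8 -> 16x16
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),   # 16x16 -> 32x32
            nn.ConvTranspose2d(64, 3, 4, 2, 1), nn.Tanh(),     # 32x32 -> 64x64
        )

    def forward(self, sentence_embs):  # (B, S, text_dim)
        B, S, _ = sentence_embs.shape
        # One noise vector per sentence, concatenated with its embedding.
        z = torch.randn(B, S, self.z_dim, device=sentence_embs.device)
        h, _ = self.context(torch.cat([sentence_embs, z], dim=-1))
        return self.decode(h.reshape(B * S, -1)).reshape(B, S, 3, 64, 64)


def interpolate(frames, n_between=3):
    """Fills in n_between frames between consecutive keyframes.

    Linear blending is used purely as a stand-in for the depth-aware
    frame interpolation framework (DAIN [6]) that plugs in here.
    """
    B, S, C, H, W = frames.shape
    clips = []
    for s in range(S - 1):
        a, b = frames[:, s], frames[:, s + 1]
        clips.append(a.unsqueeze(1))
        ts = torch.linspace(0, 1, n_between + 2, device=frames.device)[1:-1]
        clips.append(torch.stack([(1 - t) * a + t * b for t in ts], dim=1))
    clips.append(frames[:, -1].unsqueeze(1))
    return torch.cat(clips, dim=1)  # (B, S + (S-1)*n_between, C, H, W)


if __name__ == "__main__":
    gen = KeyframeGenerator()
    story = torch.randn(2, 5, 128)   # 2 stories, 5 sentence embeddings each
    keyframes = gen(story)           # (2, 5, 3, 64, 64)
    video = interpolate(keyframes)   # (2, 17, 3, 64, 64)
    print(video.shape)
```

With 5 keyframes and 3 interpolated frames per gap, the sketch yields a 17-frame clip; swapping the linear blend for a learned interpolator is what lifts keyframe sequences into smooth video.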


References

  1. Kim, D., Joo, D., Kim, J.: TiVGAN: text to image to video generation with step-by-step evolutionary generator. IEEE Access 8, 153113–153122 (2020)


  2. Li, Y., Min, M.R., Shen, D., Carlson, D.E., Carin, L.: Video generation from text. In: AAAI, vol. 2, p. 5 (2018)


  3. Yu, H., Huang, Y., Pi, L., Wang, L.: Recurrent deconvolutional generative adversarial networks with application to text guided video generation. arXiv preprint arXiv:2008.05856 (2020)

  4. Li, Y., et al.: StoryGAN: a sequential conditional GAN for story visualization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2019)

  5. Yin, G., Liu, B., Sheng, L., Yu, N., Wang, X., Shao, J.: Semantics disentangling for text-to-image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2327–2336 (2019)


  6. Bao, W., Lai, W.-S., Ma, C., Zhang, X., Gao, Z., Yang, M.-H.: Depth-aware video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3703–3712 (2019)


  7. Goodfellow, I.: NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)

  8. Pu, Y., et al.: Variational autoencoder for deep learning of images, labels and captions. In: Advances in Neural Information Processing Systems, pp. 2352–2360 (2016)


  9. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)


  10. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915 (2017)


  11. Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)


  12. Sharma, S., Suhubdy, D., Michalski, V., Kahou, S.E., Bengio, Y.: ChatPainter: improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216 (2018)

  13. Hao, Z., Huang, X., Belongie, S.: Controllable video generation with sparse trajectories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7854–7863 (2018)


  14. Wang, W., Alameda-Pineda, X., Xu, D., Fua, P., Ricci, E., Sebe, N.: Every smile is unique: landmark-guided diverse smile generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7083–7092 (2018)


  15. Rebuffi, S.-A., Bilen, H., Vedaldi, A.: Efficient parametrization of multi-domain deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8119–8127 (2018)


  16. Niklaus, S., Liu, F.: Context-aware synthesis for video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1710 (2018)


  17. Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1164–1172 (2015)


  18. Iashin, V., Rahtu, E.: Multi-modal dense video captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 958–959 (2020)



Acknowledgment

This work is supported by the Singapore Ministry of Education Academic Research grant T1 251RES1812, “Dynamic Hybrid Real-time Rendering with Hardware Accelerated Ray-tracing and Rasterization for Interactive Applications”. Special thanks to the National Supercomputing Centre (NSCC) Singapore for providing the computational resources required for training our architecture.

Author information


Corresponding author

Correspondence to Ayman Talkani.



Copyright information

© 2022 IFIP International Federation for Information Processing

About this paper


Cite this paper

Talkani, A., Bhojan, A. (2022). Long-Text-to-Video-GAN. In: Neuhold, E.J., Fernando, X., Lu, J., Piramuthu, S., Chandrabose, A. (eds) Computer, Communication, and Signal Processing. ICCCSP 2022. IFIP Advances in Information and Communication Technology, vol 651. Springer, Cham. https://doi.org/10.1007/978-3-031-11633-9_8


  • DOI: https://doi.org/10.1007/978-3-031-11633-9_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-11632-2

  • Online ISBN: 978-3-031-11633-9

  • eBook Packages: Computer Science, Computer Science (R0)
