Abstract
While several works have addressed video generation from short text [1,2,3], they tend to focus on the continuity of the generated frames; far less attention has been paid to story visualization [4], which aims to generate the dynamic scenes and characters described in a detailed, multi-paragraph input text. We propose a novel take on this task that compiles these dynamic scenes into a longer video while improving on the scores of current state-of-the-art models in story visualization and video generation. We employ semantic disentangling connections [5] between our generators to maintain global consistency across consecutive images and to ensure similarity between the video re-description and the input text, leading to higher image quality. Once these key images are generated, we apply a depth-aware video frame interpolation framework [6] to synthesize the missing in-between frames of the video. We evaluate our model on the CLEVR-SV and Pororo-SV datasets for the story visualization task, and on the UCF-101 dataset to measure the quality of the generated video, with the aim of significantly outperforming existing state-of-the-art models.
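The abstract describes a two-stage pipeline: a GAN generates one key frame per story sentence, and a depth-aware interpolator then fills in the frames between consecutive key frames. The following minimal PyTorch sketch illustrates only this control flow under stated assumptions; `StoryGenerator`, `DepthAwareInterpolator`, and all shapes are hypothetical stand-ins, not the paper's actual architecture or API.

```python
# Minimal sketch of the two-stage story-to-video pipeline described above.
# Both modules are toy placeholders: a real system would use a sequential
# conditional GAN with semantic disentangling connections [5] and a
# depth-aware interpolation network such as DAIN [6].
import torch
import torch.nn as nn

class StoryGenerator(nn.Module):
    """Toy stand-in: maps one sentence embedding to one key frame."""
    def __init__(self, text_dim=128, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(text_dim, 3 * img_size * img_size), nn.Tanh())

    def forward(self, sent_emb):                      # (B, text_dim)
        x = self.net(sent_emb)
        return x.view(-1, 3, self.img_size, self.img_size)

class DepthAwareInterpolator(nn.Module):
    """Toy stand-in: linearly blends two key frames. A real model would
    warp pixels along optical flow weighted by estimated depth [6]."""
    def forward(self, f0, f1, t):
        return (1 - t) * f0 + t * f1

def story_to_video(sent_embs, gen, interp, n_between=7):
    """Generate one key frame per sentence, then fill in-between frames."""
    key_frames = [gen(e.unsqueeze(0)) for e in sent_embs]
    video = [key_frames[0]]
    for f0, f1 in zip(key_frames, key_frames[1:]):
        for i in range(1, n_between + 1):
            video.append(interp(f0, f1, i / (n_between + 1)))
        video.append(f1)
    return torch.cat(video)                           # (T, 3, H, W)

if __name__ == "__main__":
    story = torch.randn(4, 128)                       # 4 sentence embeddings
    frames = story_to_video(story, StoryGenerator(), DepthAwareInterpolator())
    print(frames.shape)                               # torch.Size([25, 3, 64, 64])
```

With four sentence embeddings and seven interpolated frames per gap, the sketch emits 1 + 3 × 8 = 25 frames; in the actual method the linear blend would be replaced by depth-weighted flow warping, and the generator by the semantically disentangled GAN.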
References
Kim, D., Joo, D., Kim, J.: TiVGAN: text to image to video generation with step-by-step evolutionary generator. IEEE Access 8, 153113–153122 (2020)
Li, Y., Min, M.R., Shen, D., Carlson, D.E., Carin, L.: Video generation from text. In: AAAI, vol. 2, p. 5 (2018)
Yu, H., Huang, Y., Pi, L., Wang, L.: Recurrent deconvolutional generative adversarial networks with application to text guided video generation. arXiv preprint arXiv:2008.05856 (2020)
Li, Y., et al.: StoryGAN: a sequential conditional GAN for story visualization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6329–6338 (2019)
Yin, G., Liu, B., Sheng, L., Yu, N., Wang, X., Shao, J.: Semantics disentangling for text-to-image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2327–2336 (2019)
Bao, W., Lai, W.-S., Ma, C., Zhang, X., Gao, Z., Yang, M.-H.: Depth-aware video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3703–3712 (2019)
Goodfellow, I.: NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
Pu, Y., et al.: Variational autoencoder for deep learning of images, labels and captions. In: Advances in Neural Information Processing Systems, pp. 2352–2360 (2016)
Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915 (2017)
Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)
Sharma, S., Suhubdy, D., Michalski, V., Kahou, S.E., Bengio, Y.: ChatPainter: improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216 (2018)
Hao, Z., Huang, X., Belongie, S.: Controllable video generation with sparse trajectories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7854–7863 (2018)
Wang, W., Alameda-Pineda, X., Xu, D., Fua, P., Ricci, E., Sebe, N.: Every smile is unique: landmark-guided diverse smile generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7083–7092 (2018)
Rebuffi, S.-A., Bilen, H., Vedaldi, A.: Efficient parametrization of multi-domain deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8119–8127 (2018)
Niklaus, S., Liu, F.: Context-aware synthesis for video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1710 (2018)
Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1164–1172 (2015)
Iashin, V., Rahtu, E.: Multi-modal dense video captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 958–959 (2020)
Acknowledgment
This work is supported by the Singapore Ministry of Education Academic Research grant T1 251RES1812, "Dynamic Hybrid Real-time Rendering with Hardware Accelerated Ray-tracing and Rasterization for Interactive Applications". Special thanks to the National Supercomputing Centre (NSCC) Singapore for providing the computational resources required for training our architecture.