Abstract
While several works have addressed video generation from short text [1,2,3], they tend to focus on the continuity of the generated frames; far less attention has been paid to story visualization [4], which aims to generate the dynamic scenes and characters described in a detailed, multi-paragraph input text. We propose a novel take on this task that compiles these dynamic scenes into a longer video while improving on the scores of current state-of-the-art models in story visualization and video generation. We employ semantic disentangling connections [5] between our generators to maintain global consistency across consecutive images and to ensure similarity between the video re-description and the input text, leading to higher image quality. Once these key images are generated, we apply a depth-aware video frame interpolation framework [6] to synthesize the missing in-between frames of the video. We evaluate our model on the CLEVR-SV and Pororo-SV datasets for the story visualization task, and on the UCF-101 dataset to measure the quality of the generated video, with the aim of significantly outperforming existing state-of-the-art models.
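The abstract describes a two-stage pipeline: a GAN generates one key frame per story sentence, and a depth-aware interpolator then fills in the frames between consecutive key frames. The following minimal PyTorch sketch illustrates only this control flow under stated assumptions; `StoryGenerator`, `DepthAwareInterpolator`, and all shapes are hypothetical stand-ins, not the paper's actual architecture or API.

```python
# Minimal sketch of the two-stage story-to-video pipeline described above.
# Both modules are toy placeholders: a real system would use a sequential
# conditional GAN with semantic disentangling connections [5] and a
# depth-aware interpolation network such as DAIN [6].
import torch
import torch.nn as nn

class StoryGenerator(nn.Module):
    """Toy stand-in: maps one sentence embedding to one key frame."""
    def __init__(self, text_dim=128, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.net = nn.Sequential(
            nn.Linear(text_dim, 3 * img_size * img_size), nn.Tanh())

    def forward(self, sent_emb):                      # (B, text_dim)
        x = self.net(sent_emb)
        return x.view(-1, 3, self.img_size, self.img_size)

class DepthAwareInterpolator(nn.Module):
    """Toy stand-in: linearly blends two key frames. A real model would
    warp pixels along optical flow weighted by estimated depth [6]."""
    def forward(self, f0, f1, t):
        return (1 - t) * f0 + t * f1

def story_to_video(sent_embs, gen, interp, n_between=7):
    """Generate one key frame per sentence, then fill in-between frames."""
    key_frames = [gen(e.unsqueeze(0)) for e in sent_embs]
    video = [key_frames[0]]
    for f0, f1 in zip(key_frames, key_frames[1:]):
        for i in range(1, n_between + 1):
            video.append(interp(f0, f1, i / (n_between + 1)))
        video.append(f1)
    return torch.cat(video)                           # (T, 3, H, W)

if __name__ == "__main__":
    story = torch.randn(4, 128)                       # 4 sentence embeddings
    frames = story_to_video(story, StoryGenerator(), DepthAwareInterpolator())
    print(frames.shape)                               # torch.Size([25, 3, 64, 64])
```

With four sentence embeddings and seven interpolated frames per gap, the sketch emits 1 + 3 × 8 = 25 frames; in the actual method the linear blend would be replaced by depth-weighted flow warping, and the generator by the semantically disentangled GAN.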
References
Kim, D., Joo, D., Kim, J.: TiVGAN: text to image to video generation with step-by-step evolutionary generator. IEEE Access 8, 153113–153122 (2020)
Li, Y., Min, M.R., Shen, D., Carlson, D.E., Carin, L.: Video generation from text. In: AAAI, vol. 2, p. 5 (2018)
Yu, H., Huang, Y., Pi, L., Wang, L.: Recurrent deconvolutional generative adversarial networks with application to text guided video generation. arXiv preprint arXiv:2008.05856 (2020)
Li, Y., et al.: StoryGAN: a sequential conditional GAN for story visualization. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6329–6338 (2019)
Yin, G., Liu, B., Sheng, L., Yu, N., Wang, X., Shao, J.: Semantics disentangling for text-to-image generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2327–2336 (2019)
Bao, W., Lai, W.-S., Ma, C., Zhang, X., Gao, Z., Yang, M.-H.: Depth-aware video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3703–3712 (2019)
Goodfellow, I.: NIPS 2016 tutorial: Generative adversarial networks. arXiv preprint arXiv:1701.00160 (2016)
Pu, Y., et al.: Variational autoencoder for deep learning of images, labels and captions. In: Advances in Neural Information Processing Systems, pp. 2352–2360 (2016)
Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: Proceedings of the IEEE International Conference on Computer Vision, pp. 5907–5915 (2017)
Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)
Sharma, S., Suhubdy, D., Michalski, V., Kahou, S.E., Bengio, Y.: ChatPainter: improving text to image generation using dialogue. arXiv preprint arXiv:1802.08216 (2018)
Hao, Z., Huang, X., Belongie, S.: Controllable video generation with sparse trajectories. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7854–7863 (2018)
Wang, W., Alameda-Pineda, X., Xu, D., Fua, P., Ricci, E., Sebe, N.: Every smile is unique: landmark-guided diverse smile generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 7083–7092 (2018)
Rebuffi, S.-A., Bilen, H., Vedaldi, A.: Efficient parametrization of multi-domain deep neural networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8119–8127 (2018)
Niklaus, S., Liu, F.: Context-aware synthesis for video frame interpolation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1701–1710 (2018)
Revaud, J., Weinzaepfel, P., Harchaoui, Z., Schmid, C.: EpicFlow: edge-preserving interpolation of correspondences for optical flow. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1164–1172 (2015)
Iashin, V., Rahtu, E.: Multi-modal dense video captioning. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, pp. 958–959 (2020)
Acknowledgment
This work is supported by the Singapore Ministry of Education Academic Research grant T1 251RES1812, "Dynamic Hybrid Real-time Rendering with Hardware Accelerated Ray-tracing and Rasterization for Interactive Applications". Special thanks to the National Supercomputing Centre (NSCC) Singapore for providing the computational resources required for training our architecture.