Abstract
Story visualization is an emerging task at the intersection of computer vision and natural language processing: given a sequence of natural language sentences that compose a story, a corresponding sequence of images must be generated. Prior work has introduced recurrent generative models that outperform text-to-image models on this task, yet maintaining local and global consistency across the generated images remains challenging. To address this, we propose a new modular architecture, Modular StoryGAN, which combines the most promising components of prior work. To measure local and global consistency, we introduce background and theme awareness, two properties a solution is expected to exhibit. Human evaluation of the generated images demonstrates that Modular StoryGAN possesses background and theme awareness. Beyond this subjective evaluation, objective evaluation also shows that our model outperforms the state-of-the-art CP-CSV and DuCo models.
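To make the setting concrete, the sketch below shows a minimal recurrent story-to-image generator in the spirit of the StoryGAN family of models [18]: a GRU cell carries a story-level context vector across sentences, and an upsampling decoder renders one frame per sentence. All class names, layer sizes, and the exact wiring are illustrative assumptions for this sketch, not the Modular StoryGAN architecture.

    # Minimal sketch of a recurrent story-visualization generator.
    # Module names, dimensions, and wiring are illustrative assumptions,
    # not the architecture proposed in this paper.
    import torch
    import torch.nn as nn

    class RecurrentStoryGenerator(nn.Module):
        def __init__(self, text_dim=128, ctx_dim=128, noise_dim=64, img_ch=3):
            super().__init__()
            self.noise_dim = noise_dim
            # A GRU cell carries story-level (global) state across sentences;
            # this shared context is what local/global consistency hinges on.
            self.context = nn.GRUCell(text_dim + noise_dim, ctx_dim)
            # An upsampling decoder renders one 64x64 frame per sentence.
            self.decode = nn.Sequential(
                nn.Linear(ctx_dim, 512 * 4 * 4),
                nn.Unflatten(1, (512, 4, 4)),
                nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(64, img_ch, 4, 2, 1), nn.Tanh(),
            )

        def forward(self, sentence_embs):
            # sentence_embs: (batch, story_len, text_dim), one embedding per sentence
            b, t, _ = sentence_embs.shape
            h = sentence_embs.new_zeros(b, self.context.hidden_size)
            frames = []
            for i in range(t):
                # Fresh noise per frame gives local variation; the recurrent
                # state h preserves the global story context between frames.
                z = torch.randn(b, self.noise_dim, device=sentence_embs.device)
                h = self.context(torch.cat([sentence_embs[:, i], z], dim=1), h)
                frames.append(self.decode(h))
            return torch.stack(frames, dim=1)  # (batch, story_len, img_ch, 64, 64)

    story = torch.randn(2, 5, 128)             # a 5-sentence story, already embedded
    images = RecurrentStoryGenerator()(story)  # -> (2, 5, 3, 64, 64)

In a full model, a discriminator would score each frame against its sentence (local consistency) and the frame sequence against the whole story (global consistency); background and theme awareness, as introduced above, probe exactly these two levels.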
References
Babu, K.K., Dubey, S.R.: CDGAN: cyclic discriminative generative adversarial networks for image-to-image transformation. J. Vis. Commun. Image Represent. 82, 103382 (2022)
Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014
Deng, K., Fei, T., Huang, X., Peng, Y.: IRC-GAN: introspective recurrent convolutional GAN for text-to-video generation. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2216–2222 (2019). https://doi.org/10.24963/ijcai.2019/307
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)
Hu, Y., Luo, C., Chen, Z.: Make it move: controllable image-to-video generation with text descriptions. arXiv preprint arXiv:2112.02815 (2021)
Kim, K.M., Heo, M.O., Choi, S.H., Zhang, B.T.: DeepStory: video story QA by deep embedded memory networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2016–2022 (2017)
Lei, J., Wang, L., Shen, Y., Yu, D., Berg, T., Bansal, M.: MART: memory-augmented recurrent transformer for coherent video paragraph captioning. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2603–2614 (2020)
Li, C., Kong, L., Zhou, Z.: Improved-StoryGAN for sequential images visualization. J. Vis. Commun. Image Represent. 73, 102956 (2020)
Li, Y., et al.: StoryGAN: a sequential conditional GAN for story visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6329–6338 (2019)
Li, Y., Min, M.R., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: 32nd AAAI Conference on Artificial Intelligence, pp. 7065–7072. AAAI Press (2018)
Maharana, A., Bansal, M.: Integrating visuospatial, linguistic, and commonsense structure into story visualization. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6772–6786 (2021)
Maharana, A., Hannan, D., Bansal, M.: Improving generation and evaluation of visual stories via semantic consistency. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2427–2442 (2021)
Marwah, T., Mittal, G., Balasubramanian, V.N.: Attentive semantic video generation using captions. In: IEEE International Conference on Computer Vision, pp. 1426–1434 (2017)
Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069. PMLR (2016)
Sharma, S., Asri, L.E., Schulz, H., Zumer, J.: Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. arXiv preprint arXiv:1706.09799 (2017)
Song, Y.-Z., Tam, Z.-R., Chen, H.-J., Lu, H.-H., Shuai, H.-H.: Character-preserving coherent story visualization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 18–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_2
Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526–1535 (2018)
Wen, Z., Xie, L., Feng, H., Tan, Y.: Robust fusion algorithm based on RBF neural network with TS fuzzy model and its application to infrared flame detection problem. Appl. Soft Comput. 76, 251–264 (2019)
Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated convolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480 (2019)
Yu, Y., Tu, Z., Lu, L., Chen, X., Zhan, H., Sun, Z.: Text2Video: automatic video generation based on text scripts. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2753–2755 (2021)
Zeng, G., Li, Z., Zhang, Y.: PororoGAN: an improved story visualization model on Pororo-SV dataset. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, pp. 155–159 (2019)
Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 5908–5916 (2017). https://doi.org/10.1109/ICCV.2017.629
Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)
Acknowledgment
The work was supported by the Ministry of Innovation and Technology through the MEC_R_21 program of the National Research, Development and Innovation Office.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Szűcs, G., Al-Shouha, M. (2022). Modular StoryGAN with Background and Theme Awareness for Story Visualization. In: El Yacoubi, M., Granger, E., Yuen, P.C., Pal, U., Vincent, N. (eds) Pattern Recognition and Artificial Intelligence. ICPRAI 2022. Lecture Notes in Computer Science, vol 13363. Springer, Cham. https://doi.org/10.1007/978-3-031-09037-0_23
Print ISBN: 978-3-031-09036-3
Online ISBN: 978-3-031-09037-0