Modular StoryGAN with Background and Theme Awareness for Story Visualization

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13363)

Abstract

Story visualization is a novel task at the intersection of computer vision and natural language processing: given a series of natural-language sentences that compose a story, the goal is to generate a sequence of images, one corresponding to each sentence. Prior works have introduced recurrent generative models that outperform text-to-image models on this task; however, maintaining local and global consistency remains a challenge for these solutions. To address this, we propose a new modular architecture, Modular StoryGAN, which combines the most promising components of prior works. To measure local and global consistency, we introduce background and theme awareness, two attributes expected of any solution. Human evaluation of the generated images demonstrates that Modular StoryGAN possesses background and theme awareness. Beyond this subjective evaluation, objective evaluation also shows that our model outperforms the state-of-the-art CP-CSV and DuCo models.
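
To make the task setup concrete, the following minimal sketch (PyTorch-style Python, not part of the paper) shows the general shape of a recurrent story-visualization generator: a recurrent cell accumulates story context across sentences, and a per-sentence decoder renders each frame from that context. All module names and dimensions here are illustrative assumptions, not the authors' Modular StoryGAN architecture.

    import torch
    import torch.nn as nn

    class RecurrentStoryGenerator(nn.Module):
        """Toy recurrent text-to-image-sequence generator (illustrative only)."""

        def __init__(self, sent_dim=128, hidden_dim=256, noise_dim=64):
            super().__init__()
            self.noise_dim = noise_dim
            self.hidden_dim = hidden_dim
            # The recurrent cell carries story context across sentences,
            # which is what gives the image sequence its global consistency.
            self.cell = nn.GRUCell(sent_dim + noise_dim, hidden_dim)
            # A small up-convolutional decoder maps the story state to a 64x64 frame.
            self.fc = nn.Linear(hidden_dim, 256 * 4 * 4)
            self.decoder = nn.Sequential(
                nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),
            )

        def forward(self, sentence_embs):
            # sentence_embs: (batch, num_sentences, sent_dim), one row per story sentence.
            b, t, _ = sentence_embs.shape
            h = sentence_embs.new_zeros(b, self.hidden_dim)
            frames = []
            for i in range(t):
                # Per-frame noise plus the current sentence update the story state.
                z = torch.randn(b, self.noise_dim, device=sentence_embs.device)
                h = self.cell(torch.cat([sentence_embs[:, i], z], dim=1), h)
                x = self.fc(h).view(b, 256, 4, 4)
                frames.append(self.decoder(x))
            return torch.stack(frames, dim=1)  # (batch, num_sentences, 3, 64, 64)

    # Example: a 5-sentence story, batch of 2, with precomputed sentence embeddings.
    gen = RecurrentStoryGenerator()
    out = gen(torch.randn(2, 5, 128))
    print(out.shape)  # torch.Size([2, 5, 3, 64, 64])

In the models discussed in the paper, such a decoder is trained adversarially as a GAN generator, with further components added to keep characters, backgrounds, and the overall theme consistent across frames.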

References

  1. Babu, K.K., Dubey, S.R.: CDGAN: cyclic discriminative generative adversarial networks for image-to-image transformation. J. Vis. Commun. Image Represent. 82, 103382 (2022)

  2. Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)

  3. Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014

  4. Deng, K., Fei, T., Huang, X., Peng, Y.: IRC-GAN: introspective recurrent convolutional GAN for text-to-video generation. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2216–2222 (2019). https://doi.org/10.24963/ijcai.2019/307

  5. Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)

  6. Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)

  7. Hu, Y., Luo, C., Chen, Z.: Make it move: controllable image-to-video generation with text descriptions. arXiv preprint arXiv:2112.02815 (2021)

  8. Kim, K.M., Heo, M.O., Choi, S.H., Zhang, B.T.: DeepStory: video story QA by deep embedded memory networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2016–2022 (2017)

  9. Lei, J., Wang, L., Shen, Y., Yu, D., Berg, T., Bansal, M.: MART: memory-augmented recurrent transformer for coherent video paragraph captioning. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2603–2614 (2020)

  10. Li, C., Kong, L., Zhou, Z.: Improved-StoryGAN for sequential images visualization. J. Vis. Commun. Image Represent. 73, 102956 (2020)

  11. Li, Y., et al.: StoryGAN: a sequential conditional GAN for story visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6329–6338 (2019)

  12. Li, Y., Min, M.R., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: 32nd AAAI Conference on Artificial Intelligence, pp. 7065–7072. AAAI Press (2018)

  13. Maharana, A., Bansal, M.: Integrating visuospatial, linguistic, and commonsense structure into story visualization. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6772–6786 (2021)

  14. Maharana, A., Hannan, D., Bansal, M.: Improving generation and evaluation of visual stories via semantic consistency. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2427–2442 (2021)

  15. Marwah, T., Mittal, G., Balasubramanian, V.N.: Attentive semantic video generation using captions. In: IEEE International Conference on Computer Vision, pp. 1426–1434 (2017)

  16. Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)

  17. Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069. PMLR (2016)

  18. Sharma, S., Asri, L.E., Schulz, H., Zumer, J.: Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. arXiv preprint arXiv:1706.09799 (2017)

  19. Song, Y.-Z., Tam, Z.R., Chen, H.-J., Lu, H.-H., Shuai, H.-H.: Character-preserving coherent story visualization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 18–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_2

  20. Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526–1535 (2018)

  21. Wen, Z., Xie, L., Feng, H., Tan, Y.: Robust fusion algorithm based on RBF neural network with TS fuzzy model and its application to infrared flame detection problem. Appl. Soft Comput. 76, 251–264 (2019)

  22. Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)

  23. Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)

  24. Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated convolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480 (2019)

  25. Yu, Y., Tu, Z., Lu, L., Chen, X., Zhan, H., Sun, Z.: Text2Video: automatic video generation based on text scripts. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2753–2755 (2021)

  26. Zeng, G., Li, Z., Zhang, Y.: PororoGAN: an improved story visualization model on Pororo-SV dataset. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, pp. 155–159 (2019)

  27. Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 5908–5916 (2017). https://doi.org/10.1109/ICCV.2017.629

  28. Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)

Acknowledgment

This work was supported by the Ministry of Innovation and Technology through the MEC_R_21 program of the National Research, Development and Innovation Office.

Author information

Correspondence to Gábor Szűcs.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Szűcs, G., Al-Shouha, M. (2022). Modular StoryGAN with Background and Theme Awareness for Story Visualization. In: El Yacoubi, M., Granger, E., Yuen, P.C., Pal, U., Vincent, N. (eds) Pattern Recognition and Artificial Intelligence. ICPRAI 2022. Lecture Notes in Computer Science, vol 13363. Springer, Cham. https://doi.org/10.1007/978-3-031-09037-0_23

Download citation

  • DOI: https://doi.org/10.1007/978-3-031-09037-0_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-09036-3

  • Online ISBN: 978-3-031-09037-0

  • eBook Packages: Computer Science, Computer Science (R0)
