Abstract
Story visualization is an emerging task at the intersection of computer vision and natural language processing: given a sequence of natural language sentences that compose a story, a corresponding sequence of images must be generated. Prior work has introduced recurrent generative models that outperform text-to-image models on this task, yet maintaining local and global consistency across the generated images remains challenging. To address this, we propose a new modular architecture, Modular StoryGAN, which combines the most promising components of prior work. To measure local and global consistency, we introduce background and theme awareness, two properties a solution is expected to exhibit. Human evaluation of the generated images demonstrates that Modular StoryGAN possesses background and theme awareness. Beyond this subjective evaluation, objective evaluation also shows that our model outperforms the state-of-the-art CP-CSV and DuCo models.
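To make the setting concrete, the sketch below shows a minimal recurrent story-to-image generator in the spirit of the StoryGAN family of models [18]: a GRU cell carries a story-level context vector across sentences, and an upsampling decoder renders one frame per sentence. All class names, layer sizes, and the exact wiring are illustrative assumptions for this sketch, not the Modular StoryGAN architecture.

    # Minimal sketch of a recurrent story-visualization generator.
    # Module names, dimensions, and wiring are illustrative assumptions,
    # not the architecture proposed in this paper.
    import torch
    import torch.nn as nn

    class RecurrentStoryGenerator(nn.Module):
        def __init__(self, text_dim=128, ctx_dim=128, noise_dim=64, img_ch=3):
            super().__init__()
            self.noise_dim = noise_dim
            # A GRU cell carries story-level (global) state across sentences;
            # this shared context is what local/global consistency hinges on.
            self.context = nn.GRUCell(text_dim + noise_dim, ctx_dim)
            # An upsampling decoder renders one 64x64 frame per sentence.
            self.decode = nn.Sequential(
                nn.Linear(ctx_dim, 512 * 4 * 4),
                nn.Unflatten(1, (512, 4, 4)),
                nn.ConvTranspose2d(512, 256, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),
                nn.ConvTranspose2d(64, img_ch, 4, 2, 1), nn.Tanh(),
            )

        def forward(self, sentence_embs):
            # sentence_embs: (batch, story_len, text_dim), one embedding per sentence
            b, t, _ = sentence_embs.shape
            h = sentence_embs.new_zeros(b, self.context.hidden_size)
            frames = []
            for i in range(t):
                # Fresh noise per frame gives local variation; the recurrent
                # state h preserves the global story context between frames.
                z = torch.randn(b, self.noise_dim, device=sentence_embs.device)
                h = self.context(torch.cat([sentence_embs[:, i], z], dim=1), h)
                frames.append(self.decode(h))
            return torch.stack(frames, dim=1)  # (batch, story_len, img_ch, 64, 64)

    story = torch.randn(2, 5, 128)             # a 5-sentence story, already embedded
    images = RecurrentStoryGenerator()(story)  # -> (2, 5, 3, 64, 64)

In a full model, a discriminator would score each frame against its sentence (local consistency) and the frame sequence against the whole story (global consistency); background and theme awareness, as introduced above, probe exactly these two levels.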
References
Babu, K.K., Dubey, S.R.: CDGAN: cyclic discriminative generative adversarial networks for image-to-image transformation. J. Vis. Commun. Image Represent. 82, 103382 (2022)
Bahdanau, D., Cho, K.H., Bengio, Y.: Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015 (2015)
Chung, J., Gulcehre, C., Cho, K., Bengio, Y.: Empirical evaluation of gated recurrent neural networks on sequence modeling. In: NIPS 2014 Workshop on Deep Learning, December 2014
Deng, K., Fei, T., Huang, X., Peng, Y.: IRC-GAN: introspective recurrent convolutional GAN for text-to-video generation. In: Proceedings of the 28th International Joint Conference on Artificial Intelligence (IJCAI), pp. 2216–2222 (2019). https://doi.org/10.24963/ijcai.2019/307
Goodfellow, I., et al.: Generative adversarial nets. In: Advances in Neural Information Processing Systems, vol. 27 (2014)
Heusel, M., Ramsauer, H., Unterthiner, T., Nessler, B., Hochreiter, S.: GANs trained by a two time-scale update rule converge to a local Nash equilibrium. In: Advances in Neural Information Processing Systems, pp. 6626–6637 (2017)
Hu, Y., Luo, C., Chen, Z.: Make it move: controllable image-to-video generation with text descriptions. arXiv preprint arXiv:2112.02815 (2021)
Kim, K.M., Heo, M.O., Choi, S.H., Zhang, B.T.: DeepStory: video story QA by deep embedded memory networks. In: Proceedings of the 26th International Joint Conference on Artificial Intelligence, pp. 2016–2022 (2017)
Lei, J., Wang, L., Shen, Y., Yu, D., Berg, T., Bansal, M.: MART: memory-augmented recurrent transformer for coherent video paragraph captioning. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 2603–2614 (2020)
Li, C., Kong, L., Zhou, Z.: Improved-StoryGAN for sequential images visualization. J. Vis. Commun. Image Represent. 73, 102956 (2020)
Li, Y., et al.: StoryGAN: a sequential conditional GAN for story visualization. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6329–6338 (2019)
Li, Y., Min, M.R., Shen, D., Carlson, D., Carin, L.: Video generation from text. In: 32nd AAAI Conference on Artificial Intelligence, pp. 7065–7072. AAAI Press (2018)
Maharana, A., Bansal, M.: Integrating visuospatial, linguistic, and commonsense structure into story visualization. In: Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6772–6786 (2021)
Maharana, A., Hannan, D., Bansal, M.: Improving generation and evaluation of visual stories via semantic consistency. In: Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 2427–2442 (2021)
Marwah, T., Mittal, G., Balasubramanian, V.N.: Attentive semantic video generation using captions. In: IEEE International Conference on Computer Vision, pp. 1426–1434 (2017)
Qiao, T., Zhang, J., Xu, D., Tao, D.: MirrorGAN: learning text-to-image generation by redescription. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 1505–1514 (2019)
Reed, S., Akata, Z., Yan, X., Logeswaran, L., Schiele, B., Lee, H.: Generative adversarial text to image synthesis. In: International Conference on Machine Learning, pp. 1060–1069. PMLR (2016)
Sharma, S., Asri, L.E., Schulz, H., Zumer, J.: Relevance of unsupervised metrics in task-oriented dialogue for evaluating natural language generation. arXiv preprint arXiv:1706.09799 (2017)
Song, Y.-Z., Tam, Z.-R., Chen, H.-J., Lu, H.-H., Shuai, H.-H.: Character-preserving coherent story visualization. In: Vedaldi, A., Bischof, H., Brox, T., Frahm, J.-M. (eds.) ECCV 2020. LNCS, vol. 12362, pp. 18–33. Springer, Cham (2020). https://doi.org/10.1007/978-3-030-58520-4_2
Tulyakov, S., Liu, M.Y., Yang, X., Kautz, J.: MoCoGAN: decomposing motion and content for video generation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1526–1535 (2018)
Wen, Z., Xie, L., Feng, H., Tan, Y.: Robust fusion algorithm based on RBF neural network with TS fuzzy model and its application to infrared flame detection problem. Appl. Soft Comput. 76, 251–264 (2019)
Xu, T., et al.: AttnGAN: fine-grained text to image generation with attentional generative adversarial networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1316–1324 (2018)
Yu, F., Koltun, V.: Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122 (2015)
Yu, J., Lin, Z., Yang, J., Shen, X., Lu, X., Huang, T.S.: Free-form image inpainting with gated convolution. In: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 4471–4480 (2019)
Yu, Y., Tu, Z., Lu, L., Chen, X., Zhan, H., Sun, Z.: Text2Video: automatic video generation based on text scripts. In: Proceedings of the 29th ACM International Conference on Multimedia, pp. 2753–2755 (2021)
Zeng, G., Li, Z., Zhang, Y.: PororoGAN: an improved story visualization model on Pororo-SV dataset. In: Proceedings of the 2019 3rd International Conference on Computer Science and Artificial Intelligence, pp. 155–159 (2019)
Zhang, H., et al.: StackGAN: text to photo-realistic image synthesis with stacked generative adversarial networks. In: IEEE International Conference on Computer Vision (ICCV), vol. 1, pp. 5908–5916 (2017). https://doi.org/10.1109/ICCV.2017.629
Zhu, M., Pan, P., Chen, W., Yang, Y.: DM-GAN: dynamic memory generative adversarial networks for text-to-image synthesis. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 5802–5810 (2019)
Acknowledgment
The work was supported by the Ministry of Innovation and Technology through the MEC_R_21 program of the National Research, Development and Innovation Office.
Copyright information
© 2022 Springer Nature Switzerland AG
Cite this paper
Szűcs, G., Al-Shouha, M. (2022). Modular StoryGAN with Background and Theme Awareness for Story Visualization. In: El Yacoubi, M., Granger, E., Yuen, P.C., Pal, U., Vincent, N. (eds) Pattern Recognition and Artificial Intelligence. ICPRAI 2022. Lecture Notes in Computer Science, vol 13363. Springer, Cham. https://doi.org/10.1007/978-3-031-09037-0_23
Print ISBN: 978-3-031-09036-3
Online ISBN: 978-3-031-09037-0