StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation

  • Conference paper
  • Computer Vision – ECCV 2022 (ECCV 2022)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13697)

Abstract

Recent advances in text-to-image synthesis have led to large pretrained transformers with excellent capabilities to generate visualizations from a given text. However, these models are ill-suited for specialized tasks like story visualization, which requires an agent to produce a sequence of images given a corresponding sequence of captions, forming a narrative. Moreover, we find that the story visualization task fails to accommodate generalization to unseen plots and characters in new narratives. Hence, we first propose the task of story continuation, where the generated visual story is conditioned on a source image, allowing for better generalization to narratives with new characters. Then, we enhance or ‘retro-fit’ the pretrained text-to-image synthesis models with task-specific modules for (a) sequential image generation and (b) copying relevant elements from an initial frame. We explore full-model finetuning, as well as prompt-based tuning for parameter-efficient adaptation, of the pretrained model. We evaluate our approach StoryDALL-E on two existing datasets, PororoSV and FlintstonesSV, and introduce a new dataset DiDeMoSV collected from a video-captioning dataset. We also develop a model StoryGANc based on Generative Adversarial Networks (GAN) for story continuation, and compare with the StoryDALL-E model to demonstrate the advantages of our approach. We show that our retro-fitting approach outperforms GAN-based models for story continuation. We also demonstrate that the ‘retro-fitting’ approach facilitates copying of visual elements from the source image and improved continuity in visual frames. Finally, our analysis suggests that pretrained transformers struggle with comprehending narratives containing multiple characters, and translating them into appropriate imagery. Our work encourages future research into story continuation and large-scale models for the task. Code and data are available at https://github.com/adymaharana/storydalle.
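
To make the ‘retro-fitting’ idea concrete, the sketch below shows one way a task-specific module could be attached to a frozen pretrained transformer block: a newly added cross-attention layer attends to tokens of the source frame, so that relevant visual elements can be carried over into the generated frames. This is a minimal, hypothetical PyTorch illustration, not the authors' implementation; the class name RetroCrossAttentionBlock, the dimensions, and the stand-in pretrained_block are invented for the example.

    # Minimal sketch, assuming a pretrained block that maps (batch, seq, d_model)
    # embeddings to embeddings; all names here are illustrative, not from the paper's code.
    import torch
    import torch.nn as nn

    class RetroCrossAttentionBlock(nn.Module):
        """Frozen pretrained transformer block + new cross-attention over source-frame tokens."""

        def __init__(self, pretrained_block: nn.Module, d_model: int = 512, n_heads: int = 8):
            super().__init__()
            self.pretrained_block = pretrained_block
            for p in self.pretrained_block.parameters():
                p.requires_grad = False  # pretrained weights stay fixed ('retro-fitting')
            # newly added, randomly initialized parameters: the only ones trained here
            self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, x: torch.Tensor, source_tokens: torch.Tensor) -> torch.Tensor:
            # x: (batch, seq_len, d_model) caption + current-frame image-token embeddings
            # source_tokens: (batch, src_len, d_model) embeddings of the initial (source) frame
            h = self.pretrained_block(x)                      # original generation pathway
            attn_out, _ = self.cross_attn(h, source_tokens, source_tokens)
            return self.norm(h + attn_out)                    # lets outputs copy source elements

    # Illustration with a stand-in identity "pretrained" block:
    block = RetroCrossAttentionBlock(nn.Identity(), d_model=512)
    out = block(torch.randn(2, 16, 512), torch.randn(2, 64, 512))  # -> (2, 16, 512)

In this sketch only the new cross-attention and layer-norm parameters receive gradients, which loosely mirrors the parameter-efficient adaptation discussed in the abstract; full-model finetuning would instead leave the pretrained weights trainable as well.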

Notes

  1. https://github.com/kakaobrain/minDALL-E.

Acknowledgement

We thank the reviewers for their useful feedback. This work was supported by ARO Award W911NF2110220, DARPA KAIROS Grant FA8750-19-2-1004, NSF-AI Engage Institute DRL-211263. The views, opinions, and/or findings contained in this article are those of the authors, not the funding agency.

Author information

Corresponding author

Correspondence to Adyasha Maharana.

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 7505 KB)

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Maharana, A., Hannan, D., Bansal, M. (2022). StoryDALL-E: Adapting Pretrained Text-to-Image Transformers for Story Continuation. In: Avidan, S., Brostow, G., Cissé, M., Farinella, G.M., Hassner, T. (eds) Computer Vision – ECCV 2022. ECCV 2022. Lecture Notes in Computer Science, vol 13697. Springer, Cham. https://doi.org/10.1007/978-3-031-19836-6_5

  • DOI: https://doi.org/10.1007/978-3-031-19836-6_5

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-19835-9

  • Online ISBN: 978-3-031-19836-6

  • eBook Packages: Computer Science, Computer Science (R0)
