DOI: 10.1145/3343031.3350571

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

Published: 15 October 2019

Abstract

A storyboard is a sequence of images that illustrates a story consisting of multiple sentences, and creating one is a key step in producing many kinds of story products. In this paper, we tackle a new multimedia task, automatic storyboard creation, to facilitate this process and to inspire human artists. Motivated by the fact that our understanding of language is grounded in past experience, we propose a novel inspire-and-create framework with a story-to-image retriever, which selects relevant cinematic images for inspiration, and a storyboard creator, which further refines and renders the images to improve relevance and visual consistency. The retriever dynamically exploits contextual information in the story through hierarchical attention and applies dense visual-semantic matching to accurately retrieve and ground images. The creator then applies three rendering steps that increase the flexibility of the retrieved images: erasing irrelevant regions, unifying image styles, and substituting consistent characters. We carry out extensive experiments on both in-domain and out-of-domain visual story datasets. The proposed model achieves better quantitative performance than state-of-the-art baselines for storyboard creation, and qualitative visualizations and user studies further verify that our approach can create high-quality storyboards even for stories in the wild.
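To make the pipeline concrete, the sketch below restates the abstract's two-stage design as minimal Python. It is an illustration only, not the authors' implementation: every name in it (embed, retrieve, render, the gallery file names) is hypothetical, the learned hierarchical-attention encoder and dense visual-semantic matching are replaced by a toy hash-based text encoder with cosine similarity, and the three GAN-based rendering steps are reduced to labels.

```python
# Illustrative sketch of the inspire-and-create pipeline described in the
# abstract; all names and models here are stand-ins, not the paper's code.
import hashlib
import numpy as np

DIM = 128

def embed(text):
    """Toy stand-in for the paper's learned story/image encoders:
    a deterministic unit vector derived from a hash of the text."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def retrieve(story, gallery):
    """Pick the best-matching gallery image for each sentence. Mixing in the
    mean story vector is a crude placeholder for the paper's hierarchical
    attention over story context."""
    context = np.mean([embed(s) for s in story], axis=0)
    picks = []
    for sentence in story:
        query = 0.8 * embed(sentence) + 0.2 * context
        picks.append(max(gallery, key=lambda name: float(gallery[name] @ query)))
    return picks

def render(image_name):
    """The creator's three refinement steps from the abstract, reduced to
    labels: erase irrelevant regions, unify style, substitute characters."""
    for step in ("erase_irrelevant_regions", "unify_style", "substitute_characters"):
        image_name = f"{step}({image_name})"
    return image_name

if __name__ == "__main__":
    # Hypothetical candidate pool; the paper retrieves from cinematic images.
    gallery = {name: embed(name) for name in
               ("beach_sunset.jpg", "city_rain.jpg", "forest_walk.jpg")}
    story = ["They walked along the shore at dusk.",
             "Rain caught them on the way back into town."]
    for sentence, image in zip(story, (render(i) for i in retrieve(story, gallery))):
        print(f"{sentence!r} -> {image}")
```

The point of the sketch is the control flow rather than the models: retrieval is conditioned on both the individual sentence and the story-level context, and every retrieved image passes through the same three refinement steps before entering the storyboard.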



Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. inspire-and-create
  3. storyboard creation

Qualifiers

  • Research-article


Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)



Article Metrics

  • Downloads (last 12 months): 95
  • Downloads (last 6 weeks): 5
Reflects downloads up to 13 Feb 2025


Cited By

  • (2024) ScaMo: Towards Text to Video Storyboard Generation Using Scale and Movement of Shots. Proceedings of the 6th ACM International Conference on Multimedia in Asia, pp. 1-8. DOI: 10.1145/3696409.3700279. Online publication date: 3-Dec-2024.
  • (2024) RetAssist: Facilitating Vocabulary Learners with Generative Images in Story Retelling Practices. Proceedings of the 2024 ACM Designing Interactive Systems Conference, pp. 2019-2036. DOI: 10.1145/3643834.3661581. Online publication date: 1-Jul-2024.
  • (2024) ReelFramer: Human-AI Co-Creation for News-to-Video Translation. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-20. DOI: 10.1145/3613904.3642868. Online publication date: 11-May-2024.
  • (2024) CollageVis: Rapid Previsualization Tool for Indie Filmmaking using Video Collages. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-16. DOI: 10.1145/3613904.3642575. Online publication date: 11-May-2024.
  • (2024) Griffith: A Storyboarding Tool Designed with Japanese Animation Professionals. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-14. DOI: 10.1145/3613904.3642121. Online publication date: 11-May-2024.
  • (2024) Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration. IEEE Transactions on Multimedia, vol. 26, pp. 2456-2466. DOI: 10.1109/TMM.2023.3296944. Online publication date: 2024.
  • (2024) Script-to-Storyboard-to-Story Reel Framework. 2024 28th International Conference Information Visualisation (IV), pp. 350-355. DOI: 10.1109/IV64223.2024.00067. Online publication date: 22-Jul-2024.
  • (2024) Artistic Fusion: AI Powered Artistry for Story Boarding. 2024 4th International Conference on Sustainable Expert Systems (ICSES), pp. 795-800. DOI: 10.1109/ICSES63445.2024.10763187. Online publication date: 15-Oct-2024.
  • (2024) ArtAI4DS: AI Art and Its Empowering Role in Digital Storytelling. Entertainment Computing – ICEC 2024, pp. 78-93. DOI: 10.1007/978-3-031-74353-5_6. Online publication date: 1-Oct-2024.
  • (2023) TeViS: Translating Text Synopses to Video Storyboards. Proceedings of the 31st ACM International Conference on Multimedia, pp. 4968-4979. DOI: 10.1145/3581783.3612417. Online publication date: 26-Oct-2023.
