DOI: 10.1145/3343031.3350571

Neural Storyboard Artist: Visualizing Stories with Coherent Image Sequences

Published: 15 October 2019

Abstract

A storyboard is a sequence of images that illustrates a story consisting of multiple sentences, and creating one is a key step in producing many kinds of story products. In this paper, we tackle a new multimedia task, automatic storyboard creation, to facilitate this process and to inspire human artists. Motivated by the fact that our understanding of language is grounded in past experience, we propose a novel inspire-and-create framework with a story-to-image retriever, which selects relevant cinematic images for inspiration, and a storyboard creator, which further refines and renders the images to improve relevance and visual consistency. The retriever dynamically exploits contextual information in the story through hierarchical attention and applies dense visual-semantic matching to accurately retrieve and ground images. The creator then applies three rendering steps that increase the flexibility of the retrieved images: erasing irrelevant regions, unifying image styles, and substituting consistent characters. We carry out extensive experiments on both in-domain and out-of-domain visual story datasets. The proposed model achieves better quantitative performance than state-of-the-art baselines for storyboard creation, and qualitative visualizations and user studies further verify that our approach can create high-quality storyboards even for stories in the wild.
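To make the pipeline concrete, the sketch below restates the abstract's two-stage design as minimal Python. It is an illustration only, not the authors' implementation: every name in it (embed, retrieve, render, the gallery file names) is hypothetical, the learned hierarchical-attention encoder and dense visual-semantic matching are replaced by a toy hash-based text encoder with cosine similarity, and the three GAN-based rendering steps are reduced to labels.

```python
# Illustrative sketch of the inspire-and-create pipeline described in the
# abstract; all names and models here are stand-ins, not the paper's code.
import hashlib
import numpy as np

DIM = 128

def embed(text):
    """Toy stand-in for the paper's learned story/image encoders:
    a deterministic unit vector derived from a hash of the text."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    v = np.random.default_rng(seed).standard_normal(DIM)
    return v / np.linalg.norm(v)

def retrieve(story, gallery):
    """Pick the best-matching gallery image for each sentence. Mixing in the
    mean story vector is a crude placeholder for the paper's hierarchical
    attention over story context."""
    context = np.mean([embed(s) for s in story], axis=0)
    picks = []
    for sentence in story:
        query = 0.8 * embed(sentence) + 0.2 * context
        picks.append(max(gallery, key=lambda name: float(gallery[name] @ query)))
    return picks

def render(image_name):
    """The creator's three refinement steps from the abstract, reduced to
    labels: erase irrelevant regions, unify style, substitute characters."""
    for step in ("erase_irrelevant_regions", "unify_style", "substitute_characters"):
        image_name = f"{step}({image_name})"
    return image_name

if __name__ == "__main__":
    # Hypothetical candidate pool; the paper retrieves from cinematic images.
    gallery = {name: embed(name) for name in
               ("beach_sunset.jpg", "city_rain.jpg", "forest_walk.jpg")}
    story = ["They walked along the shore at dusk.",
             "Rain caught them on the way back into town."]
    for sentence, image in zip(story, (render(i) for i in retrieve(story, gallery))):
        print(f"{sentence!r} -> {image}")
```

The point of the sketch is the control flow rather than the models: retrieval is conditioned on both the individual sentence and the story-level context, and every retrieved image passes through the same three refinement steps before entering the storyboard.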



Published In

MM '19: Proceedings of the 27th ACM International Conference on Multimedia
October 2019
2794 pages
ISBN: 9781450368896
DOI: 10.1145/3343031
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. cross-modal retrieval
  2. inspire-and-create
  3. storyboard creation

Qualifiers

  • Research-article


Conference

MM '19

Acceptance Rates

MM '19 paper acceptance rate: 252 of 936 submissions (27%)
Overall acceptance rate: 2,145 of 8,556 submissions (25%)



Article Metrics

  • Downloads (last 12 months): 95
  • Downloads (last 6 weeks): 5
Reflects downloads up to 13 Feb 2025


Cited By

  • (2024) ScaMo: Towards Text to Video Storyboard Generation Using Scale and Movement of Shots. Proceedings of the 6th ACM International Conference on Multimedia in Asia, pp. 1-8. DOI: 10.1145/3696409.3700279. Online publication date: 3-Dec-2024.
  • (2024) RetAssist: Facilitating Vocabulary Learners with Generative Images in Story Retelling Practices. Proceedings of the 2024 ACM Designing Interactive Systems Conference, pp. 2019-2036. DOI: 10.1145/3643834.3661581. Online publication date: 1-Jul-2024.
  • (2024) ReelFramer: Human-AI Co-Creation for News-to-Video Translation. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-20. DOI: 10.1145/3613904.3642868. Online publication date: 11-May-2024.
  • (2024) CollageVis: Rapid Previsualization Tool for Indie Filmmaking using Video Collages. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-16. DOI: 10.1145/3613904.3642575. Online publication date: 11-May-2024.
  • (2024) Griffith: A Storyboarding Tool Designed with Japanese Animation Professionals. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, pp. 1-14. DOI: 10.1145/3613904.3642121. Online publication date: 11-May-2024.
  • (2024) Show Me a Video: A Large-Scale Narrated Video Dataset for Coherent Story Illustration. IEEE Transactions on Multimedia, vol. 26, pp. 2456-2466. DOI: 10.1109/TMM.2023.3296944. Online publication date: 2024.
  • (2024) Script-to-Storyboard-to-Story Reel Framework. 2024 28th International Conference Information Visualisation (IV), pp. 350-355. DOI: 10.1109/IV64223.2024.00067. Online publication date: 22-Jul-2024.
  • (2024) Artistic Fusion: AI Powered Artistry for Story Boarding. 2024 4th International Conference on Sustainable Expert Systems (ICSES), pp. 795-800. DOI: 10.1109/ICSES63445.2024.10763187. Online publication date: 15-Oct-2024.
  • (2024) ArtAI4DS: AI Art and Its Empowering Role in Digital Storytelling. Entertainment Computing – ICEC 2024, pp. 78-93. DOI: 10.1007/978-3-031-74353-5_6. Online publication date: 1-Oct-2024.
  • (2023) TeViS: Translating Text Synopses to Video Storyboards. Proceedings of the 31st ACM International Conference on Multimedia, pp. 4968-4979. DOI: 10.1145/3581783.3612417. Online publication date: 26-Oct-2023.
