DOI: 10.1145/3474085.3475236

Latent Memory-augmented Graph Transformer for Visual Storytelling

Published: 17 October 2021


Visual storytelling aims to automatically generate a human-like short story from an image stream. Most existing works use either scene-level or object-level representations, neglecting both the interactions among objects within each image and the sequential dependency between consecutive images. In this paper, we present a novel Latent Memory-augmented Graph Transformer (LMGT), a Transformer-based framework for visual story generation. LMGT inherits the merits of the Transformer and is further enhanced with two carefully designed components: a graph encoding module and a latent memory unit. Specifically, the graph encoding module exploits the semantic relationships among image regions and attentively aggregates critical visual features based on the parsed scene graphs. Furthermore, to better preserve inter-sentence coherence and topic consistency, we introduce an augmented latent memory unit that learns and records highly summarized latent information, serving as the story line, from the image stream and the sentence history. Experimental results on three widely used datasets demonstrate the superior performance of LMGT over state-of-the-art methods.
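The abstract describes the two components only at a high level, so the following is a minimal illustrative sketch rather than the LMGT implementation. It assumes that "attentive aggregation" can be approximated by scaled dot-product attention restricted to scene-graph neighbours, and that the memory update can be approximated by a fixed scalar gate. The names `graph_attention` and `update_memory`, the undirected treatment of edges, and the gate value are all hypothetical and do not come from the paper.

```python
import math


def graph_attention(features, edges):
    """Aggregate each region's features over its scene-graph neighbours.

    features: list of equal-length float vectors, one per image region.
    edges: iterable of (i, j) index pairs from a parsed scene graph
           (treated as undirected here, an assumption of this sketch).
    """
    n = len(features)
    dim = len(features[0])
    # Every region attends to itself plus its graph neighbours.
    neighbours = [{i} for i in range(n)]
    for i, j in edges:
        neighbours[i].add(j)
        neighbours[j].add(i)

    aggregated = []
    for i in range(n):
        idx = sorted(neighbours[i])
        # Scaled dot-product scores, computed against neighbours only.
        scores = [
            sum(a * b for a, b in zip(features[i], features[j])) / math.sqrt(dim)
            for j in idx
        ]
        # Numerically stable softmax over the neighbour scores.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of neighbour features.
        aggregated.append(
            [sum(w * features[j][d] for w, j in zip(weights, idx)) for d in range(dim)]
        )
    return aggregated


def update_memory(memory, summary, gate=0.5):
    """Blend the previous memory with a new summary vector.

    A fixed scalar gate stands in for whatever learned update the
    paper uses; this is a placeholder, not the published mechanism.
    """
    return [(1.0 - gate) * m + gate * s for m, s in zip(memory, summary)]


if __name__ == "__main__":
    regions = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    relations = [(0, 1)]  # region 2 has no relations in this toy graph
    encoded = graph_attention(regions, relations)
    memory = update_memory([0.0, 0.0], encoded[0])
    print(encoded, memory)
```

In this toy graph, region 2 has no neighbours, so its feature vector passes through unchanged, while regions 0 and 1 become convex combinations of each other's features. Restricting attention to graph neighbours is one simple way to make the aggregation depend on the parsed relationships rather than on all region pairs.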

Supplementary Material

MP4 File (mfp0419_video.mp4)
Presentation video of #mfp0419 at ACM MM 2021






Information & Contributors


Published In

MM '21: Proceedings of the 29th ACM International Conference on Multimedia
October 2021
5796 pages
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]



Association for Computing Machinery

New York, NY, United States




Author Tags

  1. memory network
  2. scene graph
  3. transformer
  4. visual storytelling


  • Research-article

Funding Sources

  • NSFC


MM '21
MM '21: ACM Multimedia Conference
October 20 - 24, 2021
Virtual Event, China

Acceptance Rates

Overall Acceptance Rate 2,145 of 8,556 submissions, 25%


Bibliometrics & Citations


Article Metrics

  • Downloads (Last 12 months): 55
  • Downloads (Last 6 weeks): 5
Reflects downloads up to 05 Mar 2025


Cited By

  • (2024) Emotional Video Captioning With Vision-Based Emotion Interpretation Network. IEEE Transactions on Image Processing, Vol. 33, 1122-1135. DOI: 10.1109/TIP.2024.3359045. Online publication date: 2024
  • (2024) Multidimensional Semantic Augmented Visual Storytelling. 2024 4th International Conference on Neural Networks, Information and Communication (NNICE), 697-702. DOI: 10.1109/NNICE61279.2024.10498935. Online publication date: 19-Jan-2024
  • (2024) Bottom-Up Hierarchical Propagation Networks with Heterogeneous Graph Modeling for Video Question Answering. 2024 International Joint Conference on Neural Networks (IJCNN), 1-8. DOI: 10.1109/IJCNN60899.2024.10650620. Online publication date: 30-Jun-2024
  • (2023) Storytelling with Image Data: A Systematic Review and Comparative Analysis of Methods and Tools. Algorithms, 16:3 (135). DOI: 10.3390/a16030135. Online publication date: 2-Mar-2023
  • (2023) Text-Only Training for Visual Storytelling. Proceedings of the 31st ACM International Conference on Multimedia, 3686-3695. DOI: 10.1145/3581783.3612179. Online publication date: 26-Oct-2023
  • (2023) An Unsupervised Vision-related Keywords Retrieval and Fusion Method for Visual Storytelling. 2023 IEEE 35th International Conference on Tools with Artificial Intelligence (ICTAI), 784-790. DOI: 10.1109/ICTAI59109.2023.00120. Online publication date: 6-Nov-2023
  • (2023) With a Little Help from your own Past: Prototypical Memory Networks for Image Captioning. 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 3009-3019. DOI: 10.1109/ICCV51070.2023.00282. Online publication date: 1-Oct-2023
  • (2023) Spectral Representation Learning and Fusion for Autonomous Vehicles Trip Description Exploiting Recurrent Transformer. IEEE Access, Vol. 11, 61437-61452. DOI: 10.1109/ACCESS.2023.3287783. Online publication date: 2023
