Abstract
Recent research in AI has increasingly focused on generating narrative stories about visual scenes, which has the potential to achieve a more human-like understanding than basic description generation for images in sequence. In this work, we propose a solution for generating stories for images-in-sequence based on the Sequence to Sequence model. As a novelty, our encoder is composed of two separate encoders: one that models the behaviour of the image sequence and another that models the sentence-story generated for the previous image in the sequence. The image sequence encoder captures the temporal dependencies between the image sequence and the sentence-story, while the previous sentence-story encoder yields a better story flow. Our solution generates long, human-like stories that not only describe the visual context of the image sequence but also contain narrative and evaluative language. The obtained results were confirmed by manual human evaluation.
This research was partially funded by Pendulibrium and the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje.
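The abstract describes the dual-encoder architecture only at a high level. Below is a minimal, hypothetical sketch of such a two-encoder Sequence to Sequence model in PyTorch; it is not the authors' implementation. The use of GRU layers, the hidden and embedding dimensions, pre-extracted CNN image features, and the fusion of the two encoder states by concatenation followed by a linear projection are all assumptions made for illustration.

```python
import torch
import torch.nn as nn


class ImageSequenceEncoder(nn.Module):
    """Encodes a sequence of pre-extracted CNN image features with a GRU (assumed setup)."""
    def __init__(self, feature_dim=4096, hidden_dim=512):
        super().__init__()
        self.rnn = nn.GRU(feature_dim, hidden_dim, batch_first=True)

    def forward(self, image_features):          # (batch, seq_len, feature_dim)
        _, h = self.rnn(image_features)          # h: (1, batch, hidden_dim)
        return h


class PreviousSentenceEncoder(nn.Module):
    """Encodes the sentence-story generated for the previous image in the sequence."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):                # (batch, sent_len)
        _, h = self.rnn(self.embed(token_ids))   # h: (1, batch, hidden_dim)
        return h


class StoryDecoder(nn.Module):
    """Generates the next sentence-story, conditioned on the fused encoder state."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, init_hidden):   # teacher-forced decoding
        output, _ = self.rnn(self.embed(token_ids), init_hidden)
        return self.out(output)                  # (batch, sent_len, vocab_size)


class DualEncoderSeq2Seq(nn.Module):
    """Fuses the two encoder states and uses the result as the decoder's initial state."""
    def __init__(self, vocab_size, feature_dim=4096, hidden_dim=512):
        super().__init__()
        self.image_encoder = ImageSequenceEncoder(feature_dim, hidden_dim)
        self.sentence_encoder = PreviousSentenceEncoder(vocab_size, hidden_dim=hidden_dim)
        self.fuse = nn.Linear(2 * hidden_dim, hidden_dim)
        self.decoder = StoryDecoder(vocab_size, hidden_dim=hidden_dim)

    def forward(self, image_features, prev_sentence, target_sentence):
        h_img = self.image_encoder(image_features)      # visual context
        h_prev = self.sentence_encoder(prev_sentence)   # story so far
        h0 = torch.tanh(self.fuse(torch.cat([h_img, h_prev], dim=-1)))
        return self.decoder(target_sentence, h0)
```

In use, such a model would be applied to each image of the sequence in turn, with the sentence-story generated for the previous image fed back in as `prev_sentence`, so that the decoder's initial state reflects both the visual context and the story generated so far.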
Copyright information
© 2018 Springer Nature Switzerland AG
About this paper
Cite this paper
Smilevski, M., Lalkovski, I., Madjarov, G. (2018). Stories for Images-in-Sequence by Using Visual and Narrative Components. In: Kalajdziski, S., Ackovska, N. (eds) ICT Innovations 2018. Engineering and Life Sciences. ICT 2018. Communications in Computer and Information Science, vol 940. Springer, Cham. https://doi.org/10.1007/978-3-030-00825-3_13
DOI: https://doi.org/10.1007/978-3-030-00825-3_13
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-00824-6
Online ISBN: 978-3-030-00825-3