Stories for Images-in-Sequence by Using Visual and Narrative Components

  • Conference paper
ICT Innovations 2018. Engineering and Life Sciences (ICT 2018)

Part of the book series: Communications in Computer and Information Science (CCIS, volume 940)

Abstract

Recent research in AI has focused on generating narrative stories about visual scenes, which has the potential to achieve a more human-like understanding than basic description generation for images-in-sequence. In this work, we propose a solution for generating stories for images-in-sequence based on the Sequence to Sequence model. As a novelty, our encoder is composed of two separate encoders: one models the behaviour of the image sequence, and the other models the sentence-story generated for the previous image in the sequence. The image sequence encoder captures the temporal dependencies between the image sequence and the sentence-story, while the previous sentence-story encoder yields a better story flow. Our solution generates long, human-like stories that not only describe the visual context of the image sequence but also contain narrative and evaluative language. The obtained results were confirmed by manual human evaluation.
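
The architecture described in the abstract can be illustrated with a short, hedged example. The following is a minimal PyTorch sketch of a dual-encoder Sequence to Sequence model: one encoder over the image-feature sequence, one over the previously generated sentence-story, with their final states merged to initialise the decoder. All module names, layer sizes, and the choice of GRU cells are illustrative assumptions, not the authors' published implementation (see the project repository in the Notes below).

```python
# Hypothetical sketch of the dual-encoder Seq2Seq architecture described in the
# abstract. Names, dimensions, and the GRU cells are assumptions for illustration.
import torch
import torch.nn as nn


class DualEncoderStoryteller(nn.Module):
    def __init__(self, vocab_size, img_feat_dim=4096, embed_dim=300, hidden_dim=512):
        super().__init__()
        # Encoder 1: models the sequence of image features (e.g. CNN activations).
        self.image_encoder = nn.GRU(img_feat_dim, hidden_dim, batch_first=True)
        # Encoder 2: models the sentence-story generated for the previous image.
        self.word_embedding = nn.Embedding(vocab_size, embed_dim)
        self.prev_story_encoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        # Decoder: generates the sentence-story for the current image,
        # conditioned on both encoders' final states.
        self.decoder = nn.GRU(embed_dim, hidden_dim, batch_first=True)
        self.merge = nn.Linear(2 * hidden_dim, hidden_dim)
        self.output = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feats, prev_story_tokens, target_tokens):
        # image_feats:       (batch, num_images, img_feat_dim)
        # prev_story_tokens: (batch, prev_len) token ids of the previous sentence-story
        # target_tokens:     (batch, tgt_len)  token ids of the current sentence-story
        _, img_state = self.image_encoder(image_feats)
        _, prev_state = self.prev_story_encoder(self.word_embedding(prev_story_tokens))
        # Combine the two encoder states into one initial decoder state.
        init_state = torch.tanh(self.merge(torch.cat([img_state, prev_state], dim=-1)))
        dec_out, _ = self.decoder(self.word_embedding(target_tokens), init_state)
        return self.output(dec_out)  # (batch, tgt_len, vocab_size) logits


# Example with illustrative shapes only: 2 stories, 5 images each, 4096-d features.
model = DualEncoderStoryteller(vocab_size=10000)
logits = model(torch.randn(2, 5, 4096),
               torch.randint(0, 10000, (2, 12)),
               torch.randint(0, 10000, (2, 15)))
print(logits.shape)  # torch.Size([2, 15, 10000])
```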

This research was partially funded by Pendulibrium and the Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University in Skopje.

Notes

  1. https://github.com/Pendulibrium/ai-visual-storytelling-seq2seq.

  2. https://github.com/Pendulibrium/ai-visual-storytelling-seq2seq/tree/master/results/images.


Author information

Corresponding author

Correspondence to Marko Smilevski.

Copyright information

© 2018 Springer Nature Switzerland AG

About this paper

Cite this paper

Smilevski, M., Lalkovski, I., Madjarov, G. (2018). Stories for Images-in-Sequence by Using Visual and Narrative Components. In: Kalajdziski, S., Ackovska, N. (eds) ICT Innovations 2018. Engineering and Life Sciences. ICT 2018. Communications in Computer and Information Science, vol 940. Springer, Cham. https://doi.org/10.1007/978-3-030-00825-3_13

  • DOI: https://doi.org/10.1007/978-3-030-00825-3_13

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-00824-6

  • Online ISBN: 978-3-030-00825-3

  • eBook Packages: Computer Science, Computer Science (R0)
