
Assamese news image caption generation using attention mechanism

Published in: Multimedia Tools and Applications

Abstract

In recent times, neural networks and deep learning have made significant contributions across many research domains. In the present work, we report automatic caption generation for images using these techniques. Automatic image caption generation is an artificial intelligence problem that draws attention from both computer vision and natural language processing researchers. Most caption generation work targets English, and to the best of our knowledge none has yet been reported for Assamese. Assamese is an Indo-European language with about 14 million speakers in the North-East region of India. This paper reports image caption generation in the Assamese news domain. A quality image captioning system requires an annotated training corpus, but no such standard dataset exists for this resource-constrained language. We therefore built a dataset of 13,000 images collected from various online local Assamese e-newspapers. We employ two architectures for generating news image captions: the first is based on CNN-LSTM and the second on an attention mechanism. Both models are evaluated qualitatively and quantitatively. Qualitative analysis of the generated captions is carried out in terms of fluency and adequacy scores on a standard rating scale, while quantitative results are reported using the BLEU and CIDEr metrics. We observe that the attention-based model outperforms the CNN-LSTM model on our task.
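The abstract describes the two architectures only at a high level. As a purely illustrative sketch (not the authors' implementation), the code below shows one common way a Bahdanau-style attention decoder for image captioning is wired, in the spirit of "Show, Attend and Tell": at every LSTM step the decoder attends over the spatial feature map of a pretrained CNN encoder. The TensorFlow/Keras framing, layer sizes, and class names are assumptions; the encoder is assumed to be a CNN such as VGG16 with its final convolutional map flattened into a grid of location vectors.

```python
# Hedged sketch (assumption, not the paper's code) of an attention-based
# caption decoder: at each step the LSTM attends over CNN feature locations.
import tensorflow as tf

class BahdanauAttention(tf.keras.layers.Layer):
    def __init__(self, units):
        super().__init__()
        self.W1 = tf.keras.layers.Dense(units)  # projects image features
        self.W2 = tf.keras.layers.Dense(units)  # projects decoder state
        self.V = tf.keras.layers.Dense(1)       # scalar score per location

    def call(self, features, hidden):
        # features: (batch, locations, feat_dim); hidden: (batch, units)
        scores = self.V(tf.nn.tanh(self.W1(features) + self.W2(hidden)[:, None, :]))
        weights = tf.nn.softmax(scores, axis=1)              # attention over locations
        context = tf.reduce_sum(weights * features, axis=1)  # attended image vector
        return context, weights

class AttentionDecoder(tf.keras.Model):
    def __init__(self, vocab_size, embed_dim=256, units=512):
        super().__init__()
        self.embed = tf.keras.layers.Embedding(vocab_size, embed_dim)
        self.cell = tf.keras.layers.LSTMCell(units)
        self.attn = BahdanauAttention(units)
        self.out = tf.keras.layers.Dense(vocab_size)

    def call(self, prev_word, states, features):
        # One decoding step: embed the previous word, mix in the attended
        # image context, advance the LSTM, and predict next-word logits.
        # states is [h, c], e.g. two zero tensors of shape (batch, units) at t=0.
        context, weights = self.attn(features, states[0])
        x = tf.concat([self.embed(prev_word), context], axis=-1)
        output, states = self.cell(x, states)
        return self.out(output), states, weights
```

For the quantitative evaluation the abstract mentions, BLEU has a standard off-the-shelf implementation; the toy call below uses hypothetical English tokens, whereas the actual evaluation would use tokenised Assamese captions. CIDEr is usually computed with the MS-COCO caption evaluation toolkit rather than NLTK.

```python
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction

# Hypothetical tokens for illustration only; real references and candidates
# would be tokenised Assamese news captions.
reference = [["a", "minister", "addresses", "the", "gathering"]]
candidate = ["a", "minister", "addresses", "a", "gathering"]
print(sentence_bleu(reference, candidate,
                    smoothing_function=SmoothingFunction().method1))
```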



Notes

Online Assamese e-newspapers used as dataset sources:

  1. http://ganaadhikar.com

  2. https://niyomiyabarta.org/home/

  3. https://www.asomiyapratidin.in/


Author information

Correspondence to Ringki Das.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Das, R., Singh, T.D. Assamese news image caption generation using attention mechanism. Multimed Tools Appl 81, 10051–10069 (2022). https://doi.org/10.1007/s11042-022-12042-8

