Abstract
Neural networks and deep learning have recently made significant contributions across many research domains. In this work, we apply these techniques to the automatic generation of image captions, an artificial intelligence problem that draws attention from both computer vision and natural language processing researchers. Most caption generation work targets English and, to the best of our knowledge, no work has yet been reported for Assamese, an Indo-European language spoken by 14 million people in the North-East region of India. This paper reports image caption generation for the Assamese news domain. A quality image captioning system requires an annotated training corpus, but no standard dataset is available for this resource-constrained language. We therefore built a dataset of 13,000 images collected from various local Assamese online newspapers. We employ two architectures for generating news image captions: the first is based on CNN-LSTM and the second on the attention mechanism. The models are evaluated both qualitatively and quantitatively. The qualitative analysis rates the generated captions for fluency and adequacy on a standard rating scale, and the quantitative evaluation uses the BLEU and CIDEr metrics. We observe that the attention-based model outperforms the CNN-LSTM-based model on our task.
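To make the quantitative evaluation concrete, the following is a minimal sketch of sentence-level BLEU (clipped n-gram precision combined with a brevity penalty), assuming captions are already tokenized into lists of strings. The abstract does not state which BLEU implementation, tokenization, or smoothing was used, so the function names `ngrams` and `sentence_bleu` and the `max_n` parameter are illustrative assumptions and not the authors' evaluation code.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_bleu(candidate, references, max_n=4):
    """Unsmoothed sentence-level BLEU of one candidate against reference captions.

    Hypothetical helper for illustration; uniform weights over 1..max_n grams.
    """
    if not candidate:
        return 0.0
    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        cand_counts = ngrams(candidate, n)
        total = sum(cand_counts.values())
        if total == 0:
            return 0.0  # candidate is shorter than n tokens
        # Clip each candidate n-gram count by its maximum count in any reference.
        max_ref = Counter()
        for ref in references:
            for gram, count in ngrams(ref, n).items():
                max_ref[gram] = max(max_ref[gram], count)
        clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
        if clipped == 0:
            return 0.0  # without smoothing, one zero precision zeroes the score
        log_precision_sum += math.log(clipped / total) / max_n
    # Brevity penalty uses the reference length closest to the candidate length.
    c = len(candidate)
    r = min((len(ref) for ref in references), key=lambda length: (abs(length - c), length))
    brevity_penalty = 1.0 if c > r else math.exp(1 - r / c)
    return brevity_penalty * math.exp(log_precision_sum)
```

Calling `sentence_bleu(generated_tokens, [reference_tokens])` returns a score between 0 and 1. Because this sketch applies no smoothing, short captions with no matching 4-gram score 0, which is one reason corpus-level or smoothed variants are usually preferred in practice.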

Cite this article
Das, R., Singh, T.D. Assamese news image caption generation using attention mechanism. Multimed Tools Appl 81, 10051–10069 (2022). https://doi.org/10.1007/s11042-022-12042-8
DOI: https://doi.org/10.1007/s11042-022-12042-8