Image captioning in Hindi language using transformer networks

https://doi.org/10.1016/j.compeleceng.2021.107114

Abstract

Neural encoder–decoder architectures have been used extensively for image captioning, with Convolutional Neural Networks (CNNs) typically serving as encoders and Recurrent Neural Networks (RNNs) as decoders. RNNs are popular language-modeling architectures in natural language processing, but they process tokens sequentially. The transformer model removes this sequential dependency by relying on an attention mechanism. Many works exist for image captioning in the English language, but models for generating Hindi captions are limited; we aim to fill this gap. We have created a Hindi dataset for image captioning by translating the popular MSCOCO dataset from English to Hindi and manually correcting the translations. Experimental results show that the proposed model outperforms the compared models, attaining BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 62.9, 43.3, 29.1, and 19.0, respectively.

Introduction

Generating well-formed sentences for an image is a challenging task, more complex than image classification or image recognition. A caption must capture not only the objects in an image but also their attributes, their activities, and the relationships among them; hence, beyond scene understanding, a language model is needed to express this semantic knowledge in natural language. Image captioning can help visually challenged people interpret content on the web, and it helps people organize and navigate large amounts of unstructured visual data. Recent advances in object recognition and language modeling have made it possible to generate relevant captions for an image, and image captioning is now an essential research area in computer vision, natural language processing, and image processing. Despite the complexity of the problem, a substantial body of work addresses it. Advances in deep neural networks [1], [2] and the development of large classification datasets such as ImageNet [3] have improved the quality of captions generated using CNNs and RNNs. In previous works on image captioning [4], [5], a pre-trained CNN is used as an encoder, and its last hidden layer is fed as input to a decoder RNN. In the current study, we also use a similar encoder–decoder architecture, as sketched below.
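As a hedged illustration of the "pre-trained CNN as encoder" idea (the results section mentions ResNet-101 features), the following PyTorch sketch, which is not the authors' code, strips the classification head from torchvision's ResNet-101 so its final convolutional feature map can feed a decoder:

```python
# A minimal sketch of CNN feature extraction for captioning.
import torch
import torchvision.models as models

# Load ImageNet-pretrained ResNet-101 and drop the avgpool/fc layers.
backbone = models.resnet101(pretrained=True)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])
encoder.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)        # stands in for a normalized RGB image
    feats = encoder(image)                     # (1, 2048, 7, 7) spatial feature map
    # Flatten the spatial grid into a sequence of 49 region vectors,
    # the shape an attention-based decoder typically expects.
    feats = feats.flatten(2).permute(0, 2, 1)  # (1, 49, 2048)
print(feats.shape)
```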

Transformers are a special kind of neural network proposed by the authors of [6]. They introduced an attention-based encoder–decoder model that transforms one input sequence into another. It differs from conventional sequence-to-sequence models in that it does not use a recurrent neural network. However, to the best of our knowledge, transformer-based models have not been applied to image captioning; inspired by this, we propose a transformer-based architecture for the image captioning task.
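The mechanism that lets the transformer relate all positions of a sequence in parallel, instead of recurrently, is the scaled dot-product attention of [6]. A self-contained PyTorch sketch:

```python
# Scaled dot-product attention from Vaswani et al. [6].
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k); mask: boolean, True = position blocked."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                                 # weighted sum of values
```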

Key contributions of this paper are as follows:

  • We have created a Hindi dataset for image captioning, as no such dataset was previously available in the Hindi language. We used the well-known MSCOCO dataset and translated it into Hindi using Google Translate; the machine-translated corpus was then corrected by human annotators.

  • We have developed a novel image captioning model using the transformer architecture [6]. We use a CNN as an encoder for feature extraction and the transformer model as a decoder; a schematic sketch of this design follows the list.
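The sketch below wires the two pieces together with illustrative, not the paper's exact, hyperparameters: CNN region features serve as the "memory" of a standard transformer decoder that generates the caption.

```python
# Schematic CNN-encoder / transformer-decoder captioning model.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4,
                 feat_dim=2048, max_len=50):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # CNN features -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)  # Hindi token embeddings
        # Learned positions keep the sketch short; [6] uses sinusoidal encodings.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, feats):
        # tokens: (B, T) caption prefix; feats: (B, 49, feat_dim) region features
        memory = self.proj(feats)
        x = self.embed(tokens) + self.pos[:, :tokens.size(1)]
        t = tokens.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)    # masked self- + cross-attention
        return self.out(h)                              # (B, T, vocab_size)
```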

The proposed image captioning architecture attains a BLEU-1 score of 62.9, BLEU-2 score of 43.3, BLEU-3 score of 29.1, and BLEU-4 score of 19.0. To establish the model's efficacy, we compare our results with four baseline models, as shown in Table 6; the results illustrate that the proposed method outperforms the others. The organization of the paper is as follows:

Section 2 reviews the literature on image captioning. The motivation for our work is described in Section 3. Section 4 describes the proposed methodology. The experimental setup and results are presented in Section 5 and Section 6, respectively. Finally, the conclusion of the paper is presented in Section 7.

Section snippets

Related works

Previous works take two approaches to the image captioning problem: the top-down approach [7], [8] and the bottom-up approach [9], [10].

In the top-down approach, the input image is converted directly into words in an end-to-end formulation from image to sentence, with all network parameters learned during training. The bottom-up approach first comes up with words and then combines them to form the caption. The

Motivation

Hindi is the official language of India, along with English, and it is widely spoken in South Asia; over half a billion people around the world speak Hindi. It is therefore important to develop an image captioning model for Hindi. To the best of our knowledge, there is only one existing work [19] on image captioning for the Hindi language, but that method suffers from incorrect caption generation because its language model suffers from long

Proposed methodology

In this paper, we develop a novel image captioning model for the Hindi language. We use the transformer model proposed by the authors of [6] (as shown in Fig. 1).
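The preview does not show the inference procedure; as a hedged sketch of how such an encoder–decoder captioner typically generates a caption token by token, the loop below uses the `CaptionDecoder` sketched earlier, with hypothetical `bos_id`/`eos_id` special-token ids for the Hindi vocabulary:

```python
# Greedy caption generation (illustrative, not the authors' decoding scheme).
import torch

@torch.no_grad()
def greedy_caption(model, feats, bos_id, eos_id, max_len=50):
    # Start every caption in the batch with the (hypothetical) BOS token.
    tokens = torch.full((feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len - 1):
        logits = model(tokens, feats)                    # (B, T, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():                   # stop once all captions end
            break
    return tokens
```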

Experimental setup

This section describes the dataset creation procedure and the evaluation techniques used.
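The reported metrics are BLEU-1 through BLEU-4. A minimal sketch of this evaluation using NLTK's `corpus_bleu` follows; the whitespace tokenization and the two toy caption pairs are only illustrative:

```python
# BLEU-1..4 evaluation sketch with NLTK.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Placeholder data: each generated caption has a list of reference captions.
generated = ["एक आदमी घोड़े की सवारी कर रहा है"]
references = [["एक आदमी घोड़े पर सवार है", "घोड़े की सवारी करता एक आदमी"]]

hyps = [h.split() for h in generated]
refs = [[r.split() for r in rs] for rs in references]

smooth = SmoothingFunction().method1       # avoids zero scores on tiny samples
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights up to n-grams
    score = corpus_bleu(refs, hyps, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.1f}")
```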

Results and discussion

This section reports qualitative and quantitative analyses of the generated captions. To the best of our knowledge, there is only a single prior work on image captioning in the Hindi language, by Dhir et al. [19]; hence, we compare our approach with this work. We also compare our results with some popular encoder–decoder architectures frequently used in image captioning. The proposed model outperforms all other models, as shown in Table 6.

We have used ResNet-101 to extract features from

Conclusions and future work

In this paper, a novel image captioning model using transformer networks is developed for the Hindi language. An encoder–decoder architecture is used: a CNN serves as the encoder and a transformer model as the decoder. The advantage of the transformer network is that it can attend to words on both the left and right of the current word, since it uses vector embeddings and positional encodings together. It does not suffer from the long term
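The positional encoding referred to here is, in [6], the fixed sinusoidal encoding added to the word embeddings; a short sketch:

```python
# Sinusoidal positional encoding from Vaswani et al. [6].
import math
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe                            # (max_len, d_model), added to embeddings
```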

CRediT authorship contribution statement

Santosh Kumar Mishra: Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft. Rijul Dhir: Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft. Sriparna Saha: Conceptualization, Supervision, Writing - review & editing, Data curation, Project administration, Funding acquisition. Pushpak Bhattacharyya: Supervision, Project administration, Funding acquisition. Amit Kumar Singh: Supervision, Writing - review & editing.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107114.

Acknowledgment

Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by the Visvesvaraya PhD Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, implemented by Digital India Corporation (formerly Media Lab Asia), for carrying out this research.

References (25)

  • Huang F, et al. Image-text sentiment analysis via deep multimodal attentive fusion. Knowl Based Syst (2019).
  • Krizhevsky A, et al. ImageNet classification with deep convolutional neural networks.
  • Russakovsky O, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis (2015).
  • Xu K, et al. Show, attend and tell: Neural image caption generation with visual attention.
  • You Q, et al. Image captioning with semantic attention.
  • Vaswani A, et al. Attention is all you need.
  • Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. arXiv...
  • Sutskever I, et al. Sequence to sequence learning with neural networks.
  • Farhadi A, et al. Every picture tells a story: Generating sentences from images.
  • Elliott D, et al. Image description using visual dependency representations.
  • Vinyals O, et al. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans Pattern Anal Mach Intell (2016).
  • Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations...

    Santosh Kumar Mishra received M.Tech. degree in computer science and engineering from the Indian Institute of Information Technology, Design, and Manufacturing, Jabalpur. He is currently pursuing a Ph.D. degree with IIT Patna, Patna, India. His research interest is in natural language processing and computer vision.

Rijul Dhir received his Bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology Patna, India, in 2019 and is currently pursuing his M.S. (CS) degree at the University of Southern California, USA. He has experience as a Software Engineer, and his research interests include Computer Vision, NLP, ML, and Data Science.

    Sriparna Saha is currently serving as an Associate Professor in the Department of Computer Science and Engineering, Indian Institute of Technology Patna, India. She has authored more than 212 papers. Her current research interests include machine learning, natural language processing, multiobjective optimization and biomedical information extraction. Her h-index is 26 and the total Google scholar citation count of her papers is 4016.

Pushpak Bhattacharyya is the former Director of IIT Patna (2015–20) and a Professor in the Department of Computer Science and Engineering, IIT Bombay, where he also held the Vijay and Sita Vashi Chair Professorship. His research areas are Natural Language Processing, Machine Learning, and AI (NLP-ML-AI). He was President (2016–17) of the Association for Computational Linguistics (ACL), the premier international body of computational linguistics.

Amit Kumar Singh is currently an Assistant Professor with the Computer Science and Engineering Department, National Institute of Technology Patna, Bihar, India. He has authored over 100 peer-reviewed journal and conference publications and book chapters. He is an associate editor of IEEE Access, IET Image Processing, and Telecommunication Systems, Springer. His research interests include multimedia data hiding, image processing, biometrics, and cryptography.

    This paper is for special section VSI-dlis. Reviews processed and recommended for publication by Guest Editor Feiran Huang.