Image captioning in Hindi language using transformer networks

https://doi.org/10.1016/j.compeleceng.2021.107114

Abstract

Neural encoder–decoder architectures have been used extensively for image captioning, with Convolutional Neural Networks (CNNs) typically serving as encoders and Recurrent Neural Networks (RNNs) as decoders. RNNs are popular language-modeling architectures in natural language processing, but they process tokens sequentially. The transformer model removes this sequential dependency by relying on an attention mechanism. Many works exist for image captioning in the English language, but models for generating Hindi captions are limited; we aim to fill this gap. We have created a Hindi dataset for image captioning by translating the popular MSCOCO dataset from English to Hindi and manually correcting the translations. Experimental results show that the proposed model outperforms the compared models, attaining BLEU-1, BLEU-2, BLEU-3, and BLEU-4 scores of 62.9, 43.3, 29.1, and 19.0, respectively.

Introduction

Generating well-formed sentences for an image is a challenging task, more complex than image classification or image recognition. A caption must capture not only the objects in an image but also their attributes, their activities, and the relationships among them; hence, beyond scene understanding, a language model is needed to express this semantic knowledge in natural language. Image captioning can help visually challenged people interpret content on the web, and it helps people organize and navigate large amounts of unstructured visual data. Recent advances in object recognition and language modeling have made it possible to generate relevant captions for an image, and image captioning is now an essential research area in computer vision, natural language processing, and image processing. Despite the complexity of the problem, a substantial body of work addresses it. Advances in deep neural networks [1], [2] and the development of large classification datasets such as ImageNet [3] have improved the quality of captions generated using CNNs and RNNs. In previous works on image captioning [4], [5], a pre-trained CNN is used as an encoder, and its last hidden layer is fed as input to a decoder RNN. In the current study, we also use a similar encoder–decoder architecture, as sketched below.
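As a hedged illustration of the "pre-trained CNN as encoder" idea (the results section mentions ResNet-101 features), the following PyTorch sketch, which is not the authors' code, strips the classification head from torchvision's ResNet-101 so its final convolutional feature map can feed a decoder:

```python
# A minimal sketch of CNN feature extraction for captioning.
import torch
import torchvision.models as models

# Load ImageNet-pretrained ResNet-101 and drop the avgpool/fc layers.
backbone = models.resnet101(pretrained=True)
encoder = torch.nn.Sequential(*list(backbone.children())[:-2])
encoder.eval()

with torch.no_grad():
    image = torch.randn(1, 3, 224, 224)        # stands in for a normalized RGB image
    feats = encoder(image)                     # (1, 2048, 7, 7) spatial feature map
    # Flatten the spatial grid into a sequence of 49 region vectors,
    # the shape an attention-based decoder typically expects.
    feats = feats.flatten(2).permute(0, 2, 1)  # (1, 49, 2048)
print(feats.shape)
```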

Transformers are a special kind of neural network proposed by the authors of [6]. They introduced an attention-based encoder–decoder model that transforms one input sequence into another. It differs from conventional sequence-to-sequence models in that it does not use a recurrent neural network. However, to the best of our knowledge, transformer-based models have not been applied to image captioning; inspired by this, we propose a transformer-based architecture for the image captioning task.
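The mechanism that lets the transformer relate all positions of a sequence in parallel, instead of recurrently, is the scaled dot-product attention of [6]. A self-contained PyTorch sketch:

```python
# Scaled dot-product attention from Vaswani et al. [6].
import math
import torch

def scaled_dot_product_attention(q, k, v, mask=None):
    """q, k, v: (batch, seq_len, d_k); mask: boolean, True = position blocked."""
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / math.sqrt(d_k)  # pairwise similarities
    if mask is not None:
        scores = scores.masked_fill(mask, float("-inf"))
    weights = torch.softmax(scores, dim=-1)            # attention distribution
    return weights @ v                                 # weighted sum of values
```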

Key contributions of this paper are as follows:

  • We have created a Hindi dataset for image captioning, as no such dataset was previously available in the Hindi language. We used the well-known MSCOCO dataset and translated it into Hindi using Google Translate; the machine-translated corpus was then corrected by human annotators.

  • We have developed a novel image captioning model using the transformer architecture [6]. We use a CNN as an encoder for feature extraction and the transformer model as a decoder; a schematic sketch of this design follows the list.
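The sketch below wires the two pieces together with illustrative, not the paper's exact, hyperparameters: CNN region features serve as the "memory" of a standard transformer decoder that generates the caption.

```python
# Schematic CNN-encoder / transformer-decoder captioning model.
import torch
import torch.nn as nn

class CaptionDecoder(nn.Module):
    def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=4,
                 feat_dim=2048, max_len=50):
        super().__init__()
        self.proj = nn.Linear(feat_dim, d_model)        # CNN features -> d_model
        self.embed = nn.Embedding(vocab_size, d_model)  # Hindi token embeddings
        # Learned positions keep the sketch short; [6] uses sinusoidal encodings.
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))
        layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, tokens, feats):
        # tokens: (B, T) caption prefix; feats: (B, 49, feat_dim) region features
        memory = self.proj(feats)
        x = self.embed(tokens) + self.pos[:, :tokens.size(1)]
        t = tokens.size(1)
        causal = torch.triu(torch.ones(t, t, dtype=torch.bool), diagonal=1)
        h = self.decoder(x, memory, tgt_mask=causal)    # masked self- + cross-attention
        return self.out(h)                              # (B, T, vocab_size)
```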

The proposed image captioning architecture attains a BLEU-1 score of 62.9, BLEU-2 score of 43.3, BLEU-3 score of 29.1, and BLEU-4 score of 19.0. To establish the model's efficacy, we compare our results with four baseline models, as shown in Table 6; the results illustrate that the proposed method outperforms the others. The organization of the paper is as follows:

Section 2 reviews the literature on image captioning. The motivation for our work is described in Section 3. Section 4 describes the proposed methodology. The experimental setup and results are presented in Section 5 and Section 6, respectively. Finally, the conclusion of the paper is presented in Section 7.

Section snippets

Related works

Previous works take two approaches to the image captioning problem: the top-down approach [7], [8] and the bottom-up approach [9], [10].

In the top-down approach, the input image is converted directly into words in an end-to-end formulation from image to sentence, with all network parameters learned during training. The bottom-up approach first comes up with words and then combines them to form the caption. The

Motivation

Hindi is the official language of India, along with English, and it is widely spoken in South Asia; over half a billion people around the world speak Hindi. It is therefore important to develop an image captioning model for Hindi. To the best of our knowledge, there is only one existing work [19] on image captioning for the Hindi language, but that method suffers from incorrect caption generation because its language model suffers from long

Proposed methodology

In this paper, we develop a novel image captioning model for the Hindi language. We use the transformer model proposed by the authors of [6] (as shown in Fig. 1).
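The preview does not show the inference procedure; as a hedged sketch of how such an encoder–decoder captioner typically generates a caption token by token, the loop below uses the `CaptionDecoder` sketched earlier, with hypothetical `bos_id`/`eos_id` special-token ids for the Hindi vocabulary:

```python
# Greedy caption generation (illustrative, not the authors' decoding scheme).
import torch

@torch.no_grad()
def greedy_caption(model, feats, bos_id, eos_id, max_len=50):
    # Start every caption in the batch with the (hypothetical) BOS token.
    tokens = torch.full((feats.size(0), 1), bos_id, dtype=torch.long)
    for _ in range(max_len - 1):
        logits = model(tokens, feats)                    # (B, T, vocab)
        next_tok = logits[:, -1].argmax(-1, keepdim=True)
        tokens = torch.cat([tokens, next_tok], dim=1)
        if (next_tok == eos_id).all():                   # stop once all captions end
            break
    return tokens
```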

Experimental setup

This section describes the dataset creation procedure and the evaluation techniques used.
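The reported metrics are BLEU-1 through BLEU-4. A minimal sketch of this evaluation using NLTK's `corpus_bleu` follows; the whitespace tokenization and the two toy caption pairs are only illustrative:

```python
# BLEU-1..4 evaluation sketch with NLTK.
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Placeholder data: each generated caption has a list of reference captions.
generated = ["एक आदमी घोड़े की सवारी कर रहा है"]
references = [["एक आदमी घोड़े पर सवार है", "घोड़े की सवारी करता एक आदमी"]]

hyps = [h.split() for h in generated]
refs = [[r.split() for r in rs] for rs in references]

smooth = SmoothingFunction().method1       # avoids zero scores on tiny samples
for n in range(1, 5):
    weights = tuple(1.0 / n for _ in range(n))   # uniform weights up to n-grams
    score = corpus_bleu(refs, hyps, weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {100 * score:.1f}")
```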

Results and discussion

This section reports qualitative and quantitative analyses of the generated captions. To the best of our knowledge, there is only a single prior work on image captioning in the Hindi language, by Dhir et al. [19]; hence, we compare our approach with this work. We also compare our results with some popular encoder–decoder architectures frequently used in image captioning. The proposed model outperforms all other models, as shown in Table 6.

We have used ResNet-101 to extract features from

Conclusions and future work

In this paper, a novel image captioning model using transformer networks is developed for the Hindi language. An encoder–decoder architecture is used: a CNN serves as the encoder and a transformer model as the decoder. The advantage of the transformer network is that it can attend to words on both the left and right of the current word, since it uses vector embeddings and positional encodings together. It does not suffer from the long term
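The positional encoding referred to here is, in [6], the fixed sinusoidal encoding added to the word embeddings; a short sketch:

```python
# Sinusoidal positional encoding from Vaswani et al. [6].
import math
import torch

def positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float()
                    * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # even dimensions
    pe[:, 1::2] = torch.cos(pos * div)   # odd dimensions
    return pe                            # (max_len, d_model), added to embeddings
```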

CRediT authorship contribution statement

Santosh Kumar Mishra: Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft. Rijul Dhir: Methodology, Software, Validation, Formal analysis, Investigation, Writing - original draft. Sriparna Saha: Conceptualization, Supervision, Writing - review & editing, Data curation, Project administration, Funding acquisition. Pushpak Bhattacharyya: Supervision, Project administration, Funding acquisition. Amit Kumar Singh: Supervision, Writing - review & editing.

Declaration of Competing Interest

No author associated with this paper has disclosed any potential or pertinent conflicts which may be perceived to have impending conflict with this work. For full disclosure statements refer to https://doi.org/10.1016/j.compeleceng.2021.107114.

Acknowledgment

Dr. Sriparna Saha gratefully acknowledges the Young Faculty Research Fellowship (YFRF) Award, supported by the Visvesvaraya PhD Scheme for Electronics and IT, Ministry of Electronics and Information Technology (MeitY), Government of India, implemented by Digital India Corporation (formerly Media Lab Asia), for carrying out this research.

References (25)

  • Huang F, et al. Image-text sentiment analysis via deep multimodal attentive fusion. Knowl Based Syst (2019).
  • Krizhevsky A, et al. ImageNet classification with deep convolutional neural networks.
  • Russakovsky O, et al. ImageNet large scale visual recognition challenge. Int J Comput Vis (2015).
  • Xu K, et al. Show, attend and tell: Neural image caption generation with visual attention.
  • You Q, et al. Image captioning with semantic attention.
  • Vaswani A, et al. Attention is all you need.
  • Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. 2014. arXiv...
  • Sutskever I, et al. Sequence to sequence learning with neural networks.
  • Farhadi A, et al. Every picture tells a story: Generating sentences from images.
  • Elliott D, et al. Image description using visual dependency representations.
  • Vinyals O, et al. Show and tell: Lessons learned from the 2015 MSCOCO image captioning challenge. IEEE Trans Pattern Anal Mach Intell (2016).
  • Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y. Learning phrase representations...

    Santosh Kumar Mishra received M.Tech. degree in computer science and engineering from the Indian Institute of Information Technology, Design, and Manufacturing, Jabalpur. He is currently pursuing a Ph.D. degree with IIT Patna, Patna, India. His research interest is in natural language processing and computer vision.

Rijul Dhir received his Bachelor's degree in Computer Science and Engineering from the Indian Institute of Technology Patna, India, in 2019 and is currently pursuing his M.S. (CS) degree at the University of Southern California, USA. He has experience as a Software Engineer, and his research interests include Computer Vision, NLP, ML, and Data Science.

    Sriparna Saha is currently serving as an Associate Professor in the Department of Computer Science and Engineering, Indian Institute of Technology Patna, India. She has authored more than 212 papers. Her current research interests include machine learning, natural language processing, multiobjective optimization and biomedical information extraction. Her h-index is 26 and the total Google scholar citation count of her papers is 4016.

Pushpak Bhattacharyya is the former Director of IIT Patna (2015–20) and a Professor in the Department of Computer Science and Engineering, IIT Bombay, where he also held the Vijay and Sita Vashi Chair Professorship. His research areas are Natural Language Processing, Machine Learning, and AI (NLP-ML-AI). He was President (2016–17) of the Association for Computational Linguistics (ACL), the premier international body of computational linguistics.

Amit Kumar Singh is currently an Assistant Professor with the Computer Science and Engineering Department, National Institute of Technology Patna, Bihar, India. He has authored over 100 peer-reviewed journal and conference publications and book chapters. He is an associate editor of IEEE Access, IET Image Processing, and Telecommunication Systems, Springer. His research interests include multimedia data hiding, image processing, biometrics, and cryptography.

    This paper is for special section VSI-dlis. Reviews processed and recommended for publication by Guest Editor Feiran Huang.