Abstract
Image captioning frameworks usually follow an encoder-decoder paradigm: the encoder receives abstract image feature vectors as input, and the decoder performs language modeling. Most prominent architectures today employ features from region proposals produced by an object detection module. In this work, we propose a novel architecture for image captioning that integrates an object detection module with a transformer encoder and uses GPT-2 (Generative Pre-trained Transformer) as the decoder. The encoder exploits the spatial relationships among detected objects. We introduce this methodology for image caption generation in Hindi, which is widely spoken in South Asia, is the world's third most spoken language, and is India's official language. In terms of BLEU scores, the proposed approach outperforms the other baselines. Human assessment of adequacy and fluency further confirms the efficacy of the proposed approach in generating correct captions.
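The geometric attention the abstract refers to can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes the relative-geometry encoding popularized by relation networks for object detection (reference [19] in this paper's bibliography): pairwise log-ratios of box offsets and sizes are projected by a learned weight vector (`w_g` here, a hypothetical name) and used to bias the appearance-based attention logits.

```python
import numpy as np

def box_geometry_features(boxes):
    """Pairwise geometric features between boxes given as (x, y, w, h):
    log-scaled center offsets and log size ratios, shape (N, N, 4)."""
    x, y, w, h = boxes.T
    dx = np.log(np.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def geometric_attention(features, boxes, w_g):
    """Scaled dot-product self-attention over region features whose
    logits are biased by a learned projection of box geometry.
    `w_g` (shape (4,)) stands in for the learned geometry weights."""
    d = features.shape[-1]
    logits = features @ features.T / np.sqrt(d)          # appearance term
    geo = np.maximum(box_geometry_features(boxes) @ w_g, 0.0)  # ReLU-gated geometry term
    logits = logits + np.log(geo + 1e-6)                 # combine, relation-network style
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ features                            # geometry-aware region features
```

In a full model, the refined region features would feed the cross-attention layers of the GPT-2 decoder; here the sketch only shows how box geometry modulates encoder self-attention.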
Index Terms
- GAGPT-2: A Geometric Attention-based GPT-2 Framework for Image Captioning in Hindi