
GAGPT-2: A Geometric Attention-based GPT-2 Framework for Image Captioning in Hindi

Published: 13 October 2023

Abstract

Image captioning frameworks usually employ an encoder-decoder paradigm, in which the encoder receives abstract image feature vectors as input and the decoder performs language modeling. Most prominent architectures now employ features from region proposals produced by object detection modules. In this work, we propose a novel architecture for image captioning: an object detection module integrated with a transformer architecture serves as the encoder, and GPT-2 (Generative Pre-trained Transformer) serves as the decoder. The encoder exploits the spatial relationships among the detected objects. We introduce a unique methodology for image caption generation in Hindi, which is widely spoken in South Asia, is the world’s third most spoken language, and is India’s official language. The proposed approach is compared with several baselines in terms of BLEU scores, and the results illustrate that it outperforms them. The efficacy of the proposed approach in generating correct captions is further assessed through human evaluation in terms of adequacy and fluency.

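To make the geometric-attention idea concrete, the sketch below shows one common way to bias region-level self-attention with pairwise bounding-box geometry, in the spirit of relation networks for object detection. This is a minimal illustrative sketch, not the paper's exact implementation: the module names, the `(x, y, w, h)` box layout, and the hyperparameters are assumptions introduced here for exposition.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def box_relation_features(boxes):
    """Pairwise geometric features between boxes given as (x, y, w, h).

    Returns an (N, N, 4) tensor of [log(|dx|/w_i), log(|dy|/h_i),
    log(w_j/w_i), log(h_j/h_i)], a relation-network-style encoding of
    relative position and scale between every pair of detected regions.
    """
    x, y, w, h = boxes.unbind(-1)  # each of shape (N,)
    dx = torch.log(torch.clamp((x[:, None] - x[None, :]).abs(), min=1e-3) / w[:, None])
    dy = torch.log(torch.clamp((y[:, None] - y[None, :]).abs(), min=1e-3) / h[:, None])
    dw = torch.log(w[None, :] / w[:, None])
    dh = torch.log(h[None, :] / h[:, None])
    return torch.stack([dx, dy, dw, dh], dim=-1)  # (N, N, 4)


class GeometricSelfAttention(nn.Module):
    """Single-head self-attention over region features, biased by box geometry.

    Illustrative sketch only: the real encoder may use multiple heads,
    different geometric embeddings, or a different fusion of the two scores.
    """

    def __init__(self, dim, geo_dim=64):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Small MLP that maps pairwise box geometry to a scalar bias.
        self.geo = nn.Sequential(nn.Linear(4, geo_dim), nn.ReLU(), nn.Linear(geo_dim, 1))
        self.scale = dim ** -0.5

    def forward(self, feats, boxes):
        # feats: (N, dim) region features; boxes: (N, 4) as (x, y, w, h)
        q, k, v = self.q(feats), self.k(feats), self.v(feats)
        appearance = (q @ k.t()) * self.scale                           # (N, N)
        geometry = self.geo(box_relation_features(boxes)).squeeze(-1)   # (N, N)
        # Combine appearance and (non-negative) geometric scores before softmax,
        # as in relation networks: softmax(appearance + log max(geo, eps)).
        attn = F.softmax(
            appearance + torch.log(torch.clamp(F.relu(geometry), min=1e-6)), dim=-1
        )
        return attn @ v  # (N, dim) geometry-aware region features
```

Such a module would sit inside the transformer encoder, refining the region features from the object detector before they are passed to the GPT-2 decoder for caption generation.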


• Published in

  ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 10
  October 2023, 226 pages
  ISSN: 2375-4699
  EISSN: 2375-4702
  DOI: 10.1145/3627976


      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 13 October 2023
      • Online AM: 11 September 2023
      • Accepted: 21 August 2023
      • Revised: 25 April 2023
      • Received: 21 August 2022
Published in TALLIP Volume 22, Issue 10

