An Object Localization-based Dense Image Captioning Framework in Hindi

Published: 27 December 2022

Abstract

Dense image captioning is the task of generating localized natural-language captions for multiple regions of an image. It draws on both computer vision, for recognizing regions in an image, and natural language processing, for generating captions. Numerous works address dense image captioning in resource-rich languages such as English; resource-poor languages such as Hindi, however, remain largely unexplored. Hindi is one of India’s official languages and the third most spoken language in the world. This article proposes a dense image captioning model that describes different segments of an image by generating more than one caption in the Hindi language. For localized region recognition and language modeling, we employ Faster R-CNN and Long Short-Term Memory (LSTM) networks, respectively. In addition, we conduct experiments with gated recurrent units (GRUs) and an attention mechanism. A dataset for dense image captioning in Hindi has been created by manually translating the well-known Visual Genome dataset from English to Hindi. Experiments on this newly constructed dataset demonstrate the efficacy of the proposed method over state-of-the-art methods.
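
To make the pipeline concrete, the following is a minimal sketch, not the authors’ released code, of the architecture the abstract describes: a pretrained Faster R-CNN localizes regions, and an LSTM decoder conditioned on each region’s pooled feature emits a Hindi caption token by token. It assumes PyTorch and torchvision; all dimensions and the helper `decode_region` are illustrative.

```python
import torch
import torch.nn as nn
import torchvision


class DenseCaptioner(nn.Module):
    """Illustrative dense captioner: Faster R-CNN regions + per-region LSTM decoder."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=1024):
        super().__init__()
        # Pretrained detector supplies region proposals and boxes
        # (torchvision >= 0.13; use pretrained=True on older versions).
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
            weights="DEFAULT")
        self.embed = nn.Embedding(vocab_size, embed_dim)  # Hindi vocabulary
        self.init_h = nn.Linear(feat_dim, hidden_dim)     # region feature -> h0
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def decode_region(self, region_feat, captions):
        # region_feat: (B, feat_dim) RoI-pooled feature of one detected box;
        # captions:    (B, T) token ids of the ground-truth Hindi caption.
        h0 = torch.tanh(self.init_h(region_feat)).unsqueeze(0)  # (1, B, hidden)
        c0 = torch.zeros_like(h0)                               # empty cell state
        emb = self.embed(captions)                              # (B, T, embed)
        hidden, _ = self.lstm(emb, (h0, c0))                    # (B, T, hidden)
        return self.out(hidden)                                 # per-step vocab logits
```

Replacing `nn.LSTM` with `nn.GRU` (and dropping the cell state `c0`) gives the GRU variant the abstract mentions, and Bahdanau-style attention over all region features can replace the fixed initial state; both correspond to the additional experiments the abstract reports.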




    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
February 2023, 624 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3572719

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 27 December 2022
    Online AM: 22 August 2022
    Accepted: 16 May 2022
    Revised: 26 March 2022
    Received: 21 November 2021
    Published in TALLIP Volume 22, Issue 2


    Author Tags

    1. Dense image captioning
    2. attention
    3. deep-learning
    4. Hindi

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Young Faculty Research Fellowship program of the Visvesvaraya Ph.D. Scheme, Ministry of Electronics & Information Technology, Government of India
    • Digital India Corporation (formerly Media Lab Asia)


    Cited By

    • (2024) A Deep Learning-Based Efficient Image Captioning Approach for Hindi Language. In Developments Towards Next Generation Intelligent Systems for Sustainable Development, 225–246. DOI: 10.4018/979-8-3693-5643-2.ch009. Online publication date: 5-Apr-2024.
    • (2024) Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders. Turkish Journal of Engineering. DOI: 10.31127/tuje.1507442. Online publication date: 22-Aug-2024.
    • (2024) A Survey on Image Captioning Using Encoder-Decoder Based Deep Learning Models. In 2024 Parul International Conference on Engineering and Technology (PICET), 1–5. DOI: 10.1109/PICET60765.2024.10716063. Online publication date: 3-May-2024.
    • (2024) A Multimodal Framework For Satire Vs. Sarcasm Detection. In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–7. DOI: 10.1109/ICCCNT61001.2024.10726230. Online publication date: 24-Jun-2024.
    • (2024) Multi-News Summarization Using Multi-Objective Differential Evolution Framework. In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–7. DOI: 10.1109/ICCCNT61001.2024.10723904. Online publication date: 24-Jun-2024.
    • (2024) Generating Image Captions in Hindi Based on Encoder-Decoder Based Deep Learning Techniques. In Reliability Engineering for Industrial Processes, 81–94. DOI: 10.1007/978-3-031-55048-5_6. Online publication date: 23-Apr-2024.
    • (2023) A Multiheaded Attention-Based Model for Generating Hindi Captions. In Proceedings of Third Emerging Trends and Technologies on Intelligent Systems, 677–684. DOI: 10.1007/978-981-99-3963-3_51. Online publication date: 20-Sep-2023.
    • (2023) DTDT: Highly Accurate Dense Text Line Detection in Historical Documents via Dynamic Transformer. In Document Analysis and Recognition - ICDAR 2023, 381–396. DOI: 10.1007/978-3-031-41676-7_22. Online publication date: 21-Aug-2023.
