An Object Localization-based Dense Image Captioning Framework in Hindi

Published: 27 December 2022

Abstract

Dense image captioning is the task of generating localized natural-language captions for multiple regions of an image. It draws on both computer vision, for recognizing regions in an image, and natural language processing, for generating captions. Numerous works address dense image captioning in resource-rich languages such as English; resource-poor languages such as Hindi, however, remain largely unexplored. Hindi is one of India’s official languages and the third most spoken language in the world. This article proposes a dense image captioning model that describes different segments of an image by generating more than one caption in the Hindi language. For localized region recognition and language modeling, we employ Faster R-CNN and Long Short-Term Memory (LSTM) networks, respectively. In addition, we conduct experiments with gated recurrent units (GRUs) and an attention mechanism. A dataset for dense image captioning in Hindi has been created by manually translating the well-known Visual Genome dataset from English to Hindi. Experiments on this newly constructed dataset demonstrate the efficacy of the proposed method over state-of-the-art methods.
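
To make the pipeline concrete, the following is a minimal sketch, not the authors’ released code, of the architecture the abstract describes: a pretrained Faster R-CNN localizes regions, and an LSTM decoder conditioned on each region’s pooled feature emits a Hindi caption token by token. It assumes PyTorch and torchvision; all dimensions and the helper `decode_region` are illustrative.

```python
import torch
import torch.nn as nn
import torchvision


class DenseCaptioner(nn.Module):
    """Illustrative dense captioner: Faster R-CNN regions + per-region LSTM decoder."""

    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, feat_dim=1024):
        super().__init__()
        # Pretrained detector supplies region proposals and boxes
        # (torchvision >= 0.13; use pretrained=True on older versions).
        self.detector = torchvision.models.detection.fasterrcnn_resnet50_fpn(
            weights="DEFAULT")
        self.embed = nn.Embedding(vocab_size, embed_dim)  # Hindi vocabulary
        self.init_h = nn.Linear(feat_dim, hidden_dim)     # region feature -> h0
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def decode_region(self, region_feat, captions):
        # region_feat: (B, feat_dim) RoI-pooled feature of one detected box;
        # captions:    (B, T) token ids of the ground-truth Hindi caption.
        h0 = torch.tanh(self.init_h(region_feat)).unsqueeze(0)  # (1, B, hidden)
        c0 = torch.zeros_like(h0)                               # empty cell state
        emb = self.embed(captions)                              # (B, T, embed)
        hidden, _ = self.lstm(emb, (h0, c0))                    # (B, T, hidden)
        return self.out(hidden)                                 # per-step vocab logits
```

Replacing `nn.LSTM` with `nn.GRU` (and dropping the cell state `c0`) gives the GRU variant the abstract mentions, and Bahdanau-style attention over all region features can replace the fixed initial state; both correspond to the additional experiments the abstract reports.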




    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 2
February 2023, 624 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3572719

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 27 December 2022
    Online AM: 22 August 2022
    Accepted: 16 May 2022
    Revised: 26 March 2022
    Received: 21 November 2021
    Published in TALLIP Volume 22, Issue 2


    Author Tags

    1. Dense image captioning
    2. attention
    3. deep-learning
    4. Hindi

    Qualifiers

    • Research-article
    • Refereed

    Funding Sources

    • Young Faculty Research Fellowship program of the Visvesvaraya Ph.D. Scheme, Ministry of Electronics & Information Technology, Government of India
    • Digital India Corporation (formerly Media Lab Asia)


    Cited By

    • (2024) A Deep Learning-Based Efficient Image Captioning Approach for Hindi Language. In Developments Towards Next Generation Intelligent Systems for Sustainable Development, 225–246. DOI: 10.4018/979-8-3693-5643-2.ch009. Online publication date: 5-Apr-2024.
    • (2024) Exploring Bengali Image Descriptions through the combination of diverse CNN Architectures and Transformer Decoders. Turkish Journal of Engineering. DOI: 10.31127/tuje.1507442. Online publication date: 22-Aug-2024.
    • (2024) A Survey on Image Captioning Using Encoder-Decoder Based Deep Learning Models. In 2024 Parul International Conference on Engineering and Technology (PICET), 1–5. DOI: 10.1109/PICET60765.2024.10716063. Online publication date: 3-May-2024.
    • (2024) A Multimodal Framework For Satire Vs. Sarcasm Detection. In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–7. DOI: 10.1109/ICCCNT61001.2024.10726230. Online publication date: 24-Jun-2024.
    • (2024) Multi-News Summarization Using Multi-Objective Differential Evolution Framework. In 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–7. DOI: 10.1109/ICCCNT61001.2024.10723904. Online publication date: 24-Jun-2024.
    • (2024) Generating Image Captions in Hindi Based on Encoder-Decoder Based Deep Learning Techniques. In Reliability Engineering for Industrial Processes, 81–94. DOI: 10.1007/978-3-031-55048-5_6. Online publication date: 23-Apr-2024.
    • (2023) A Multiheaded Attention-Based Model for Generating Hindi Captions. In Proceedings of Third Emerging Trends and Technologies on Intelligent Systems, 677–684. DOI: 10.1007/978-981-99-3963-3_51. Online publication date: 20-Sep-2023.
    • (2023) DTDT: Highly Accurate Dense Text Line Detection in Historical Documents via Dynamic Transformer. In Document Analysis and Recognition - ICDAR 2023, 381–396. DOI: 10.1007/978-3-031-41676-7_22. Online publication date: 21-Aug-2023.
