
Dynamic Convolution-based Encoder-Decoder Framework for Image Captioning in Hindi

Published: 24 March 2023

Abstract

In sequence-to-sequence modeling tasks such as image captioning, machine translation, and visual question answering, encoder-decoder architectures are the state of the art. In image captioning, the encoder, a convolutional neural network (CNN), maps an input image to a fixed-dimensional vector representation, while the decoder, a recurrent neural network, performs language modeling and generates the target description. Conventional CNNs apply the same operation to every pixel, even though not all pixels are equally important. To address this, the proposed method uses a dynamic convolution-based encoder for image feature extraction, a Long Short-Term Memory (LSTM) network as the decoder for language modeling, and X-Linear attention to make the system robust. Since the encoder, attention mechanism, and decoder are all central to image captioning, we experiment with several variants of each. Most existing work on image captioning targets English; we propose a novel approach for generating captions from images in Hindi. Hindi, widely spoken across India and South Asia, is the fourth most-spoken language globally and the official language of India. The Hindi image captioning dataset was created manually by translating the popular MSCOCO dataset from English to Hindi. In terms of BLEU scores, the proposed method outperforms several baselines, and manual human assessment of the adequacy and fluency of the generated captions further confirms that it produces good-quality captions.
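To make the encoder side concrete, below is a minimal PyTorch sketch of one common dynamic-convolution formulation: a lightweight attention head computes a per-image mixture over K candidate kernels, so the effective filter adapts to the input instead of applying one fixed operation everywhere. This is an illustrative reading of the idea, not the paper's exact operator; the class name, `num_kernels`, and the initialization are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv2d(nn.Module):
    """Dynamic convolution sketch: attention-weighted mixture of K kernels."""
    def __init__(self, in_ch, out_ch, kernel_size=3, num_kernels=4):
        super().__init__()
        self.in_ch, self.out_ch, self.ks = in_ch, out_ch, kernel_size
        # K candidate kernels; the attention head picks a per-image mixture.
        self.weight = nn.Parameter(
            0.02 * torch.randn(num_kernels, out_ch, in_ch, kernel_size, kernel_size))
        self.attn = nn.Sequential(            # global pool -> K mixing logits
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(in_ch, num_kernels))

    def forward(self, x):                     # x: (B, in_ch, H, W)
        B = x.size(0)
        alpha = F.softmax(self.attn(x), dim=1)                  # (B, K)
        # Per-sample aggregated kernel: sum_k alpha[b, k] * weight[k].
        w = torch.einsum('bk,koihw->boihw', alpha, self.weight)
        w = w.reshape(B * self.out_ch, self.in_ch, self.ks, self.ks)
        # Grouped-convolution trick: each sample convolves with its own kernel.
        y = F.conv2d(x.reshape(1, B * self.in_ch, *x.shape[2:]),
                     w, padding=self.ks // 2, groups=B)
        return y.reshape(B, self.out_ch, *y.shape[2:])

# Example: encode a batch of two RGB images into 64-channel feature maps.
feats = DynamicConv2d(3, 64)(torch.randn(2, 3, 224, 224))
print(feats.shape)  # torch.Size([2, 64, 224, 224])
```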
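On the attention side, the sketch below is a simplified single-head version of X-Linear attention (after Pan et al., 2020): a bilinear, elementwise-product query-key fusion followed by spatial softmax attention and a channel-wise sigmoid gate, wired into one LSTM decoding step. The projection sizes, the ELU activation, and the decoder wiring are illustrative assumptions, not the authors' exact configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class XLinearAttention(nn.Module):
    """Single-head simplification: bilinear fusion + spatial/channel attention."""
    def __init__(self, d):
        super().__init__()
        self.wq, self.wk, self.wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.spatial = nn.Linear(d, 1)   # per-region attention logit
        self.channel = nn.Linear(d, d)   # channel-wise gate

    def forward(self, q, regions):       # q: (B, d), regions: (B, N, d)
        # Second-order query-key interaction via an elementwise product.
        b = F.elu(self.wq(q).unsqueeze(1) * self.wk(regions))   # (B, N, d)
        beta = F.softmax(self.spatial(b), dim=1)                # (B, N, 1) spatial
        gate = torch.sigmoid(self.channel((beta * b).sum(1)))   # (B, d) channel
        return gate * (beta * self.wv(regions)).sum(1)          # attended context

# One LSTM decoding step conditioned on the attended image context.
d, vocab = 512, 8000
lstm = nn.LSTMCell(2 * d, d)             # input: [word embedding; context]
attend, embed, out = XLinearAttention(d), nn.Embedding(vocab, d), nn.Linear(d, vocab)

regions = torch.randn(2, 49, d)          # e.g., a 7x7 encoder feature grid
h, c = torch.zeros(2, d), torch.zeros(2, d)
word = torch.tensor([1, 1])              # <start> token ids
ctx = attend(h, regions)
h, c = lstm(torch.cat([embed(word), ctx], dim=1), (h, c))
logits = out(h)                          # next-word distribution over the vocab
```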
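Finally, the BLEU-based comparison mentioned above can be reproduced at small scale with NLTK's corpus-level BLEU; the tokenized Hindi caption pair below is a made-up stand-in for a generated caption and its reference.

```python
from nltk.translate.bleu_score import corpus_bleu, SmoothingFunction

# Hypothetical example pair: per-image reference caption(s) vs. generated caption.
references = [[["एक", "आदमी", "घोड़े", "पर", "सवार", "है"]]]
hypotheses = [["एक", "आदमी", "घोड़े", "पर", "बैठा", "है"]]

smooth = SmoothingFunction().method1
for n in range(1, 5):   # BLEU-1 .. BLEU-4, as commonly reported
    weights = tuple(1.0 / n for _ in range(n))
    score = corpus_bleu(references, hypotheses,
                        weights=weights, smoothing_function=smooth)
    print(f"BLEU-{n}: {score:.3f}")
```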





    Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 22, Issue 4
April 2023, 682 pages
ISSN: 2375-4699
EISSN: 2375-4702
DOI: 10.1145/3588902

    Publisher

Association for Computing Machinery, New York, NY, United States

    Publication History

    Published: 24 March 2023
    Online AM: 26 December 2022
    Accepted: 12 November 2022
    Revised: 28 June 2022
    Received: 10 December 2021
    Published in TALLIP Volume 22, Issue 4


    Author Tags

    1. Hindi
    2. dynamic convolution
    3. attention
    4. deep learning

    Qualifiers

    • Research-article

    Funding Sources

    • Young Faculty Research Fellowship program of Visvesvaraya Ph.D. Scheme of Ministry of Electronics and Information Technology, Government of India

    Cited By

• (2024) A Deep Learning-Based Efficient Image Captioning Approach for Hindi Language. Developments Towards Next Generation Intelligent Systems for Sustainable Development, 225–246. DOI: 10.4018/979-8-3693-5643-2.ch009. Online publication date: 5-Apr-2024.
• (2024) A Survey on Image Captioning Using Encoder-Decoder Based Deep Learning Models. 2024 Parul International Conference on Engineering and Technology (PICET), 1–5. DOI: 10.1109/PICET60765.2024.10716063. Online publication date: 3-May-2024.
• (2024) A Multimodal Framework for Satire vs. Sarcasm Detection. 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–7. DOI: 10.1109/ICCCNT61001.2024.10726230. Online publication date: 24-Jun-2024.
• (2024) Multi-News Summarization Using Multi-Objective Differential Evolution Framework. 2024 15th International Conference on Computing Communication and Networking Technologies (ICCCNT), 1–7. DOI: 10.1109/ICCCNT61001.2024.10723904. Online publication date: 24-Jun-2024.
• (2024) APEDM: A New Voice Casting System Using Acoustic–Phonetic Encoder-Decoder Mapping. Multimedia Tools and Applications. DOI: 10.1007/s11042-024-20496-1. Online publication date: 23-Dec-2024.
• (2024) Deep Learning Approach to Compose Short Stories Based on Online Hospital Reviews of Tirunelveli Region. Proceedings of the Fifth International Conference on Trends in Computational and Cognitive Engineering, 3–12. DOI: 10.1007/978-981-97-1923-5_1. Online publication date: 14-Jun-2024.
• (2024) Feature Fusion and Multi-head Attention Based Hindi Captioner. Computer Vision and Image Processing, 479–487. DOI: 10.1007/978-3-031-58181-6_40. Online publication date: 3-Jul-2024.
• (2024) Generating Image Captions in Hindi Based on Encoder-Decoder Based Deep Learning Techniques. Reliability Engineering for Industrial Processes, 81–94. DOI: 10.1007/978-3-031-55048-5_6. Online publication date: 23-Apr-2024.
• (2023) Performance Analysis of Image Caption Generation Techniques Using CNN-Based Encoder–Decoder Architecture. Data Science and Network Engineering, 301–313. DOI: 10.1007/978-981-99-6755-1_23. Online publication date: 3-Nov-2023.
• (2023) DTDT: Highly Accurate Dense Text Line Detection in Historical Documents via Dynamic Transformer. Document Analysis and Recognition - ICDAR 2023, 381–396. DOI: 10.1007/978-3-031-41676-7_22. Online publication date: 21-Aug-2023.
