Abstract
Image captioning frameworks usually follow an encoder-decoder paradigm: the encoder receives abstract image feature vectors as input, and the decoder performs language modeling. Most prominent architectures today employ features from region proposals produced by an object detection module. In this work, we propose a novel architecture for image captioning that integrates an object detection module with a transformer encoder and uses GPT-2 (Generative Pre-trained Transformer) as the decoder. The encoder exploits the spatial relationships among detected objects. We introduce this methodology for image caption generation in Hindi, which is widely spoken in South Asia, is the world's third most spoken language, and is India's official language. In terms of BLEU scores, the proposed approach outperforms the other baselines. Human assessment of adequacy and fluency further confirms the efficacy of the proposed approach in generating correct captions.
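The geometric attention the abstract refers to can be illustrated with a minimal sketch. This is not the paper's implementation; it assumes the relative-geometry encoding popularized by relation networks for object detection (reference [19] in this paper's bibliography): pairwise log-ratios of box offsets and sizes are projected by a learned weight vector (`w_g` here, a hypothetical name) and used to bias the appearance-based attention logits.

```python
import numpy as np

def box_geometry_features(boxes):
    """Pairwise geometric features between boxes given as (x, y, w, h):
    log-scaled center offsets and log size ratios, shape (N, N, 4)."""
    x, y, w, h = boxes.T
    dx = np.log(np.abs(x[:, None] - x[None, :]) / w[:, None] + 1e-3)
    dy = np.log(np.abs(y[:, None] - y[None, :]) / h[:, None] + 1e-3)
    dw = np.log(w[None, :] / w[:, None])
    dh = np.log(h[None, :] / h[:, None])
    return np.stack([dx, dy, dw, dh], axis=-1)

def geometric_attention(features, boxes, w_g):
    """Scaled dot-product self-attention over region features whose
    logits are biased by a learned projection of box geometry.
    `w_g` (shape (4,)) stands in for the learned geometry weights."""
    d = features.shape[-1]
    logits = features @ features.T / np.sqrt(d)          # appearance term
    geo = np.maximum(box_geometry_features(boxes) @ w_g, 0.0)  # ReLU-gated geometry term
    logits = logits + np.log(geo + 1e-6)                 # combine, relation-network style
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)       # row-wise softmax
    return weights @ features                            # geometry-aware region features
```

In a full model, the refined region features would feed the cross-attention layers of the GPT-2 decoder; here the sketch only shows how box geometry modulates encoder self-attention.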
Index Terms
- GAGPT-2: A Geometric Attention-based GPT-2 Framework for Image Captioning in Hindi