Abstract
Image captioning generates a textual description of an input image by combining computer vision and natural language processing. In recent years, deep learning approaches have shown promise for this task. This research introduces a novel image captioning architecture comprising a dual self-attention fused encoder-decoder framework. The VGG16 Hybrid Places 1365 (V16HP1365) encoder captures diverse visual features from images, enhancing the quality of image representations, while a Gated Recurrent Unit (GRU) serves as the decoder for word-level language modeling. Additionally, the dual self-attention network embedded in the architecture captures contextual image information, enabling accurate content descriptions and relationship understanding. Experimental evaluations on the COCO dataset show superior performance, surpassing existing methods on standard captioning quality metrics. The approach holds potential for applications such as aiding the visually impaired and advancing content retrieval. Future work aims to extend the model to support multilingual captioning.
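To make the data flow concrete, below is a minimal PyTorch sketch of how such a dual self-attention fused encoder-decoder could be wired together. It is an illustration under stated assumptions, not the authors' implementation: a stock ImageNet-pretrained VGG16 trunk stands in for the V16HP1365 encoder, the two stacked self-attention blocks with residual fusion are one plausible reading of "dual self-attention fused", and all dimensions and the vocabulary size are placeholders.

```python
# Minimal sketch (not the paper's code): VGG16 trunk in place of V16HP1365,
# two stacked self-attention blocks fused residually, and a GRU decoder.
import torch
import torch.nn as nn
from torchvision.models import vgg16


class DualSelfAttentionCaptioner(nn.Module):
    def __init__(self, vocab_size, embed_dim=512, hidden_dim=512, num_heads=8):
        super().__init__()
        # Encoder: VGG16 convolutional trunk (ImageNet weights download on
        # first use; pass weights=None for a randomly initialised trunk).
        # For a 224x224 input it yields (512, 7, 7), i.e. 49 image regions.
        self.encoder = vgg16(weights="IMAGENET1K_V1").features
        self.proj = nn.Linear(512, embed_dim)
        # "Dual" self-attention: two stacked self-attention blocks over regions.
        self.attn1 = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        self.attn2 = nn.MultiheadAttention(embed_dim, num_heads, batch_first=True)
        # Decoder: GRU over word embeddings concatenated with visual context.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.gru = nn.GRU(embed_dim * 2, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # (B, 512, 7, 7) -> (B, 49, 512): one feature vector per region.
        feats = self.proj(self.encoder(images).flatten(2).transpose(1, 2))
        # Two self-attention passes; fuse both with the raw features residually.
        a1, _ = self.attn1(feats, feats, feats)
        a2, _ = self.attn2(a1, a1, a1)
        fused = feats + a1 + a2
        ctx = fused.mean(dim=1, keepdim=True)      # pooled visual context
        emb = self.embed(captions)                 # (B, T, embed_dim)
        ctx = ctx.expand(-1, emb.size(1), -1)      # repeat context per step
        hidden, _ = self.gru(torch.cat([emb, ctx], dim=-1))
        return self.out(hidden)                    # (B, T, vocab_size) logits


model = DualSelfAttentionCaptioner(vocab_size=10000)
logits = model(torch.randn(2, 3, 224, 224), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 12, 10000])
```

Feeding a batch of 224×224 images and token IDs through the model yields per-step vocabulary logits, from which captions can be decoded greedily or with beam search.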

Data availability
Data sharing is not applicable to this article as no new data were created or analyzed in this study.
Code availability
Not applicable.
Author information
Contributions
TJ agreed on the content of the study and the methodology. TJ, MP, and PT collected all the data for analysis and completed the analysis based on the agreed steps. The results and conclusions were discussed and written jointly. All authors read and approved the final manuscript.
Ethics declarations
Human and animal rights
This article does not contain any studies with human or animal subjects performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Conflict of interest
The authors declare that they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Jaiswal, T., Pandey, M. & Tripathi, P. Advancing image captioning with V16HP1365 encoder and dual self-attention network. Multimed Tools Appl 83, 80701–80725 (2024). https://doi.org/10.1007/s11042-024-18467-7