A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models

  • Application of soft computing
  • Published in: Soft Computing

Abstract

Image captioning is the task of using computers to interpret the information in images and generate descriptive text. Since its emergence, the use of deep learning to interpret image content and create descriptions has been widely researched. Nevertheless, existing strategies do not identify all instances that depict conceptual ideas; in reality, the vast majority of instances are irrelevant to the matching task, and the degree of similarity is determined by only a few semantically relevant ones. These duplicate instances can be regarded as noise, since they obstruct the matching of the few meaningful instances and add to the model's computational cost. Existing schemes rely on traditional convolutional neural networks (CNNs), whose structure limits captioning effectiveness. Furthermore, present approaches frequently require additional target-recognition algorithms or costly human labeling when information must be extracted. For image captioning, this research presents a multimodal feature fusion-based deep learning model. The encoding layer uses a mask recurrent neural network (mask RNN, built on Faster R-CNN), a long short-term memory (LSTM) network serves as the decoder, and the descriptive text is constructed from the decoder output. The model parameters are optimized through gradient-based optimization. In the decoding layer, a dense attention mechanism minimizes interference from non-salient data and preferentially feeds the relevant information to the decoding stage. The model is trained on input images and, given a new image, produces captions that closely describe it. Several datasets are used to evaluate the model's precision and the fluency of the language it acquires by analyzing picture descriptions, and the results demonstrate that the model consistently produces correct descriptions of input images. To measure effectiveness, classification scores are computed for the system: with a batch size of 512 and 100 training epochs, the proposed system shows a 95% increase in performance. The experimental results on generic images validate the model's capacity to comprehend images and generate text. The method is implemented using Python frameworks and evaluated with performance metrics such as PSNR, RMSE, SSIM, accuracy, recall, F1-score, and precision.
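The authors' implementation is not reproduced on this page, so the following is only a minimal sketch of the encoder-decoder-with-attention pattern the abstract describes: a ResNet-50 backbone (the feature extractor underlying Faster R-CNN) standing in for the mask RNN encoder, an LSTM decoder, and additive attention over region features. Every module name, dimension, and hyperparameter below is an illustrative assumption, not the paper's actual model.

```python
# Minimal sketch (NOT the authors' code): a Faster R-CNN-style backbone as the
# encoder, an LSTM decoder, and additive attention over region features.
# All names, sizes, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

class RegionEncoder(nn.Module):
    """Stand-in encoder: a ResNet-50 backbone (as used by Faster R-CNN)
    whose spatial feature map is flattened into a set of region vectors."""
    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 2048, 7, 7)
        self.project = nn.Linear(2048, feat_dim)

    def forward(self, images):                      # images: (B, 3, 224, 224)
        fmap = self.backbone(images)                # (B, 2048, 7, 7)
        regions = fmap.flatten(2).transpose(1, 2)   # (B, 49, 2048)
        return self.project(regions)                # (B, 49, feat_dim)

class AttentionLSTMDecoder(nn.Module):
    """LSTM decoder with additive attention, so each step preferentially
    attends to salient region features, as the abstract describes."""
    def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn_feat = nn.Linear(feat_dim, hidden_dim)
        self.attn_hid = nn.Linear(hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, captions):           # captions: (B, T) token ids
        B, T = captions.shape
        h = regions.new_zeros(B, self.lstm.hidden_size)
        c = regions.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # attention weights over the 49 region vectors
            scores = self.attn_score(torch.tanh(
                self.attn_feat(regions) + self.attn_hid(h).unsqueeze(1)))  # (B, 49, 1)
            context = (scores.softmax(dim=1) * regions).sum(dim=1)         # (B, feat_dim)
            h, c = self.lstm(torch.cat([self.embed(captions[:, t]), context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, T, vocab_size)

# One gradient-based optimization step, as the abstract mentions (dummy data).
encoder, decoder = RegionEncoder(), AttentionLSTMDecoder(vocab_size=10000)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
images = torch.randn(2, 3, 224, 224)
caps = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), caps[:, :-1])     # predict the next token at each step
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), caps[:, 1:].reshape(-1))
loss.backward(); optimizer.step()
```

The softmax attention weights are what let the decoder down-weight non-salient regions at each generation step, which is the role the abstract assigns to its dense attention mechanism.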



Data availability

Enquiries about data availability should be directed to the authors.


Funding

The authors have not disclosed any funding.

Author information

Corresponding author

Correspondence to Suresh Muthusamy.

Ethics declarations

Competing interests

The authors have not disclosed any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Thangavel, K., Palanisamy, N., Muthusamy, S. et al. A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models. Soft Comput 27, 14205–14218 (2023). https://doi.org/10.1007/s00500-023-08448-7
