A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models

  • Application of soft computing
  • Published in: Soft Computing

Abstract

Image captioning is the task of using computers to interpret the information in images and generate descriptive text. Since its emergence, the use of deep learning to interpret image content and create descriptions has been widely researched. Nevertheless, existing strategies do not identify all instances that depict conceptual ideas; in reality, the vast majority of instances are irrelevant to the matching task, and the degree of similarity is determined by only a few semantically relevant ones. These duplicate instances can be regarded as noise, since they obstruct the matching of the few meaningful instances and add to the model's computational cost. Existing schemes rely on traditional convolutional neural networks (CNNs), whose structure limits captioning effectiveness. Furthermore, present approaches frequently require additional target-recognition algorithms or costly human labeling when information must be extracted. For image captioning, this research presents a multimodal feature fusion-based deep learning model. The encoding layer uses a mask recurrent neural network (mask RNN, built on Faster R-CNN), a long short-term memory (LSTM) network serves as the decoder, and the descriptive text is constructed from the decoder output. The model parameters are optimized through gradient-based optimization. In the decoding layer, a dense attention mechanism minimizes interference from non-salient data and preferentially feeds the relevant information to the decoding stage. The model is trained on input images and, given a new image, produces captions that closely describe it. Several datasets are used to evaluate the model's precision and the fluency of the language it acquires by analyzing picture descriptions, and the results demonstrate that the model consistently produces correct descriptions of input images. To measure effectiveness, classification scores are computed for the system: with a batch size of 512 and 100 training epochs, the proposed system shows a 95% increase in performance. The experimental results on generic images validate the model's capacity to comprehend images and generate text. The method is implemented using Python frameworks and evaluated with performance metrics such as PSNR, RMSE, SSIM, accuracy, recall, F1-score, and precision.
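The authors' implementation is not reproduced on this page, so the following is only a minimal sketch of the encoder-decoder-with-attention pattern the abstract describes: a ResNet-50 backbone (the feature extractor underlying Faster R-CNN) standing in for the mask RNN encoder, an LSTM decoder, and additive attention over region features. Every module name, dimension, and hyperparameter below is an illustrative assumption, not the paper's actual model.

```python
# Minimal sketch (NOT the authors' code): a Faster R-CNN-style backbone as the
# encoder, an LSTM decoder, and additive attention over region features.
# All names, sizes, and hyperparameters below are illustrative assumptions.
import torch
import torch.nn as nn
import torchvision

class RegionEncoder(nn.Module):
    """Stand-in encoder: a ResNet-50 backbone (as used by Faster R-CNN)
    whose spatial feature map is flattened into a set of region vectors."""
    def __init__(self, feat_dim=512):
        super().__init__()
        resnet = torchvision.models.resnet50(weights=None)
        self.backbone = nn.Sequential(*list(resnet.children())[:-2])  # -> (B, 2048, 7, 7)
        self.project = nn.Linear(2048, feat_dim)

    def forward(self, images):                      # images: (B, 3, 224, 224)
        fmap = self.backbone(images)                # (B, 2048, 7, 7)
        regions = fmap.flatten(2).transpose(1, 2)   # (B, 49, 2048)
        return self.project(regions)                # (B, 49, feat_dim)

class AttentionLSTMDecoder(nn.Module):
    """LSTM decoder with additive attention, so each step preferentially
    attends to salient region features, as the abstract describes."""
    def __init__(self, vocab_size, feat_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn_feat = nn.Linear(feat_dim, hidden_dim)
        self.attn_hid = nn.Linear(hidden_dim, hidden_dim)
        self.attn_score = nn.Linear(hidden_dim, 1)
        self.lstm = nn.LSTMCell(embed_dim + feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, regions, captions):           # captions: (B, T) token ids
        B, T = captions.shape
        h = regions.new_zeros(B, self.lstm.hidden_size)
        c = regions.new_zeros(B, self.lstm.hidden_size)
        logits = []
        for t in range(T):
            # attention weights over the 49 region vectors
            scores = self.attn_score(torch.tanh(
                self.attn_feat(regions) + self.attn_hid(h).unsqueeze(1)))  # (B, 49, 1)
            context = (scores.softmax(dim=1) * regions).sum(dim=1)         # (B, feat_dim)
            h, c = self.lstm(torch.cat([self.embed(captions[:, t]), context], dim=1), (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)           # (B, T, vocab_size)

# One gradient-based optimization step, as the abstract mentions (dummy data).
encoder, decoder = RegionEncoder(), AttentionLSTMDecoder(vocab_size=10000)
optimizer = torch.optim.Adam(list(encoder.parameters()) + list(decoder.parameters()), lr=1e-4)
images = torch.randn(2, 3, 224, 224)
caps = torch.randint(0, 10000, (2, 12))
logits = decoder(encoder(images), caps[:, :-1])     # predict the next token at each step
loss = nn.functional.cross_entropy(logits.reshape(-1, 10000), caps[:, 1:].reshape(-1))
loss.backward(); optimizer.step()
```

The softmax attention weights are what let the decoder down-weight non-salient regions at each generation step, which is the role the abstract assigns to its dense attention mechanism.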



Data availability

Enquiries about data availability should be directed to the authors.


Funding

The authors have not disclosed any funding.

Author information

Corresponding author

Correspondence to Suresh Muthusamy.

Ethics declarations

Competing interests

The authors have not disclosed any competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Thangavel, K., Palanisamy, N., Muthusamy, S. et al. A novel method for image captioning using multimodal feature fusion employing mask RNN and LSTM models. Soft Comput 27, 14205–14218 (2023). https://doi.org/10.1007/s00500-023-08448-7
