
An Integrated Hybrid CNN–RNN Model for Visual Description and Generation of Captions


Abstract

Video captioning is currently regarded as one of the simplest ways to index and search visual data efficiently, and deep learning architectures now make it practical to caption video frames automatically. Past research has focused largely on image captioning; however, the generation of high-quality captions with suitable semantics for different scenes has not yet been achieved. This work therefore aims to generate well-defined, meaningful captions for images and videos by combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Starting from the available dataset, image and video features were extracted with a CNN. The extracted feature vectors were then used to build a language model based on long short-term memory (LSTM) units that generates captions word by word. Caption words were predicted through a softmax output layer, and the generated captions were evaluated using predefined evaluation metrics. The experimental results demonstrate that the proposed model outperforms existing benchmark models.
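To make the pipeline described in the abstract concrete, the sketch below illustrates a CNN encoder feeding an LSTM decoder trained through a softmax (cross-entropy) output, written in PyTorch. The tiny convolutional stack, layer sizes, vocabulary size, and random training data are illustrative assumptions only; they are not the authors' actual architecture or hyperparameters.

```python
# Minimal sketch of a CNN encoder + LSTM decoder captioner with a softmax
# output, assuming an illustrative architecture (not the paper's exact model).
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Extracts a fixed-length feature vector from an image or video frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, images):                   # images: (B, 3, H, W)
        x = self.conv(images).flatten(1)         # (B, 64)
        return self.fc(x)                        # (B, feat_dim)

class LSTMDecoder(nn.Module):
    """Generates caption word logits conditioned on the image features."""
    def __init__(self, vocab_size=1000, feat_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)  # softmax is applied in the loss

    def forward(self, features, words):          # words: (B, T) token ids
        tokens = self.embed(words)               # (B, T, feat_dim)
        # Prepend the image feature as the first step of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)            # (B, T+1, hid_dim)
        return self.out(hidden)                  # (B, T+1, vocab_size)

# One illustrative training step on random data
# (cross-entropy loss = softmax + negative log-likelihood).
encoder, decoder = CNNEncoder(), LSTMDecoder()
images = torch.randn(4, 3, 64, 64)              # dummy frames
captions = torch.randint(0, 1000, (4, 12))      # dummy word ids
logits = decoder(encoder(images), captions[:, :-1])   # (4, 12, 1000)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), captions.reshape(-1))
loss.backward()
```

At inference time the same decoder would be run step by step, feeding each predicted word back in until an end-of-caption token is produced; that loop is omitted here for brevity.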



Acknowledgements

This work has been partially supported by National Funding from the FCT—Fundação para a Ciência e a Tecnologia through the UID/EEA/500008/2019 Project—and by the Brazilian National Council for Research and Development (CNPq) via Grant No. 309335/2017-5.

Author information


Corresponding author

Correspondence to Joel J. P. C. Rodrigues.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Khamparia, A., Pandey, B., Tiwari, S. et al. An Integrated Hybrid CNN–RNN Model for Visual Description and Generation of Captions. Circuits Syst Signal Process 39, 776–788 (2020). https://doi.org/10.1007/s00034-019-01306-8

