
An Integrated Hybrid CNN–RNN Model for Visual Description and Generation of Captions


Abstract

Video captioning is currently regarded as one of the simplest ways to index and search visual data efficiently, and deep learning architectures now make it practical to caption video frames automatically. Past research has focused largely on image captioning; however, the generation of high-quality captions with suitable semantics for different scenes has not yet been achieved. This work therefore aims to generate well-defined, meaningful captions for images and videos by combining convolutional neural networks (CNNs) and recurrent neural networks (RNNs). Starting from the available dataset, image and video features were extracted with a CNN. The extracted feature vectors were then used to build a language model based on long short-term memory (LSTM) units that generates captions word by word. Caption words were predicted through a softmax output layer, and the generated captions were evaluated using predefined evaluation metrics. The experimental results demonstrate that the proposed model outperforms existing benchmark models.
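To make the pipeline described in the abstract concrete, the sketch below illustrates a CNN encoder feeding an LSTM decoder trained through a softmax (cross-entropy) output, written in PyTorch. The tiny convolutional stack, layer sizes, vocabulary size, and random training data are illustrative assumptions only; they are not the authors' actual architecture or hyperparameters.

```python
# Minimal sketch of a CNN encoder + LSTM decoder captioner with a softmax
# output, assuming an illustrative architecture (not the paper's exact model).
import torch
import torch.nn as nn

class CNNEncoder(nn.Module):
    """Extracts a fixed-length feature vector from an image or video frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),
        )
        self.fc = nn.Linear(64, feat_dim)

    def forward(self, images):                   # images: (B, 3, H, W)
        x = self.conv(images).flatten(1)         # (B, 64)
        return self.fc(x)                        # (B, feat_dim)

class LSTMDecoder(nn.Module):
    """Generates caption word logits conditioned on the image features."""
    def __init__(self, vocab_size=1000, feat_dim=256, hid_dim=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, feat_dim)
        self.lstm = nn.LSTM(feat_dim, hid_dim, batch_first=True)
        self.out = nn.Linear(hid_dim, vocab_size)  # softmax is applied in the loss

    def forward(self, features, words):          # words: (B, T) token ids
        tokens = self.embed(words)               # (B, T, feat_dim)
        # Prepend the image feature as the first step of the input sequence.
        inputs = torch.cat([features.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)            # (B, T+1, hid_dim)
        return self.out(hidden)                  # (B, T+1, vocab_size)

# One illustrative training step on random data
# (cross-entropy loss = softmax + negative log-likelihood).
encoder, decoder = CNNEncoder(), LSTMDecoder()
images = torch.randn(4, 3, 64, 64)              # dummy frames
captions = torch.randint(0, 1000, (4, 12))      # dummy word ids
logits = decoder(encoder(images), captions[:, :-1])   # (4, 12, 1000)
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 1000), captions.reshape(-1))
loss.backward()
```

At inference time the same decoder would be run step by step, feeding each predicted word back in until an end-of-caption token is produced; that loop is omitted here for brevity.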



Acknowledgements

This work has been partially supported by National Funding from the FCT—Fundação para a Ciência e a Tecnologia through the UID/EEA/500008/2019 Project—and by the Brazilian National Council for Research and Development (CNPq) via Grant No. 309335/2017-5.

Author information


Corresponding author

Correspondence to Joel J. P. C. Rodrigues.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.


About this article


Cite this article

Khamparia, A., Pandey, B., Tiwari, S. et al. An Integrated Hybrid CNN–RNN Model for Visual Description and Generation of Captions. Circuits Syst Signal Process 39, 776–788 (2020). https://doi.org/10.1007/s00034-019-01306-8

