Abstract
Automatic caption generation for visual content has recently emerged as a challenging research field owing to its impact on areas such as computer vision, information retrieval, autonomous vehicles, and natural language processing. Traditional models mainly focus on a single aspect of the visual features when generating descriptions. The proposed model combines spatial information about salient objects, which captures their detailed characteristics, with a scene-category feature that encodes the general setting of the image. These extracted features are processed by a topic-aware, attention-based language model to generate human-like captions. The proposed model is evaluated on benchmark image-captioning datasets and compared against state-of-the-art methods; the experimental results show that it performs competitively with captioning models proposed in the recent literature.
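The pipeline the abstract outlines can be illustrated with a short PyTorch sketch. The code below is a minimal, hypothetical reconstruction under stated assumptions, not the authors' exact architecture: it assumes Faster R-CNN-style region features with normalized box coordinates as the spatial stream, a Places-style scene-category vector, a pre-computed topic vector, and a single-layer attention LSTM decoder; all module names and dimensions are illustrative.

# Hypothetical sketch of the described pipeline: object spatial features
# and a scene-category embedding are fused with a topic vector that
# conditions a step-wise attention LSTM decoder. All names, dimensions,
# and the fusion scheme are assumptions, not the paper's architecture.
import torch
import torch.nn as nn

class TopicGuidedCaptioner(nn.Module):
    def __init__(self, vocab_size, obj_dim=2048, scene_dim=365,
                 topic_dim=80, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.obj_proj = nn.Linear(obj_dim + 4, hidden_dim)   # +4: normalized box coords
        self.scene_proj = nn.Linear(scene_dim, hidden_dim)
        self.topic_proj = nn.Linear(topic_dim, hidden_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.attn = nn.Linear(hidden_dim * 2, 1)             # additive attention score
        self.lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, obj_feats, boxes, scene_feats, topic, captions):
        # obj_feats: (B, R, obj_dim); boxes: (B, R, 4); scene_feats: (B, scene_dim)
        # topic: (B, topic_dim); captions: (B, T) token ids
        regions = self.obj_proj(torch.cat([obj_feats, boxes], dim=-1))  # (B, R, H)
        scene = self.scene_proj(scene_feats)                            # (B, H)
        h = torch.tanh(scene + self.topic_proj(topic))                  # topic-aware init state
        c = torch.zeros_like(h)
        logits = []
        for t in range(captions.size(1)):
            # attend over object regions using the current hidden state
            q = h.unsqueeze(1).expand_as(regions)
            alpha = torch.softmax(self.attn(torch.cat([regions, q], -1)), dim=1)
            ctx = (alpha * regions).sum(dim=1)                          # (B, H)
            x = torch.cat([self.embed(captions[:, t]), ctx], dim=-1)
            h, c = self.lstm(x, (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)                               # (B, T, vocab)

# Example forward pass with random tensors: batch of 2, 5 regions, 8-token captions.
model = TopicGuidedCaptioner(vocab_size=1000)
logits = model(torch.randn(2, 5, 2048), torch.rand(2, 5, 4),
               torch.randn(2, 365), torch.randn(2, 80),
               torch.randint(0, 1000, (2, 8)))
print(logits.shape)  # torch.Size([2, 8, 1000])

In this sketch the topic vector conditions the decoder's initial state while attention weights over object regions are recomputed at every decoding step; the paper's actual fusion and attention details may differ.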
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
Cite this paper
Zia, U., Riaz, M.M., Ghafoor, A. (2022). Topic Guided Image Captioning with Scene and Spatial Features. In: Barolli, L., Hussain, F., Enokido, T. (eds) Advanced Information Networking and Applications. AINA 2022. Lecture Notes in Networks and Systems, vol 450. Springer, Cham. https://doi.org/10.1007/978-3-030-99587-4_16
Print ISBN: 978-3-030-99586-7
Online ISBN: 978-3-030-99587-4