Abstract
Image captioning which aims to automatically describe the content of an image using sentences, has become an attractive task in computer vision and natural language processing domain. Recently, neural network approaches have been proposed and proved to be the most efficient methods for image captioning. However, most of the prior work only considers past semantic context information to generate words in the sentence, lacking the consideration of future textual context. Therefore, in this paper, we propose a bidirectional multimodal Recurrent Neural Network (m-RNN) model which considers both history and future semantic context through a bidirectional recurrent layer. We first employ a pre-trained Convolution Neural Network (CNN) to extract image features and then leverage the bidirectional m-RNN to generate the sentences to describe each input image. Besides, we refine visual features by combining word embedding features and raw image features together to further improve the performance. Experimental results performed on the MS-COCO dataset have demonstrated the superiority of our proposed model compared with the original m-RNN model.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Similar content being viewed by others
References
Tang, J., Shu, X., Qi, Q.J., Li, Z., Wang, M., Yan, S., Jain, R.: Tri-clustered tensor completion for social-aware image tag refinement. IEEE Trans. Pattern Anal. Mach. Intell. 39(8), 1662–1674 (2017)
Tang, J., Shu, X., Li, Z., Qi, Q.J., Wang, J.: Generalized deep transfer networks for knowledge propagation in heterogeneous domains. ACM Trans. Multimed. Comput. Commun. Appl. 12(4) (2016)
Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., Berg, A.C., Berg, T.L.: Babytalk: understanding and generating simple image descriptions. IEEE Trans. Pattern Anal. Mach. Intell. 35(12), 2891–2903 (2013)
Yang, Y., Teo, C.L., Daumé III, H., Aloimonos, Y.: Corpus-guided sentence generation of natural images. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing, pp. 444–454. Association for Computational Linguistics (2011)
Farhadi, A., Hejrati, M., Sadeghi, M.A., Young, P., Rashtchian, C., Hockenmaier, J., Forsyth, D.: Every picture tells a story: generating sentences from images. In: Daniilidis, K., Maragos, P., Paragios, N. (eds.) ECCV 2010. LNCS, vol. 6314, pp. 15–29. Springer, Heidelberg (2010). https://doi.org/10.1007/978-3-642-15561-1_2
Kuznetsova, P., Ordonez, V., Berg, A.C., Berg, T.L., Choi, Y.: Collective generation of natural image descriptions. In: Proceedings of the 50th Annual Meeting of the Association for Computational Linguistics: Long Papers, vol. 1, pp. 359–368. Association for Computational Linguistics (2012)
Li, Z., Tang, J.: Weakly supervised deep matrix factorization for social image understanding. IEEE Trans. Image Process. 26(1), 276–288 (2017)
Mao, J., Xu, W., Yang, Y., Wang, J., Huang, Z., Yuille, A.: Deep captioning with multimodal recurrent neural networks (m-RNN). In: ICLR (2015)
Vinyals, O., Toshev, A., Bengio, S., Erhan, D.: Show and tell: a neural image caption generator. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3156–3164 (2015)
Karpathy, A., Fei-Fei, L.: Deep visual-semantic alignments for generating image descriptions. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3128–3137 (2015)
Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhudinov, R., Zemel, R., Bengio, Y.: Show, attend and tell: neural image caption generation with visual attention. In: International Conference on Machine Learning, pp. 2048–2057 (2015)
Wang, C., Yang, H., Bartz, C., Meinel, C.: Image captioning with deep bidirectional LSTMs. In: Proceedings of the 2016 ACM on Multimedia Conference, pp. 988–997. ACM (2016)
Li, Z., Liu, J., Tang, J., Lu, H.: Robust structured subspace learning for data representation. IEEE Trans. Pattern Anal. Mach. Intell. 37(10), 2085–2098 (2015)
Chen, X., Fang, H., Lin, T.Y., Vedantam, R., Gupta, S., Dollár, P., Zitnick, C.L.: Microsoft COCO captions: data collection and evaluation server (2015). arXiv preprint: arXiv:1504.00325
Szegedy, C., Vanhoucke, V., Ioffe, S., Shlens, J., Wojna, Z.: Rethinking the inception architecture for computer vision. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818–2826 (2016)
Papineni, K., Roukos, S., Ward, T., Zhu, W.J.: BLEU: a method for automatic evaluation of machine translation. In: Proceedings of the 40th Annual Meeting on Association for Computational Linguistics, pp. 311–318. Association for Computational Linguistics (2002)
Banerjee, S., Lavie, A.: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments, pp. 228–231 (2005)
Lin, C.Y.: ROUGE: a package for automatic evaluation of summaries. In: Text Summarization Branches Out: Proceedings of the ACL-2004 Workshop, vol. 8, Barcelona, Spain (2004)
Vedantam, R., Lawrence Zitnick, C., Parikh, D.: CIDEr: consensus-based image description evaluation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4566–4575 (2015)
Acknowledgement
This work was partially supported by the National Natural Science Foundation of China (Grant No. 61522203, 61572252 and 61672285), the Natural Science Foundation of Jiangsu Province (Grant No. BK20140058 and BK20150755).
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2018 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Shu, Y., Zhang, L., Li, Z., Tang, J. (2018). Bidirectional Multimodal Recurrent Neural Networks with Refined Visual Features for Image Captioning. In: Huet, B., Nie, L., Hong, R. (eds) Internet Multimedia Computing and Service. ICIMCS 2017. Communications in Computer and Information Science, vol 819. Springer, Singapore. https://doi.org/10.1007/978-981-10-8530-7_8
Download citation
DOI: https://doi.org/10.1007/978-981-10-8530-7_8
Published:
Publisher Name: Springer, Singapore
Print ISBN: 978-981-10-8529-1
Online ISBN: 978-981-10-8530-7
eBook Packages: Computer ScienceComputer Science (R0)