Abstract
In this paper, we introduce a novel deep semantic indexing method, namely image captioning, for image databases. Our method automatically generates a natural language caption describing an image and uses it as a semantic reference to index the image. Specifically, we use a convolutional localization network to generate a pool of region proposals from an image, and then leverage a visual attention mechanism over these regions to sequentially generate the words of the caption. Compared with previous methods, our approach efficiently generates compact captions, which support a higher level of semantic indexing for image databases. We evaluate our approach on two widely used benchmark datasets: Flickr30K and MS COCO. Experimental results across various evaluation metrics show the superiority of our approach over other visual attention based approaches.
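To make the pipeline concrete, the following Python (PyTorch) sketch illustrates one decoding step of region-based visual attention as described above. It is a minimal, hypothetical sketch rather than the authors' implementation: it assumes region features have already been extracted by a convolutional localization network (e.g., a Faster R-CNN-style proposal network), and all module names, dimensions, and hyperparameters are illustrative assumptions.

# Minimal sketch (PyTorch) of region-based visual attention for captioning.
# Assumes region features are precomputed by a convolutional localization
# network; all names and dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn


class RegionAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, region_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (Bahdanau-style) attention over region proposals.
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # The LSTM consumes the previous word embedding and the attended context.
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, regions, h):
        # regions: (batch, num_regions, region_dim); h: (batch, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)       # attention weight per region
        context = (alpha * regions).sum(dim=1)     # weighted region context
        return context, alpha

    def step(self, prev_word, regions, h, c):
        context, alpha = self.attend(regions, h)
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.lstm(x, (h, c))
        logits = self.word_out(h)                  # scores over the vocabulary
        return logits, h, c, alpha


if __name__ == "__main__":
    batch, num_regions, vocab = 2, 10, 1000
    decoder = RegionAttentionDecoder(vocab)
    regions = torch.randn(batch, num_regions, 512)    # stand-in region features
    h = torch.zeros(batch, 512)
    c = torch.zeros(batch, 512)
    prev_word = torch.zeros(batch, dtype=torch.long)  # e.g. a <start> token id
    logits, h, c, alpha = decoder.step(prev_word, regions, h, c)
    print(logits.shape, alpha.shape)                  # (2, 1000), (2, 10, 1)

At test time, such a step would be applied repeatedly (greedily or with beam search) until an end-of-sentence token is produced, yielding the compact caption used to index the image.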
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Project 61572108 and Project 61502081, the National Thousand-Young-Talents Program of China, and the Fundamental Research Funds for the Central Universities under Project ZYGX2014Z007 and Project ZYGX2015J055.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhang, M., Yang, Y., Zhang, H., Ji, Y., Xie, N., Shen, H.T. (2017). Deep Semantic Indexing Using Convolutional Localization Network with Region-Based Visual Attention for Image Database. In: Huang, Z., Xiao, X., Cao, X. (eds) Databases Theory and Applications. ADC 2017. Lecture Notes in Computer Science, vol. 10538. Springer, Cham. https://doi.org/10.1007/978-3-319-68155-9_20
DOI: https://doi.org/10.1007/978-3-319-68155-9_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68154-2
Online ISBN: 978-3-319-68155-9
eBook Packages: Computer Science (R0)