Abstract
In this paper, we introduce a novel deep semantic indexing method, namely image captioning, for image databases. Our method automatically generates a natural language caption describing an image and uses it as a semantic reference to index the image. Specifically, we use a convolutional localization network to generate a pool of region proposals from an image, and then leverage a visual attention mechanism over these regions to sequentially generate the words of the caption. Compared with previous methods, our approach efficiently generates compact captions, which support a higher level of semantic indexing for image databases. We evaluate our approach on two widely used benchmark datasets: Flickr30K and MS COCO. Experimental results across various evaluation metrics show the superiority of our approach over other visual attention based approaches.
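To make the pipeline concrete, the following Python (PyTorch) sketch illustrates one decoding step of region-based visual attention as described above. It is a minimal, hypothetical sketch rather than the authors' implementation: it assumes region features have already been extracted by a convolutional localization network (e.g., a Faster R-CNN-style proposal network), and all module names, dimensions, and hyperparameters are illustrative assumptions.

# Minimal sketch (PyTorch) of region-based visual attention for captioning.
# Assumes region features are precomputed by a convolutional localization
# network; all names and dimensions are illustrative, not the paper's.
import torch
import torch.nn as nn


class RegionAttentionDecoder(nn.Module):
    def __init__(self, vocab_size, region_dim=512, embed_dim=256, hidden_dim=512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # Additive (Bahdanau-style) attention over region proposals.
        self.att_region = nn.Linear(region_dim, hidden_dim)
        self.att_hidden = nn.Linear(hidden_dim, hidden_dim)
        self.att_score = nn.Linear(hidden_dim, 1)
        # The LSTM consumes the previous word embedding and the attended context.
        self.lstm = nn.LSTMCell(embed_dim + region_dim, hidden_dim)
        self.word_out = nn.Linear(hidden_dim, vocab_size)

    def attend(self, regions, h):
        # regions: (batch, num_regions, region_dim); h: (batch, hidden_dim)
        scores = self.att_score(torch.tanh(
            self.att_region(regions) + self.att_hidden(h).unsqueeze(1)))
        alpha = torch.softmax(scores, dim=1)       # attention weight per region
        context = (alpha * regions).sum(dim=1)     # weighted region context
        return context, alpha

    def step(self, prev_word, regions, h, c):
        context, alpha = self.attend(regions, h)
        x = torch.cat([self.embed(prev_word), context], dim=1)
        h, c = self.lstm(x, (h, c))
        logits = self.word_out(h)                  # scores over the vocabulary
        return logits, h, c, alpha


if __name__ == "__main__":
    batch, num_regions, vocab = 2, 10, 1000
    decoder = RegionAttentionDecoder(vocab)
    regions = torch.randn(batch, num_regions, 512)    # stand-in region features
    h = torch.zeros(batch, 512)
    c = torch.zeros(batch, 512)
    prev_word = torch.zeros(batch, dtype=torch.long)  # e.g. a <start> token id
    logits, h, c, alpha = decoder.step(prev_word, regions, h, c)
    print(logits.shape, alpha.shape)                  # (2, 1000), (2, 10, 1)

At test time, such a step would be applied repeatedly (greedily or with beam search) until an end-of-sentence token is produced, yielding the compact caption used to index the image.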
Acknowledgement
This work was supported in part by the National Natural Science Foundation of China under Project 61572108 and Project 61502081, the National Thousand-Young-Talents Program of China, and the Fundamental Research Funds for the Central Universities under Project ZYGX2014Z007 and Project ZYGX2015J055.
Copyright information
© 2017 Springer International Publishing AG
About this paper
Cite this paper
Zhang, M., Yang, Y., Zhang, H., Ji, Y., Xie, N., Shen, H.T. (2017). Deep Semantic Indexing Using Convolutional Localization Network with Region-Based Visual Attention for Image Database. In: Huang, Z., Xiao, X., Cao, X. (eds) Databases Theory and Applications. ADC 2017. Lecture Notes in Computer Science, vol. 10538. Springer, Cham. https://doi.org/10.1007/978-3-319-68155-9_20
DOI: https://doi.org/10.1007/978-3-319-68155-9_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-68154-2
Online ISBN: 978-3-319-68155-9
eBook Packages: Computer Science (R0)