Abstract
Automatic image annotation aims to predict labels for images according to their semantic contents and has become a research focus in computer vision, as it helps people to edit, retrieve and understand large image collections. In the last decades, researchers have proposed many approaches to solve this task and achieved remarkable performance on several standard image datasets. In this paper, we propose a novel learning to rank approach to address image auto-annotation problem. Unlike typical learning to rank algorithms for image auto-annotation which directly rank annotations for image, our approach consists of two phases. In the first phase, neural ranking models are trained to rank image’s semantic neighbors. Then nearest-neighbor based models propagate annotations from these semantic neighbors to the image. Thus our approach integrates learning to rank algorithms and nearest-neighbor based models, including TagProp and 2PKNN, and inherits their advantages. Experimental results show that our method achieves better or comparable performance compared with the state-of-the-art methods on four challenging benchmarks including Corel5K, ESP Games, IAPR TC-12 and NUS-WIDE.
Similar content being viewed by others
Notes
We also tried to use ReLU which perform slightly inferior than using SER (F1 values decrease %4 ∼ %7 in our experiments on four datasets.)
The source code of TagProp is available at: http://lear.inrialpes.fr/people/guillaumin/code.php#tagprop
The source code of 2PKNN is available at: http://researchweb.iiit.ac.in/yashaswi.verma/eccv12/2pknn.zip
These features are available at: http://lear.inrialpes.fr/people/guillaumin/data.php
These features can be find at: http://lms.comp.nus.edu.sg/research/NUS-WIDE.htm
References
Agrawal A, Lu J, Antol S (2015) Vqa: Visual question answering. Int J Comput Vis 123(1):4–31
Ballan L, Uricchio T, Seidenari L, Bimbo AD (2014) A cross-media model for automatic image annotation. In: ACM ICMR, pp 73–80
Blei D, Jordan M (2003) Modeling annotated data. In: ACM SIGIR, pp 127–134
Breiman L (2001) Random forests. Mach Learn 45(1):5–32
Burges C (2005) Learning to rank using gradient descent. In: ICML, pp 89–96
Burges C (2010) From ranknet to lambdarank to lambdamart: An overview. In: Technical report, Microsoft Research
Cai D, He X, Han J (2007) Semi-supervised discriminant analysis. In: ICCV
Cao Z, Qin T (2007) Learning to rank: from pairwise approach to listwise approach. In: ICML, pp 129–136
Carneiro G, Chan A, Moreno P, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410
Chatfield K, Lempitsky V, Vedaldi A, Zisserman A (2011) The devil is in the details: an evaluation of recent feature encoding methods. In: BMVC, pp 1–12
Chopra S, Hadsell R, LeCun Y (2005) Learning a similarity metric discriminatively, with application to face verification. In: CVPR, pp 539–546
Dehghani M, Zamani H, Severyn A, Kamps J, Croft WB (2017) Neural ranking models with weak supervision. In: ACM SIGIR, pp 65–74
Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L (2009) Imagenet: A large-scale hierarchical image database. In: CVPR, pp 248–255
Fabian L, Michael J, Nebojsa J (2013) Efficient ranking from pairwise comparisons. In: ICML, pp 109–117
Fenga S, Manmatha R, Lavrenko V (2004) Multiple bernoulli relevance models for image and video annotation. In: CVPR, pp 1002–1009
Fernando B, Anderson P, Hutter M, Gould S (2016) Discriminative hierarchical rank pooling for activity recognition. In: CVPR, pp 1924–1932
Fernando B, Gawes E, Oramas J, Ghodrati J, Tuytelaars T (2017) Rank pooling for action recognition. IEEE Trans Pattern Anal Mach Intell 39(4):773–787
Fu H, Zhang Q, Qiu G (2012) Random forest for image annotation. In: ECCV, pp 86–99
Gao Z, Nie W, Liu A (2016) Evaluation of local spatial-temporal features for cross-view action recognition. Neurocomputing 173(1):110–117
Gao Z, Zhang H, Liu A (2016) Human action recognition on depth dataset. Neural Comput Applic 27(7):2047–2054
Gao Z, Zhang L, Chen M (2014) Enhanced and hierarchical structure algorithm for data imbalance problem in semantic extraction under massive video dataset. Multimedia Tools Appl 68(3):641–657
Gong Y, Jia Y, Leung T, Toshev A, Ioffe S (2014) Deep convolutional ranking for multilabel image annotation. arXiv:13124894
Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233
Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-setence embeddings using large weakly annotated photo collections. In: ECCV, pp 529–545
Gu Y, Xue H, Yang J (2016) Cross-modal saliency correlation for image annotation. Neural Process Lett 45(3):777–789
Guillaumin M, Mensink T, Verbeek J, Schmid C (2009) Tagprop: Discriminative metric learning in nearest neighbor models for image auto-annotation. In: ICCV, pp 309–316
Hardoon D, Szedmak S, Shawe-Taylor J (2004) Cannonical correlation analysis: An overview with application to learning methods. Neural Comput 16(12):2639–2664
He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778
Ioffe S, Szegedy C (2015) Batch normalization: Accelerating deep network training by reducing internal covariate shift. In: ICML, pp 448–456
Jeon J, Lavreko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: ACM SIGIR, pp 119–126
Joachims T (2002) Optimizing search engines using clickthrough data. In: ACM SIGKDD, pp 133–142
Johnson J, Ballan L, Fei-Fei L (2015) Love thy neighbors: Image annotation by exploiting image metadata. In: ICCV, pp 4624–4632
Kang F, Sukthankar R (2006) Correlated label propagation with application to multi-label learning. In: CVPR, pp 1719–1726
Kingma D, Ba J (2014) Adam: A method for stochastic optimization. arXiv:14126980
Kiros R, Szepesvari C (2015) Deep representations and codes for image auto-annotation. In: NIPS, pp 917–925
Klein B, Lev G, Sadeh G, Wolf L (2015) Fisher vectors derived from hybrid gaussian-laplacian mixture models for image annotation. arXiv:14117399
Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: NIPS, pp 1106–1114
Lavrenko V, Manmatha R, Jeon J (2004) A model for learning the semantics of pictures. In: NIPS, pp 553–560
Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In: CVPR, pp 2169–2178
Li X, Snoek C, Worring M (2007) Learning social tag relevance by neighbor voting. IEEE TMM 11(7):1310–1322
Li Z, Liu J, Xu C, Lu H (2013) Mlrank: Multi-correlation learning to rank for image annotation. Pattern Recogn 46(10):2700–2710
Liu J, Li M, Liu Q, Lu H, Ma S (2009) Image annotation via graph learning. Pattern Recogn 42(2):218–228
Liu T (2009) Learning to rank for information retrieval. Found Trends Inf Retr 3(3):225–331
Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110
Makadia A, Pavlovic V, Kumar S (2008) A new baseline for image annotation. In: ECCV, pp 316–329
Makadia A, Pavlovic V, Kumar S (2010) Baselines for image annotation. Int J Comput Vis 90(1):88–105
Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:13013781
Montazer G, Giveki D (2017) Scene classification using multi-resolution waholb features and neural network classifier. Neural Process Lett 46(2):681–704
Moran S, Lanvrenko V (2014) Sparse kernel learning for image annotation. In: ACM ICMR, p 113
Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42(3):145–175
Peng X, Zou C, Qiao Y, Peng Q (2010) Action recognition with stacked fisher vectors. In: ECCV, pp 581–595
Perronnin F, Sanchez J, Mensink T (2010) Improving the fisher kernel for large scale image classification. In: ECCV, pp 143–156
Song Y, Zhuang Z, Li H, Zhao Q, Li J, Lee W, Giles CL (2008) Real-time automatic tag recommendation. In: ACM SIGIR, pp 515–522
Thomas D, Andreas K, Joel W (2014) Parallelizing exploration-exploitation tradeoffs in gaussian process bandit optimization. J Mach Learn Res 15(1):3873–3923
Thorsten J (2006) Training linear svms in linear time. In: KDD, pp 217–226
Venkatesh N, Subhransu M, Manmatha R (2015) Automatic image annotation using deep learning representations. In: ACM ICMR, pp 603–606
Verma Y, Jawahar C (2012) Image annotation using metric learning in semantic neighbourhoods. In: ECCV, pp 836–849
Verma Y, Jawahar C (2013) Exploring svm for image annotation in presence of confusing labels. In: British Machine Vision Conference, pp 1–11
Wang J, Yang Y, Mao J, Huang Z, Huang C, Xu W (2016) Cnn-rnn: A unified framework for multi-label image classification. In: CVPR, pp 2285–2294
Wang L, Liu L, Khan L (2004) Automatic image annotation and retrieval ussing subspace clustering algorithm. In: ACM International Workshop Multimedia Databases, pp 100–108
Weston J, Bengio S, Usunier N (2011) Wsabie: Scaling up to large vocabulary image annotation. In: IJCAI, pp 2764–2770
Wu F, Jing X, Yue D (2017) Multi-view discriminant dictionary learning via learning view-specific and shared structured dictionaries for image classification. Neural Process Lett 45(2):649–666
Yan X, Su XG (2009) Linear regression analysis: Theory and computing. World Scientfic Publishing Co, Inc, River Edge
Yan Y, Nie F, Li W, Gao C, Yang Y, Xu D (2016) Image classification by cross-media active learning with privileged information. IEEE Trans Multimedia 18(12):2494–2502
Yang C, Dong M, Hua J (2007) Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning. In: CVPR, pp 2057–2063
Yang Y, Xu D, Nie F, Yan S, Zhuang Y (2010) Image clustering using local discriminant models and global integration. IEEE Trans Image Process 19(10):2761–2773
Yang Y, Nie F, Xu D, Luo J, Zhuang Y, Pan Y (2012) A multimedia retrieval framework based on semi-supervised ranking and relevance feedback. IEEE Trans Pattern Anal Mach Intell 34(4):723–742
Yun H, Raman P, Vishwanathan S (2014) Ranking via robust binary classification. In: NIPS, pp 2582–2590
Zhang S, Huang J, Huang Y (2010) Automatic image annotation using group sparsity. In: CVPR, pp 3312–3319
Zhu L, Xu Z, Yang Y, Hauptmann AG (2017) Uncovering the temporal context for video question answering. Int J Comput Vis 124(3):409–421
Acknowledgements
This work is supported by the Natural Science Foundation of China (No. 61572162) and the Zhejiang Provincial Key Science and Technology Project Foundation (No. 2018C01012).
Author information
Authors and Affiliations
Corresponding authors
Ethics declarations
Conflict of interests
The authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Rights and permissions
About this article
Cite this article
Zhang, W., Hu, H. & Hu, H. Neural ranking for automatic image annotation. Multimed Tools Appl 77, 22385–22406 (2018). https://doi.org/10.1007/s11042-018-5973-x
Received:
Revised:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s11042-018-5973-x