
Training Visual-Semantic Embedding Network for Boosting Automatic Image Annotation

Published in: Neural Processing Letters

Abstract

Image auto-annotation, which labels images according to their semantic content, has become a research focus in computer vision because it helps people edit, retrieve, and understand large image collections. Over the past decades, researchers have proposed many approaches to this task and achieved remarkable performance on several standard image datasets. In this paper, we train neural networks with a visual and semantic ranking loss to learn a visual-semantic embedding. This embedding can be readily plugged into nearest-neighbor based models to boost their performance on image auto-annotation. We evaluate our method on four challenging image datasets and report comparisons with existing work. Experimental results show that our method can be applied to several state-of-the-art nearest-neighbor based models, including TagProp and 2PKNN, and significantly improves their performance.
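The paper's exact loss is not reproduced on this page. As a rough illustration only, a bidirectional hinge ranking loss over matched image/tag embeddings (a common formulation for visual-semantic embedding learning, e.g. in WSABIE-style models; the function name, cosine similarity, and margin value below are assumptions, not the authors' specification) can be sketched as:

```python
import numpy as np

def ranking_loss(image_emb, tag_emb, margin=0.1):
    """Bidirectional hinge ranking loss over L2-normalised embeddings.

    image_emb, tag_emb: (n, d) arrays where row i of each is a matched
    image/tag pair; every other row in the batch acts as a negative.
    """
    # Cosine similarity between every image and every tag in the batch.
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    tag = tag_emb / np.linalg.norm(tag_emb, axis=1, keepdims=True)
    sim = img @ tag.T                       # (n, n) similarity matrix
    pos = np.diag(sim)                      # similarities of matched pairs
    # Hinge: each negative should score at least `margin` below the positive.
    cost_img = np.maximum(0, margin + sim - pos[:, None])  # image -> wrong tag
    cost_tag = np.maximum(0, margin + sim - pos[None, :])  # tag -> wrong image
    np.fill_diagonal(cost_img, 0)           # matched pairs incur no cost
    np.fill_diagonal(cost_tag, 0)
    return (cost_img.sum() + cost_tag.sum()) / len(pos)
```

When matched pairs already dominate all negatives by the margin, the loss is zero; minimising it pulls matched image and tag vectors together while pushing mismatched pairs apart in both directions.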



Notes

  1. We also tried ReLU, which performs slightly worse than SER (F1 scores decrease by 3–5% in our experiments on the four datasets).

  2. The source code of TagProp is available at: http://lear.inrialpes.fr/people/guillaumin/code.php#tagprop.

  3. The source code of 2PKNN is available at: http://researchweb.iiit.ac.in/~yashaswi.verma/eccv12/2pknn.zip.

  4. These features are available at: http://lear.inrialpes.fr/people/guillaumin/data.php.
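The TagProp and 2PKNN implementations linked in the notes above learn distance-based weights over neighbors; as a much simpler hedged sketch of how a learned embedding plugs into a nearest-neighbor annotation model, inverse-distance tag voting over the k nearest training images (the function name and weighting scheme here are illustrative assumptions, not either model's actual formulation) might look like:

```python
import numpy as np

def annotate(query_emb, train_embs, train_tags, k=5, n_labels=5):
    """Annotate a query image by voting over its k nearest training images
    in the embedding space (simplified neighbor voting; TagProp and 2PKNN
    additionally learn the neighbor weights rather than fixing them).

    train_embs: (n, d) training-image embeddings.
    train_tags: (n, vocab) binary matrix of ground-truth tags.
    Returns the indices of the n_labels highest-scoring tags.
    """
    d = np.linalg.norm(train_embs - query_emb, axis=1)  # Euclidean distances
    nn = np.argsort(d)[:k]                              # k nearest neighbors
    # Weight each neighbor's tags by inverse distance (hypothetical choice).
    w = 1.0 / (d[nn] + 1e-8)
    scores = (w[:, None] * train_tags[nn]).sum(axis=0)
    return np.argsort(scores)[::-1][:n_labels]          # top-scoring labels
```

The point of the paper's embedding is that distances in this space reflect semantic similarity, so such neighbor voting retrieves more relevant tags than voting in the raw visual-feature space.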

References

  1. Ballan L, Uricchio T, Seidenari L, Bimbo AD (2014) A cross-media model for automatic image annotation. In: ACM ICMR, pp 73–80

  2. Blei D, Jordan M (2003) Modeling annotated data. In: ACM SIGIR, pp 127–134

  3. Carneiro G, Chan A, Moreno P, Vasconcelos N (2007) Supervised learning of semantic classes for image annotation and retrieval. IEEE Trans Pattern Anal Mach Intell 29(3):394–410


  4. Chatfield K, Lempitsky V, Vedaldi A, Zisserman A (2011) The devil is in the details: an evaluation of recent feature encoding methods. In: BMVC, pp 1–12

  5. Deng J, Dong W, Socher R, Li L, Li K, Fei-Fei L (2009) Imagenet: a large-scale hierarchical image database. In: CVPR, pp 248–255

  6. Feng S, Manmatha R, Lavrenko V (2004) Multiple Bernoulli relevance models for image and video annotation. In: CVPR, pp 1002–1009

  7. Fernando B, Anderson P, Hutter M, Gould S (2016) Discriminative hierarchical rank pooling for activity recognition. In: CVPR, pp 1924–1932

  8. Fernando B, Gavves E, Oramas J, Ghodrati A, Tuytelaars T (2017) Rank pooling for action recognition. IEEE Trans Pattern Anal Mach Intell 39(4):773–787


  9. Fu H, Zhang Q, Qiu G (2012) Random forest for image annotation. In: ECCV, pp 86–99

  10. Gong Y, Jia Y, Leung T, Toshev A, Ioffe S (2014) Deep convolutional ranking for multilabel image annotation. In: CoRR, arXiv:1312.4894

  11. Gong Y, Ke Q, Isard M, Lazebnik S (2014) A multi-view embedding space for modeling internet images, tags, and their semantics. Int J Comput Vis 106(2):210–233


  12. Gong Y, Wang L, Hodosh M, Hockenmaier J, Lazebnik S (2014) Improving image-sentence embeddings using large weakly annotated photo collections. In: ECCV, pp 529–545

  13. Gu Y, Xue H, Yang J (2016) Cross-modal saliency correlation for image annotation. Neural Process Lett 45(3):777–789


  14. Guillaumin M, Mensink T, Verbeek J, Schmid C (2009) Tagprop: discriminative metric learning in nearest neighbor models for image auto-annotation. In: ICCV, pp 309–316

  15. Hardoon D, Szedmak S, Shawe-Taylor J (2004) Canonical correlation analysis: an overview with application to learning methods. Neural Comput 16(12):2639–2664


  16. He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: CVPR, pp 770–778

  17. Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. In: ICML, pp 448–456

  18. Jeon J, Lavrenko V, Manmatha R (2003) Automatic image annotation and retrieval using cross-media relevance models. In: ACM SIGIR, pp 119–126

  19. Joachims T (2002) Optimizing search engines using clickthrough data. In: ACM SIGKDD, pp 133–142

  20. Johnson J, Ballan L, Fei-Fei L (2015) Love thy neighbors: image annotation by exploiting image metadata. In: ICCV, pp 4624–4632

  21. Kiros R, Szepesvari C (2012) Deep representations and codes for image auto-annotation. In: NIPS, pp 917–925

  22. Klein B, Lev G, Sadeh G, Wolf L (2015) Fisher vectors derived from hybrid Gaussian–Laplacian mixture models for image annotation. In: CoRR, arXiv:1411.7399

  23. Klein B, Lev G, Sadeh G, Wolf L (2015) Fisher vectors derived from hybrid Gaussian–Laplacian mixture models for image annotation. In: CVPR

  24. Krizhevsky A, Sutskever I, Hinton G (2012) Imagenet classification with deep convolutional neural networks. In: NIPS, pp 1106–1114

  25. Lavrenko V, Manmatha R, Jeon J (2004) A model for learning the semantics of pictures. In: NIPS, pp 553–560

  26. Lazebnik S, Schmid C, Ponce J (2006) Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In: CVPR, pp 2169–2178

  27. Li X, Snoek C, Worring M (2007) Learning social tag relevance by neighbor voting. IEEE TMM 11(7):1310–1322


  28. Liu Y, Xu D, Tsang I, Luo J (2007) Using large-scale web data to facilitate textual query based retrieval of consumer photos. In: ACM MM, pp 1277–1283

  29. Lowe D (2004) Distinctive image features from scale-invariant keypoints. IJCV 60(2):91–110


  30. Makadia A, Pavlovic V, Kumar S (2008) A new baseline for image annotation. In: ECCV, pp 316–329

  31. Makadia A, Pavlovic V, Kumar S (2010) Baselines for image annotation. Int J Comput Vis 90(1):88–105


  32. Mikolov T, Chen K, Corrado G, Dean J (2013) Efficient estimation of word representations in vector space. arXiv:1301.3781

  33. Montazer G, Giveki D (2017) Scene classification using multi-resolution WAHOLB features and neural network classifier. Neural Process Lett 46(2):681–704


  34. Moran S, Lavrenko V (2014) Sparse kernel learning for image annotation. In: ACM ICMR, p 113

  35. Oliva A, Torralba A (2001) Modeling the shape of the scene: a holistic representation of the spatial envelope. IJCV 42(3):145–175


  36. Peng X, Zou C, Qiao Y, Peng Q (2014) Action recognition with stacked fisher vectors. In: ECCV, pp 581–595

  37. Perronnin F, Sanchez J, Mensink T (2010) Improving the fisher kernel for large scale image classification. In: ECCV, pp 143–156

  38. Simonyan K, Zisserman A (2015) Very deep convolutional networks for large scale image recognition. In: ICLR

  39. Song Y, Zhuang Z, Li H, Zhao Q, Li J, Lee W, Giles CL (2008) Real-time automatic tag recommendation. In: ACM SIGIR, pp 515–522

  40. Murthy VN, Maji S, Manmatha R (2015) Automatic image annotation using deep learning representations. In: ACM ICMR, pp 603–606

  41. Verma Y, Jawahar C (2012) Image annotation using metric learning in semantic neighbourhoods. In: ECCV, pp 836–849

  42. Verma Y, Jawahar C (2013) Exploring SVM for image annotation in presence of confusing labels. In: British machine vision conference, pp 1–11

  43. Wang G, Hoiem D, Forsyth D (2009) Building text features for object image classification. In: CVPR, pp 1367–1374

  44. Wang J, Yang Y, Mao J, Huang Z, Huang C, Xu W (2016) Cnn-rnn: a unified framework for multi-label image classification. In: CVPR, pp 2285–2294

  45. Wang L, Liu L, Khan L (2004) Automatic image annotation and retrieval using subspace clustering algorithm. In: ACM international workshop multimedia databases, pp 100–108

  46. Weston J, Bengio S, Usunier N (2011) Wsabie: scaling up to large vocabulary image annotation. In: IJCAI, pp 2764–2770

  47. Wu F, Jing X, Yue D (2017) Multi-view discriminant dictionary learning via learning view-specific and shared structured dictionaries for image classification. Neural Process Lett 45:649–666


  48. Yang C, Dong M, Hua J (2007) Region-based image annotation using asymmetrical support vector machine-based multiple-instance learning. In: CVPR, pp 2057–2063

  49. Yun H, Raman P, Vishwanathan S (2014) Ranking via robust binary classification. In: NIPS, pp 2582–2590

  50. Zhang S, Huang J, Huang Y (2010) Automatic image annotation using group sparsity. In: CVPR, pp 3312–3319


Acknowledgements

This work is supported by the Natural Science Foundation of China (No. 61572162) and the Zhejiang Provincial Key Science and Technology Project Foundation (No. 2017C01010).

Author information

Authors and Affiliations

Authors

Corresponding authors

Correspondence to Hua Hu or Haiyang Hu.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Ethical Approval

This article does not contain any studies with human participants or animals performed by any of the authors.


Cite this article

Zhang, W., Hu, H. & Hu, H. Training Visual-Semantic Embedding Network for Boosting Automatic Image Annotation. Neural Process Lett 48, 1503–1519 (2018). https://doi.org/10.1007/s11063-017-9753-9

