Image Annotation by Propagating Labels from Semantic Neighbourhoods

International Journal of Computer Vision

Abstract

Automatic image annotation aims at predicting a set of semantic labels for an image. Because the annotation vocabulary is large, the number of images available per label varies widely (“class-imbalance”). Additionally, due to the limitations of human annotation, many images are not annotated with all of their relevant labels (“incomplete-labelling”). These two issues affect the performance of most existing image annotation models. In this work, we propose the 2-pass k-nearest neighbour (2PKNN) algorithm, a two-step variant of the classical k-nearest neighbour algorithm that addresses these issues in the image annotation task. The first step of 2PKNN uses “image-to-label” similarities, while the second step uses “image-to-image” similarities, thus combining the benefits of both. We also propose a metric learning framework over 2PKNN. This is done in a large-margin set-up by generalizing a well-known (single-label) classification metric learning algorithm to multi-label data. In addition to the features provided by Guillaumin et al. (2009), which are used by almost all recent image annotation methods, we benchmark using new features that include features extracted from a generic convolutional neural network model and those computed using modern encoding techniques. We also learn linear and kernelized cross-modal embeddings over different feature combinations to reduce the semantic gap between visual features and textual labels. Extensive evaluations on four image annotation datasets (Corel-5K, ESP-Game, IAPR-TC12 and MIRFlickr-25K) demonstrate that our method achieves promising results and establishes a new state-of-the-art on the prevailing image annotation datasets.
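For readers unfamiliar with the two-pass idea, the following Python sketch illustrates it under simplifying assumptions: a single Euclidean base distance, exponential vote weights, and illustrative hyper-parameters k1, beta and n_keep. It is not the paper's exact formulation, whose distances, weights and hyper-parameters are learned and tuned as described in the main text.

```python
import numpy as np

def two_pass_knn(test_feat, train_feats, train_labels, vocab_size,
                 k1=5, beta=1.0, n_keep=5):
    """Hedged sketch of a two-pass kNN annotator in the spirit of 2PKNN.

    train_feats: (N, D) array of training image features.
    train_labels: list of label-index sets, one per training image.
    k1, beta, n_keep: illustrative hyper-parameters, not the paper's values.
    """
    # Base image-to-image distances (plain Euclidean for simplicity).
    dists = np.linalg.norm(train_feats - test_feat, axis=1)

    # Pass 1: for every label, keep the k1 closest training images that
    # carry it (its "semantic neighbourhood"); pooling these bounds each
    # label's representation and counters class-imbalance.
    neighbourhood = set()
    for label in range(vocab_size):
        carriers = [i for i, labs in enumerate(train_labels) if label in labs]
        carriers.sort(key=lambda i: dists[i])
        neighbourhood.update(carriers[:k1])

    # Pass 2: weighted label propagation from the pooled neighbourhood,
    # using image-to-image similarities as vote weights.
    scores = np.zeros(vocab_size)
    for i in neighbourhood:
        weight = np.exp(-beta * dists[i])
        for label in train_labels[i]:
            scores[label] += weight

    # Return the indices of the top-scoring labels as the annotation.
    return np.argsort(-scores)[:n_keep]
```

The first pass caps how many images any single label contributes, while the second lets visually closer neighbours dominate the vote; this is the intuition behind combining image-to-label and image-to-image similarities.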


Notes

  1. We shall use the terms class/label interchangeably.

  2. These features are available at http://lear.inrialpes.fr/people/guillaumin/data.php.

  3. An implementation of 2PKNN and metric learning is available at http://researchweb.iiit.ac.in/~yashaswi.verma/eccv12/2pknn.zip.

  4. Features indexed by (3), (4), and (9) to (15) in Table 2.

  5. The code is available at http://lear.inrialpes.fr/people/guillaumin/code.php.

References

  • Anderson, C. (2006). The long tail: Why the future of business is selling less of more. Hyperion.

  • Ballan, L., Uricchio, T., Seidenari, L., & Bimbo, A. D. (2014). A cross-media model for automatic image annotation. In Proceedings of the ICMR.

  • Carneiro, G., Chan, A. B., Moreno, P. J., & Vasconcelos, N. (2007). Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3), 394–410.

  • Chen, M., Zheng, A., & Weinberger, K. Q. (2013). Fast image tagging. In Proceedings of the ICML.

  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., et al. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the ICML.

  • Duygulu, P., Barnard, K., de Freitas, J. F., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the ECCV (pp. 97–112).

  • Feng, S. L., Manmatha, R., & Lavrenko, V. (2004). Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the CVPR (pp. 1002–1009).

  • Fu, H., Zhang, Q., & Qiu, G. (2012). Random forest for image annotation. In Proceedings of the ECCV (pp. 86–99).

  • Grubinger, M. (2007). Analysis and evaluation of visual information systems performance. PhD thesis, Victoria University, Melbourne, Australia.

  • Guillaumin, M., Mensink, T., Verbeek, J. J., & Schmid, C. (2009). TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Proceedings of the ICCV (pp. 309–316).

  • Gupta, A., Verma, Y., & Jawahar, C. V. (2012). Choosing linguistics over vision to describe images. In Proceedings of the AAAI.

  • Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

  • Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.

  • Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In MIR.

  • Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In Proceedings of the CVPR (pp. 3304–3311).

  • Jeon, J., Lavrenko, V., & Manmatha, R. (2003). Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the ACM SIGIR (pp. 119–126).

  • Jin, R., Wang, S., & Zhou, Z. H. (2009). Learning a distance metric from multi-instance multi-label data. In Proceedings of the CVPR (pp. 896–902).

  • Kalayeh, M. M., Idrees, H., & Shah, M. (2014). NMF-KNN: Image annotation using weighted multi-view non-negative matrix factorization. In Proceedings of the CVPR.

  • Lavrenko, V., Manmatha, R., & Jeon, J. (2003). A model for learning the semantics of pictures. In NIPS.

  • Li, X., Snoek, C. G. M., & Worring, M. (2009). Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia, 11(7), 1310–1322.

  • Liu, J., Li, M., Liu, Q., Lu, H., & Ma, S. (2009). Image annotation via graph learning. Pattern Recognition, 42(2), 218–228.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In Proceedings of the ECCV (pp. 316–329).

  • Makadia, A., Pavlovic, V., & Kumar, S. (2010). Baselines for image annotation. International Journal of Computer Vision, 90(1), 88–105.

  • Metzler, D., & Manmatha, R. (2004). An inference network approach to image retrieval. In Proceedings of the CIVR (pp. 42–50).

  • Moran, S., & Lavrenko, V. (2011). Optimal tag sets for automatic image annotation. In Proceedings of the BMVC (pp. 1.1–1.11).

  • Moran, S., & Lavrenko, V. (2014). A sparse kernel relevance model for automatic image annotation. International Journal of Multimedia Information Retrieval, 3(4), 209–229.

  • Mori, Y., Takahashi, H., & Oka, R. (1999). Image-to-word transformation based on dividing and vector quantizing images with words. In MISRM’99 first international workshop on multimedia intelligent storage and retrieval management.

  • Murthy, V. N., Can, E. F., & Manmatha, R. (2014). A hybrid model for automatic image annotation. In Proceedings of the ICMR.

  • Nakayama, H. (2011). Linear distance metric learning for large-scale generic image recognition. PhD thesis, The University of Tokyo, Japan.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.

  • Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the fisher kernel for large-scale image classification. In Proceedings of the ECCV (pp. 143–156).

  • Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the ICML (pp. 807–814).

  • van de Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In Proceedings of the ECCV (pp. 334–348).

  • Verbeek, J., Guillaumin, M., Mensink, T., & Schmid, C. (2010). Image Annotation with TagProp on the MIRFLICKR set. In MIR.

  • Verma, Y., & Jawahar, C. V. (2012). Image annotation using metric learning in semantic neighbourhoods. In Proceedings of the ECCV (pp. 836–849).

  • Verma, Y., & Jawahar, C. V. (2013). Exploring SVM for image annotation in presence of confusing labels. In Proceedings of the BMVC.

  • von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In SIGCHI conference on human factors in computing systems (pp. 319–326).

  • Wang, C., Blei, D., & Fei-Fei, L. (2009). Simultaneous image classification and annotation. In Proceedings of the CVPR.

  • Wang, H., Huang, H., & Ding, C. H. Q. (2011). Image annotation using bi-relational graph of images and semantic labels. In Proceedings of the CVPR (pp. 793–800).

  • Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, 207–244.

  • Xiang, Y., Zhou, X., Chua, T. S., & Ngo, C. W. (2009). A revisit of generative model for automatic image annotation using Markov random fields. In Proceedings of the CVPR (pp. 1153–1160).

  • Yavlinsky, A., Schofield, E., & Rüger, S. (2005). Automated image annotation using global features and robust nonparametric density estimation. In Proceedings of the CIVR (pp. 507–517).

  • Zhang, S., Huang, J., Huang, Y., Yu, Y., Li, H., & Metaxas, D. N. (2010). Automatic image annotation using group sparsity. In Proceedings of the CVPR (pp. 3312–3319).

Acknowledgments

We thank Prof. Raghavan Manmatha for sharing the Corel-5K dataset, and the anonymous reviewers for their helpful comments. Yashaswi Verma is partially supported by Microsoft Research India PhD fellowship 2013.

Author information

Correspondence to Yashaswi Verma.

Additional information

Communicated by Shin’ichi Satoh.

Appendix

In our conference paper (Verma and Jawahar 2012), we proposed to learn a combined distance metric in both the distance and feature spaces, denoted by \(\mathbf {w}\) and \(\mathbf {v}\) respectively. While the dimensionality of \(\mathbf {w}\) equals the number of distinct features used to represent a sample, that of \(\mathbf {v}\) equals the combined dimensionality of all the individual features. As a result, learning \(\mathbf {v}\) from limited examples became difficult as the number of features increased, and its computational cost was also high. Hence, in this paper we consider only the \(\mathbf {w}\) metric. Empirically, we found that the performance obtained using only the \(\mathbf {w}\) metric is usually comparable to that obtained using both \(\mathbf {w}\) and \(\mathbf {v}\). For example, in Table 10 we compare their performance using three feature combinations on the Corel-5K dataset. Although we compromise slightly on performance, the difference is not significant; more importantly, the training time improves by several orders of magnitude.

Table 10 Performance comparison in terms of F1, N+ and training time (in hours) between our previous metric learning approach (Verma and Jawahar 2012), which learns a combined metric in the distance and feature spaces (2PKNN + ML (w + v)), and our current approach, which learns a metric only in the distance space (2PKNN + ML (w)), on the Corel-5K dataset
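For concreteness, here is a minimal Python sketch of how a \(\mathbf {w}\)-style metric combines per-feature base distances. The feature list, the Euclidean base distance and the function name are illustrative assumptions, not the paper's implementation; in the paper the weights are learned in the large-margin set-up described in the main text.

```python
import numpy as np

def w_combined_distance(x_feats, y_feats, w):
    """Combine per-feature base distances with one weight per feature type.

    x_feats, y_feats: lists of feature vectors, one per feature type
    (e.g. colour histogram, bag-of-words, CNN activations).
    w: non-negative weights with len(w) equal to the number of feature
    types, matching the dimensionality of the w metric discussed above.
    (A v-style metric would instead carry one weight per dimension of
    the concatenated features, which is far larger.)
    """
    base = np.array([np.linalg.norm(xf - yf)
                     for xf, yf in zip(x_feats, y_feats)])
    return float(np.dot(w, base))
```

The contrast in dimensionality between w and v is what drives the difference in training time reported in Table 10: far fewer parameters need to be estimated from the same limited training data.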

Cite this article

Verma, Y., Jawahar, C.V. Image Annotation by Propagating Labels from Semantic Neighbourhoods. Int J Comput Vis 121, 126–148 (2017). https://doi.org/10.1007/s11263-016-0927-0
