Image Annotation by Propagating Labels from Semantic Neighbourhoods

International Journal of Computer Vision

Abstract

Automatic image annotation aims at predicting a set of semantic labels for an image. Because the annotation vocabulary is large, the number of images available per label varies widely (“class-imbalance”). Additionally, due to the limitations of human annotation, many images are not annotated with all of their relevant labels (“incomplete-labelling”). These two issues affect the performance of most existing image annotation models. In this work, we propose the 2-pass k-nearest neighbour (2PKNN) algorithm, a two-step variant of the classical k-nearest neighbour algorithm that addresses these issues in the image annotation task. The first step of 2PKNN uses “image-to-label” similarities, while the second step uses “image-to-image” similarities, thus combining the benefits of both. We also propose a metric learning framework over 2PKNN. This is done in a large-margin set-up by generalizing a well-known (single-label) classification metric learning algorithm to multi-label data. In addition to the features provided by Guillaumin et al. (2009), which are used by almost all recent image annotation methods, we benchmark using new features that include features extracted from a generic convolutional neural network model and those computed using modern encoding techniques. We also learn linear and kernelized cross-modal embeddings over different feature combinations to reduce the semantic gap between visual features and textual labels. Extensive evaluations on four image annotation datasets (Corel-5K, ESP-Game, IAPR-TC12 and MIRFlickr-25K) demonstrate that our method achieves promising results and establishes a new state-of-the-art on the prevailing image annotation datasets.
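For readers unfamiliar with the two-pass idea, the following Python sketch illustrates it under simplifying assumptions: a single Euclidean base distance, exponential vote weights, and illustrative hyper-parameters k1, beta and n_keep. It is not the paper's exact formulation, whose distances, weights and hyper-parameters are learned and tuned as described in the main text.

```python
import numpy as np

def two_pass_knn(test_feat, train_feats, train_labels, vocab_size,
                 k1=5, beta=1.0, n_keep=5):
    """Hedged sketch of a two-pass kNN annotator in the spirit of 2PKNN.

    train_feats: (N, D) array of training image features.
    train_labels: list of label-index sets, one per training image.
    k1, beta, n_keep: illustrative hyper-parameters, not the paper's values.
    """
    # Base image-to-image distances (plain Euclidean for simplicity).
    dists = np.linalg.norm(train_feats - test_feat, axis=1)

    # Pass 1: for every label, keep the k1 closest training images that
    # carry it (its "semantic neighbourhood"); pooling these bounds each
    # label's representation and counters class-imbalance.
    neighbourhood = set()
    for label in range(vocab_size):
        carriers = [i for i, labs in enumerate(train_labels) if label in labs]
        carriers.sort(key=lambda i: dists[i])
        neighbourhood.update(carriers[:k1])

    # Pass 2: weighted label propagation from the pooled neighbourhood,
    # using image-to-image similarities as vote weights.
    scores = np.zeros(vocab_size)
    for i in neighbourhood:
        weight = np.exp(-beta * dists[i])
        for label in train_labels[i]:
            scores[label] += weight

    # Return the indices of the top-scoring labels as the annotation.
    return np.argsort(-scores)[:n_keep]
```

The first pass caps how many images any single label contributes, while the second lets visually closer neighbours dominate the vote; this is the intuition behind combining image-to-label and image-to-image similarities.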


Notes

  1. We shall use the terms class/label interchangeably.

  2. These features are available at http://lear.inrialpes.fr/people/guillaumin/data.php.

  3. An implementation of 2PKNN and metric learning is available at http://researchweb.iiit.ac.in/~yashaswi.verma/eccv12/2pknn.zip.

  4. Features indexed by (3), (4), and (9) to (15) in Table 2.

  5. The code is available at http://lear.inrialpes.fr/people/guillaumin/code.php.

References

  • Anderson, C. (2006). The long tail: Why the future of business is selling less of more. Hyperion.

  • Ballan, L., Uricchio, T., Seidenari, L., & Bimbo, A. D. (2014). A cross-media model for automatic image annotation. In Proceedings of the ICMR.

  • Carneiro, G., Chan, A. B., Moreno, P. J., & Vasconcelos, N. (2007). Supervised learning of semantic classes for image annotation and retrieval. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(3), 394–410.

  • Chen, M., Zheng, A., & Weinberger, K. Q. (2013). Fast image tagging. In Proceedings of the ICML.

  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., et al. (2014). DeCAF: A deep convolutional activation feature for generic visual recognition. In Proceedings of the ICML.

  • Duygulu, P., Barnard, K., de Freitas, J. F., & Forsyth, D. A. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In Proceedings of the ECCV (pp. 97–112).

  • Feng, S. L., Manmatha, R., & Lavrenko, V. (2004). Multiple Bernoulli relevance models for image and video annotation. In Proceedings of the CVPR (pp. 1002–1009).

  • Fu, H., Zhang, Q., & Qiu, G. (2012). Random forest for image annotation. In Proceedings of the ECCV (pp. 86–99).

  • Grubinger, M. (2007). Analysis and evaluation of visual information systems performance. PhD thesis, Victoria University, Melbourne, Australia.

  • Guillaumin, M., Mensink, T., Verbeek, J. J., & Schmid, C. (2009). TagProp: Discriminative metric learning in nearest neighbor models for image auto-annotation. In Proceedings of the ICCV (pp. 309–316).

  • Gupta, A., Verma, Y., & Jawahar, C. V. (2012). Choosing linguistics over vision to describe images. In Proceedings of the AAAI.

  • Hardoon, D. R., Szedmak, S., & Shawe-Taylor, J. (2004). Canonical correlation analysis: An overview with application to learning methods. Neural Computation, 16(12), 2639–2664.

  • Hotelling, H. (1936). Relations between two sets of variates. Biometrika, 28, 321–377.

  • Huiskes, M. J., & Lew, M. S. (2008). The MIR Flickr retrieval evaluation. In MIR.

  • Jégou, H., Douze, M., Schmid, C., & Pérez, P. (2010). Aggregating local descriptors into a compact image representation. In Proceedings of the CVPR (pp. 3304–3311).

  • Jeon, J., Lavrenko, V., & Manmatha, R. (2003). Automatic image annotation and retrieval using cross-media relevance models. In Proceedings of the ACM SIGIR (pp. 119–126).

  • Jin, R., Wang, S., & Zhou, Z. H. (2009). Learning a distance metric from multi-instance multi-label data. In Proceedings of the CVPR (pp. 896–902).

  • Kalayeh, M. M., Idrees, H., & Shah, M. (2014). NMF-KNN: Image annotation using weighted multi-view non-negative matrix factorization. In Proceedings of the CVPR.

  • Lavrenko, V., Manmatha, R., & Jeon, J. (2003). A model for learning the semantics of pictures. In NIPS.

  • Li, X., Snoek, C. G. M., & Worring, M. (2009). Learning social tag relevance by neighbor voting. IEEE Transactions on Multimedia, 11(7), 1310–1322.

  • Liu, J., Li, M., Liu, Q., Lu, H., & Ma, S. (2009). Image annotation via graph learning. Pattern Recognition, 42(2), 218–228.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Makadia, A., Pavlovic, V., & Kumar, S. (2008). A new baseline for image annotation. In Proceedings of the ECCV (pp. 316–329).

  • Makadia, A., Pavlovic, V., & Kumar, S. (2010). Baselines for image annotation. International Journal of Computer Vision, 90(1), 88–105.

  • Metzler, D., & Manmatha, R. (2004). An inference network approach to image retrieval. In Proceedings of the CIVR (pp. 42–50).

  • Moran, S., & Lavrenko, V. (2011). Optimal tag sets for automatic image annotation. In Proceedings of the BMVC (pp. 1.1–1.11).

  • Moran, S., & Lavrenko, V. (2014). A sparse kernel relevance model for automatic image annotation. International Journal of Multimedia Information Retrieval, 3(4), 209–229.

  • Mori, Y., Takahashi, H., & Oka, R. (1999). Image-to-word transformation based on dividing and vector quantizing images with words. In MISRM’99 first international workshop on multimedia intelligent storage and retrieval management.

  • Murthy, V. N., Can, E. F., & Manmatha, R. (2014). A hybrid model for automatic image annotation. In Proceedings of the ICMR.

  • Nakayama, H. (2011). Linear distance metric learning for large-scale generic image recognition. PhD thesis, The University of Tokyo, Japan.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.

  • Perronnin, F., Sánchez, J., & Mensink, T. (2010). Improving the fisher kernel for large-scale image classification. In Proceedings of the ECCV (pp. 143–156).

  • Shalev-Shwartz, S., Singer, Y., & Srebro, N. (2007). Pegasos: Primal estimated sub-gradient solver for SVM. In Proceedings of the ICML (pp. 807–814).

  • van de Weijer, J., & Schmid, C. (2006). Coloring local feature extraction. In Proceedings of the ECCV (pp. 334–348).

  • Verbeek, J., Guillaumin, M., Mensink, T., & Schmid, C. (2010). Image Annotation with TagProp on the MIRFLICKR set. In MIR.

  • Verma, Y., & Jawahar, C. V. (2012). Image annotation using metric learning in semantic neighbourhoods. In Proceedings of the ECCV (pp. 836–849).

  • Verma, Y., & Jawahar, C. V. (2013). Exploring SVM for image annotation in presence of confusing labels. In Proceedings of the BMVC.

  • von Ahn, L., & Dabbish, L. (2004). Labeling images with a computer game. In SIGCHI conference on human factors in computing systems (pp. 319–326).

  • Wang, C., Blei, D., & Fei-Fei, L. (2009). Simultaneous image classification and annotation. In Proceedings of the CVPR.

  • Wang, H., Huang, H., & Ding, C. H. Q. (2011). Image annotation using bi-relational graph of images and semantic labels. In Proceedings of the CVPR (pp. 793–800).

  • Weinberger, K. Q., & Saul, L. K. (2009). Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10, 207–244.

  • Xiang, Y., Zhou, X., Chua, T. S., & Ngo, C. W. (2009). A revisit of generative model for automatic image annotation using Markov random fields. In Proceedings of the CVPR (pp. 1153–1160).

  • Yavlinsky, A., Schofield, E., & Rüger, S. (2005). Automated image annotation using global features and robust nonparametric density estimation. In Proceedings of the CIVR (pp. 507–517).

  • Zhang, S., Huang, J., Huang, Y., Yu, Y., Li, H., & Metaxas, D. N. (2010). Automatic image annotation using group sparsity. In Proceedings of the CVPR (pp. 3312–3319).

Acknowledgments

We thank Prof. Raghavan Manmatha for sharing the Corel-5K dataset, and the anonymous reviewers for their helpful comments. Yashaswi Verma is partially supported by Microsoft Research India PhD fellowship 2013.

Author information

Correspondence to Yashaswi Verma.

Additional information

Communicated by Shin’ichi Satoh.

Appendix

In our conference paper (Verma and Jawahar 2012), we proposed to learn a combined distance metric in both the distance and feature spaces, denoted by \(\mathbf {w}\) and \(\mathbf {v}\) respectively. While the dimensionality of \(\mathbf {w}\) equals the number of distinct features used to represent a sample, that of \(\mathbf {v}\) equals the combined dimensionality of all the individual features. As a result, learning \(\mathbf {v}\) from limited examples became difficult as the number of features increased, and its computational cost was also high. Hence, in this paper we consider only the \(\mathbf {w}\) metric. Empirically, we found that the performance obtained using only the \(\mathbf {w}\) metric is usually comparable to that obtained using both \(\mathbf {w}\) and \(\mathbf {v}\). For example, in Table 10 we compare their performance using three feature combinations on the Corel-5K dataset. Although we compromise slightly on performance, the difference is not significant; more importantly, the training time improves by several orders of magnitude.

Table 10 Performance comparison in terms of F1, N+ and training time (in hours) between our previous metric learning approach (Verma and Jawahar 2012), which learns a combined metric in the distance and feature spaces (2PKNN + ML (w + v)), and our current approach, which learns a metric only in the distance space (2PKNN + ML (w)), on the Corel-5K dataset
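For concreteness, here is a minimal Python sketch of how a \(\mathbf {w}\)-style metric combines per-feature base distances. The feature list, the Euclidean base distance and the function name are illustrative assumptions, not the paper's implementation; in the paper the weights are learned in the large-margin set-up described in the main text.

```python
import numpy as np

def w_combined_distance(x_feats, y_feats, w):
    """Combine per-feature base distances with one weight per feature type.

    x_feats, y_feats: lists of feature vectors, one per feature type
    (e.g. colour histogram, bag-of-words, CNN activations).
    w: non-negative weights with len(w) equal to the number of feature
    types, matching the dimensionality of the w metric discussed above.
    (A v-style metric would instead carry one weight per dimension of
    the concatenated features, which is far larger.)
    """
    base = np.array([np.linalg.norm(xf - yf)
                     for xf, yf in zip(x_feats, y_feats)])
    return float(np.dot(w, base))
```

The contrast in dimensionality between w and v is what drives the difference in training time reported in Table 10: far fewer parameters need to be estimated from the same limited training data.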

Cite this article

Verma, Y., Jawahar, C.V. Image Annotation by Propagating Labels from Semantic Neighbourhoods. Int J Comput Vis 121, 126–148 (2017). https://doi.org/10.1007/s11263-016-0927-0
