Predicting Entry-Level Categories

Ordonez, Vicente; Liu, Wei; Deng, Jia; Choi, Yejin; Berg, Alexander C.; Berg, Tamara L.

doi:10.1007/s11263-015-0815-z

Predicting Entry-Level Categories

Published: 08 April 2015

Volume 115, pages 29–43, (2015)
Cite this article

International Journal of Computer Vision Aims and scope Submit manuscript

Vicente Ordonez¹,
Wei Liu¹,
Jia Deng²,
Yejin Choi³,
Alexander C. Berg¹ &
…
Tamara L. Berg¹

700 Accesses
15 Citations
Explore all metrics

Abstract

Entry-level categories—the labels people use to name an object—were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources like WordNet and proxies for word “naturalness” mined from the enormous amount of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people, and for learning mappings between concepts predicted by existing visual recognition systems and entry-level concepts. In this work we make use of recent successful efforts on convolutional network models for visual recognition by training classifiers for 7404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction for images show promise for producing more natural human-like labels. We also demonstrate the potential applicability of our results to the task of image description generation.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Notes

This function might bias decisions toward internal nodes. Other alternatives could be explored to estimate internal node scores.

References

Barnard, K., & Yanai, K. (2006). Mutual information of words and pictures. In Information Theory and Applications.
Bird, S. (2006). Nltk: The natural language toolkit. In COLING/ACL.
Brants, T., & Franz, A. (2006). Web 1t 5-gram version 1. In Linguistic Data Consortium.
Chen, X., Shrivastava, A., & Gupta, A. (2013). Extracting visual knowledge from Web Data: NEIL. In ICCV.
Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., & Yagnik, J. (2013). Fast, accurate detection of 100,000 object classes on a single machine. In CVPR.
Deng, J., Dong, W., Socher, R., Li, L-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A large-scale hierarchical image database. In CVPR.
Deng, J., Berg, A. C., Li, K., & Li, F-F. (2010). What does classifying more than 10,000 image categories tell us? In ECCV.
Deng, J., Krause, J., Berg, A. C., & Fei-Fei, L. (2012). Hedging your bets: Optimizing accuracy-specificity trade-offs in large scale visual recognition. In CVPR.
Divvala, S., Farhadi, A., & Guestrin, C. (2014). Learning everything about anything: Webly-supervised visual concept learning. In CVPR.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2013). Decaf: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88, 303–338.
Article Google Scholar
Farhadi, A., Hejrati, M., Sadeghi, M. A., Young, P., Rashtchian, C., Hockenmaier, J., & Forsyth, D. (2010). Every picture tells a story: Generating sentences for images. In ECCV.
Fei-Fei, L., Fergus, R., & Perona, P. (2007). Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding, 106, 59–70.
Article Google Scholar
Fellbaum, C. (1998). WordNet: An electronic lexical database. Cambridge: MIT Press.
MATH Google Scholar
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32, 1627–1645.
Article Google Scholar
Feng, S., Ravi, S., Kumar, R., Kuznetsova, P., Liu, W., Berg, A. C, Berg, T. L., & Choi, Y. (2015). Refer-to-as relations as semantic knowledge. In AAAI.
Gupta, A., Verma, Y., & Jawahar, C. V. (2012). Choosing linguistics over vision to describe images. In AAAI.
Hodosh, M., Young, P., & Hockenmaier, J. (2013). Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47(1), 853–899.
Jia, Y. (2013). Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/
Jolicoeur, P., Gluck, M. A., & Kosslyn, S. M. (1984). Pictures and names: Making the connection. Cognitive Psychology, 16, 243–275.
Article Google Scholar
Krizhevsky, A., Sutskever, I., & Hinton, G. (2012). Imagenet classification with deep convolutional neural networks. In NIPS.
Kulkarni, G., Premraj, V., Ordonez, V., Dhar, S., Li, S., Choi, Y., et al. (2013). Babytalk: Understanding and generating simple image descriptions. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 2891–2903.
Article Google Scholar
Kuznetsova, P., Ordonez, V., Berg, A., Berg, T. L., & Choi, Y. (2012). Collective generation of natural image descriptions. In ACL.
Kuznetsova, P., Ordonez, V., Berg, T., & Choi, Y. (2014). Treetalk: Composition and compression of trees for image descriptions. Transactions of the Association for Computational Linguistics, 2, 351–362.
Google Scholar
Le, Q. V., Monga, R., Devin, M., Chen, K., Corrado, G. S, Dean, J., & Ng, A. Y. (2012). Building high-level features using large scale unsupervised learning. In ICML.
Mason, R., & Charniak, E. (2014). Nonparametric method for data-driven image captioning. In ACL.
Mitchell, M., Han, X., Dodge, J., Mensch, A., Goyal, A., Berg, A.,&DauméIII, H. (2012). Midge: Generating image descriptions from computer vision detections. In EACL.
Ordonez, V., Kulkarni, G., & Berg, T. L. (2011). Im2text: Describing images using 1 million captioned photographs. In NIPS.
Ordonez, V., Deng, J., Choi, Y., Berg, A. C., & Berg, T. L. (2013). From large scale image categorization to entry-level categories. In ICCV.
Perronnin, F., Akata, Z., Harchaoui, Z., & Schmid, C. (2012). Towards good practice in large-scale learning for image classification. In CVPR.
Platt, J. C. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. In Advances in large margin classifiers.
Ramnath, K., Baker, S., Vanderwende, L., El-Saban, M., Sinha, S.N., Kannan, A., Hassan, N., Galley, M., Yang, Yi, Ramanan, D., Bergamo, A., & Torresani, L. (2014). Autocaption: Automatic caption generation for personal photos. In WACV.
Rosch, E. (1978). Principles of categorization. In E. Rosch & B. B. Lloyd (Eds.), Cognition and categorization (pp. 27–48). Hillsdale: Erlbaum.
Google Scholar
Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., & Fei-Fei, L. (2014). Imagenet large scale visual recognition challenge. arXiv:1409.0575.
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). Labelme: A database and web-based tool for image annotation. International Journal of Computer Vision, 77, 157–173.
Article Google Scholar
Simonyan, K., & Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. ArXiv e-prints.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2014). Going deeper with convolutions. ArXiv e-prints.
Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: A large dataset for non-parametric object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30, 1958–1970.
Article Google Scholar
Xiao, J., Hays, J., Ehinger, K., Oliva, A., & Torralba, A. (2010). Sun database: Large scale scene recognition from abbey to zoo. In CVPR.
Yanai, K., & Barnard, K. (2005). Probabilistic web image gathering. In MIR. ACM.
Yang, Y., Teo, C. L., DauméIII, H., & Aloimonos, Y. (2011). Corpus-guided sentence generation of natural images. In EMNLP.

Download references

Acknowledgments

This work was supported by NSF Career Award #1444234 and NSF Award #1445409.

Author information

Authors and Affiliations

Department of Computer Science, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Vicente Ordonez, Wei Liu, Alexander C. Berg & Tamara L. Berg
Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, MI, USA
Jia Deng
Computer Science and Engineering, University of Washington, Seattle, WA, USA
Yejin Choi

Authors

Vicente Ordonez
View author publications
You can also search for this author in PubMed Google Scholar
Wei Liu
View author publications
You can also search for this author in PubMed Google Scholar
Jia Deng
View author publications
You can also search for this author in PubMed Google Scholar
Yejin Choi
View author publications
You can also search for this author in PubMed Google Scholar
Alexander C. Berg
View author publications
You can also search for this author in PubMed Google Scholar
Tamara L. Berg
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Vicente Ordonez.

Additional information

Communicated by Phil Torr, Steve Seitz, Yi Ma, and Kiriakos Kutulakos.

Rights and permissions

Reprints and permissions

About this article

Cite this article

Ordonez, V., Liu, W., Deng, J. et al. Predicting Entry-Level Categories. Int J Comput Vis 115, 29–43 (2015). https://doi.org/10.1007/s11263-015-0815-z

Download citation

Received: 27 October 2014
Accepted: 12 March 2015
Published: 08 April 2015
Issue Date: October 2015
DOI: https://doi.org/10.1007/s11263-015-0815-z

Keywords

Access this article

Log in via an institution

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Institutional subscriptions

Predicting Entry-Level Categories

Abstract

Access this article

Similar content being viewed by others

Low Dimensional Visual Attributes: An Interpretable Image Encoding

Generic decoding of seen and imagined objects using hierarchical visual features

Deep Residual Network Predicts Cortical Representation and Organization of Visual Features for Rapid Categorization

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Keywords

Navigation

Predicting Entry-Level Categories

Abstract

Access this article

Similar content being viewed by others

Low Dimensional Visual Attributes: An Interpretable Image Encoding

Generic decoding of seen and imagined objects using hierarchical visual features

Deep Residual Network Predicts Cortical Representation and Organization of Visual Features for Rapid Categorization

Notes

References

Acknowledgments

Author information

Authors and Affiliations

Corresponding author

Additional information

Rights and permissions

About this article

Cite this article

Share this article

Keywords

Search

Navigation