Abstract
Entry-level categories—the labels people use to name an object—were originally defined and studied by psychologists in the 1970s and 1980s. In this paper we extend these ideas to study entry-level categories at a larger scale and to learn models that can automatically predict entry-level categories for images. Our models combine visual recognition predictions with linguistic resources like WordNet and proxies for word “naturalness” mined from the enormous amount of text on the web. We demonstrate the usefulness of our models for predicting nouns (entry-level words) associated with images by people, and for learning mappings between concepts predicted by existing visual recognition systems and entry-level concepts. In this work we make use of recent successful efforts on convolutional network models for visual recognition by training classifiers for 7404 object categories on ConvNet activation features. Results for category mapping and entry-level category prediction for images show promise for producing more natural human-like labels. We also demonstrate the potential applicability of our results to the task of image description generation.
Notes
This function might bias decisions toward internal nodes. Other alternatives could be explored to estimate internal node scores.
Acknowledgments
This work was supported by NSF Career Award #1444234 and NSF Award #1445409.
Additional information
Communicated by Phil Torr, Steve Seitz, Yi Ma, and Kiriakos Kutulakos.
About this article
Cite this article
Ordonez, V., Liu, W., Deng, J. et al. Predicting Entry-Level Categories. Int J Comput Vis 115, 29–43 (2015). https://doi.org/10.1007/s11263-015-0815-z
DOI: https://doi.org/10.1007/s11263-015-0815-z