
SUN Database: Exploring a Large Collection of Scene Categories

Published in: International Journal of Computer Vision

Abstract

Progress in scene understanding requires reasoning about the rich and diverse visual environments that make up our daily experience. To this end, we propose the Scene UNderstanding (SUN) database, a nearly exhaustive collection of scenes categorized at the same level of specificity as human discourse. The database contains 908 distinct scene categories and 131,072 images. Given this data, with both scene and object labels available, we perform an in-depth analysis of co-occurrence statistics and the contextual relationships between scenes and objects. To better understand this large-scale taxonomy of scene categories, we perform two human experiments: we quantify human scene recognition accuracy, and we measure how typical each image is of its assigned scene category. Next, we perform computational experiments: scene recognition with global image features, indoor versus outdoor classification, and “scene detection,” in which we relax the assumption that one image depicts only one scene category. Finally, we relate the human experiments to machine performance, exploring the relationship between human and machine recognition errors and the relationship between image “typicality” and machine recognition accuracy.
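To make "scene recognition with global image features" concrete: each image is summarized by a single feature vector (e.g. GIST or a HOG-based descriptor), and a classifier is trained per scene category. The sketch below is a toy stand-in, not the paper's pipeline; it uses synthetic features and a nearest-class-mean rule in place of the trained one-vs-all SVMs, and the category names are purely illustrative.

```python
import numpy as np

def nearest_mean_classify(train_feats, train_labels, test_feats):
    """Toy scene classifier: represent each category by the mean of its
    training feature vectors (a stand-in for a trained one-vs-all SVM)
    and assign each test image to the nearest category mean."""
    classes = sorted(set(train_labels))
    labels = np.array(train_labels)
    means = np.stack([train_feats[labels == c].mean(axis=0) for c in classes])
    # Squared Euclidean distance from each test feature to each category mean
    d = ((test_feats[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
    return [classes[i] for i in d.argmin(axis=1)]

# Synthetic "global image features" for two well-separated categories
rng = np.random.default_rng(0)
beach = rng.normal(0.0, 0.1, size=(20, 8))
forest = rng.normal(1.0, 0.1, size=(20, 8))
X = np.vstack([beach, forest])
y = ["beach"] * 20 + ["forest"] * 20
test = np.vstack([rng.normal(0.0, 0.1, (5, 8)), rng.normal(1.0, 0.1, (5, 8))])
pred = nearest_mean_classify(X, y, test)
print(pred)
```

With features this well separated, the toy classifier labels the first five test images "beach" and the last five "forest"; the paper's experiments, of course, operate on real descriptors over hundreds of categories.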

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.



Notes

  1. All the images and scene definitions are available at sundatabase.mit.edu or sun.cs.princeton.edu.

  2. This difference also explains why our count is much higher than Biederman's (1987) estimate of about 1,000 basic-level objects: we included all object words in our count, not just basic-level terms.

  3. The number of images continues to grow, as we periodically run scripts to query additional images.

  4. Category size ranged from 22 images in the smallest category to 2,360 in the largest. A total of 124,901 images were used in the experiment.

  5. All workers were located in the United States and had a good performance record with the service (at least 100 HITs completed with an acceptance rate of 95% or better). Workers were paid $0.03 per trial.

  6. Note that we use color for dense SIFT computation and train the feature codebook on the SUN database, which contains only color images. The 15-scene dataset from Lazebnik et al. (2006) contains several categories of grayscale images, which lack color information. Therefore, the result of our color-based dense SIFT on the 15-scene database (see Fig. 19a) is much worse than that reported in Lazebnik et al. (2006).

  7. \(F(3,19846) = 278, p < .001\).

  8. Due to the difficulty of the one-versus-all classification task, confidence was low across all classifications, and even correctly classified images had average confidence scores below zero.

  9. A \(4 \times 2\) ANOVA gives significant main effects of image typicality (\(F(3,19842) = 79.8, p < .001\)) and correct vs. incorrect classification (\(F(1,19842) = 6006, p < .001\)) and a significant interaction between these factors (\(F(3,19842) = 43.5, p < .001\)).
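The dense SIFT pipeline in Note 6 ends in a bag-of-visual-words step: each local descriptor is assigned to its nearest codeword, and the image becomes a normalized histogram of codeword counts. A minimal sketch with a hypothetical 2-D descriptor space and a hand-picked codebook (in the paper, descriptors are 128-D SIFT per color channel and the codebook is learned from SUN images):

```python
import numpy as np

def bow_histogram(descriptors, codebook):
    """Quantize local descriptors (one row per image patch) against a
    visual-word codebook and return an L1-normalized bag-of-words histogram."""
    # Squared Euclidean distance from every descriptor to every codeword
    d = ((descriptors[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    words = d.argmin(axis=1)  # index of the nearest codeword per patch
    hist = np.bincount(words, minlength=len(codebook)).astype(float)
    return hist / hist.sum()

# Toy example: 3 codewords in a 2-D descriptor space
codebook = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
desc = np.array([[0.1, 0.0], [0.9, 0.1], [0.0, 0.9], [0.05, 0.05]])
# Codeword counts are [2, 1, 1], i.e. histogram [0.5, 0.25, 0.25]
print(bow_histogram(desc, codebook))
```

Spatial pyramid variants (Lazebnik et al. 2006) concatenate such histograms over image subregions rather than pooling over the whole image.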
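The F values reported in Notes 7–9 come from ANOVA; the one-way case behind, e.g., Note 7's \(F(3,19846)\) is the ratio of the between-group to the within-group mean square. A minimal sketch on small synthetic groups (not the paper's typicality data):

```python
import numpy as np

def one_way_anova_F(groups):
    """One-way ANOVA F statistic: between-group mean square over
    within-group mean square, with (k - 1, n - k) degrees of freedom."""
    data = np.concatenate(groups)
    grand_mean = data.mean()
    k, n = len(groups), len(data)
    ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    df_b, df_w = k - 1, n - k
    return (ss_between / df_b) / (ss_within / df_w), df_b, df_w

# Three synthetic "typicality bins" of scores
groups = [np.array([1.0, 2.0, 3.0]), np.array([2.0, 3.0, 4.0]),
          np.array([4.0, 5.0, 6.0])]
F, df_b, df_w = one_way_anova_F(groups)
print(F, df_b, df_w)  # → 7.0 2 6
```

The \(4 \times 2\) design in Note 9 additionally partitions the between-group sum of squares into two main effects and an interaction term, but the mean-square ratio per effect is formed the same way.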

References

  • Ahonen, T., Matas, J., He, C., & Pietikäinen, M. (2009). Rotation invariant image description with local binary pattern histogram Fourier features. In Scandinavian Conference on Image Analysis.

  • Arbelaez, P., Fowlkes, C., & Martin, D. (2007). The Berkeley segmentation dataset and benchmark. Retrieved from http://www.eecs.berkeley.edu/Research/Projects/CS/vision/bsds.

  • Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2011). Contour detection and hierarchical image segmentation. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 33(5), 898–916.

  • Barnard, K., Duygulu, P., Forsyth, D., De Freitas, N., Blei, D. M., & Jordan, M. I. (2003). Matching words and pictures. The Journal of Machine Learning Research, 3, 1107–1135.

  • Barriuso, A., & Torralba, A. (2012). Notes on image annotation. Retrieved from arXiv:1210.3448.

  • Biederman, I. (1987). Recognition-by-components: a theory of human image understanding. Psychological Review, 94(2), 115.

  • Boutell, M. R., Luo, J., Shen, X., & Brown, C. M. (2004). Learning multi-label scene classification. Pattern recognition, 37(9), 1757–1771.

  • Bunge, J., & Fitzpatrick, M. (1993). Estimating the number of species: A review. Journal of the American Statistical Association, 88(421), 364–373.

  • Dalal, N., & Triggs, B. (2005, June). Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (Vol. 1, pp. 886-893). IEEE.

  • Deng, J., Dong, W., Socher, R., Li, L. J., Li, K., & Fei-Fei, L. (2009, June). Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (pp. 248-255). IEEE.

  • Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., & Darrell, T. (2013). Decaf: A deep convolutional activation feature for generic visual recognition. Retrieved from arXiv:1310.1531.

  • Ehinger, K. A., Xiao, J., Torralba, A., & Oliva, A. (2011). Estimating scene typicality from human ratings and image features. In Proceedings of the Annual Meeting of the Cognitive Science Society.

  • Epstein, R., & Kanwisher, N. (1998). A cortical representation of the local visual environment. Nature, 392(6676), 598–601.

  • Everingham, M., Van Gool, L., Williams, C. K., Winn, J., & Zisserman, A. (2010). The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples. In Computer Vision and Pattern Recognition Workshop on Generative-Model Based Vision.

  • Fei-Fei, L., & Perona, P. (2005). A bayesian hierarchical model for learning natural scene categories. In Computer Vision and Pattern Recognition, 2005. CVPR 2005. IEEE Computer Society Conference on (Vol. 2, pp. 524–531). IEEE.

  • Fellbaum, C. (1998). Wordnet: An electronic lexical database. Bradford: Bradford Books.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 32(9), 1627–1645.

  • Griffin, G., Holub, A., & Perona, P. (2007). Caltech-256 object category dataset. Technical report.

  • Hays, J., & Efros, A. A. (2008). IM2GPS: estimating geographic information from a single image. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.

  • Hoiem, D., Efros, A. A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision, 75(1), 151–172.

  • Jolicoeur, P., Gluck, M. A., & Kosslyn, S. M. (1984). Pictures and names: Making the connection. Cognitive Psychology, 16(2), 243–275.

  • Kosecka, J., & Zhang, W. (2002). Video compass. In Computer Vision-ECCV 2002 (pp. 476–490). Berlin: Springer.

  • Lalonde, J. F., Hoiem, D., Efros, A. A., Rother, C., Winn, J., & Criminisi, A. (2007). Photo clip art. In SIGGRAPH.

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on (Vol. 2, pp. 2169–2178). IEEE.

  • Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In Computer Vision, 2001. ICCV 2001. Proceedings. Eighth IEEE International Conference on (Vol. 2, pp. 416-423). IEEE.

  • Matas, J., Chum, O., Urban, M., & Pajdla, T. (2004). Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing, 22(10), 761–767.

  • Ojala, T., Pietikainen, M., & Maenpaa, T. (2002). Multiresolution gray-scale and rotation invariant texture classification with local binary patterns. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 24(7), 971–987.

  • Oliva, A., & Torralba, A. (2001). Modeling the shape of the scene: A holistic representation of the spatial envelope. International Journal of Computer Vision, 42(3), 145–175.

  • Philbin, J., Chum, O., Isard, M., Sivic, J., & Zisserman, A. (2008, June). Lost in quantization: Improving particular object retrieval in large scale image databases. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (pp. 1–8). IEEE.

  • Sharif Razavian, A., Azizpour, H., Sullivan, J., & Carlsson, S. (2014). CNN features off-the-shelf: An astounding baseline for recognition. Retrieved from arXiv:1403.6382.

  • Renninger, L. W., & Malik, J. (2004). When is scene identification just texture recognition? Vision Research, 44(19), 2301–2311.

  • Rosch, E. H. (1973). Natural categories. Cognitive Psychology, 4(3), 328–350.

  • Rosch, E., Mervis, C. B., Gray, W. D., Johnson, D. M., & Boyes-Braem, P. (1976). Basic objects in natural categories. Cognitive Psychology, 8(3), 382–439.

  • Russakovsky, O., Deng, J., Huang, Z., Berg, A. C., & Fei-Fei, L. (2013, December). Detecting avocados to zucchinis: what have we done, and where are we going? In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 2064–2071). IEEE.

  • Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: a database and web-based tool for image annotation. International Journal of Computer Vision, 77(1–3), 157–173.

  • Sadeghi, M. A., & Farhadi, A. (2011, June). Recognition using visual phrases. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on (pp. 1745–1752). IEEE.

  • Sanchez, J., Perronnin, F., Mensink, T., & Verbeek, J. (2013). Image classification with the Fisher vector: Theory and practice. International Journal of Computer Vision, 105(3), 222–245.

  • Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., & LeCun, Y. (2013). Overfeat: Integrated recognition, localization and detection using convolutional networks. Retrieved from arXiv:1312.6229.

  • Shechtman, E., & Irani, M. (2007, June). Matching local self-similarities across images and videos. In Computer Vision and Pattern Recognition, 2007. CVPR’07. IEEE Conference on (pp. 1–8). IEEE.

  • Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). Textonboost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV (pp. 1–15). Berlin: Springer.

  • Sivic, J., & Zisserman, A. (2004, June). Video data mining using configurations of viewpoint invariant regions. In Computer Vision and Pattern Recognition, 2004. CVPR 2004. Proceedings of the 2004 IEEE Computer Society Conference on (Vol. 1, pp. I–488). IEEE.

  • Song, S., & Xiao, J. (2014). Sliding Shapes for 3D object detection in RGB-D images. In European Conference on Computer Vision.

  • Spain, M., & Perona, P. (2008). Some objects are more equal than others: measuring and predicting importance. In: European Conference on Computer Vision.

  • Torralba, A., Fergus, R., & Freeman, W. T. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 30(11), 1958–1970.

  • Torralba, A., Murphy, K. P., Freeman, W. T., & Rubin, M. A. (2003, October). Context-based vision system for place and object recognition. In Computer Vision, 2003. Proceedings. Ninth IEEE International Conference on (pp. 273–280). IEEE.

  • Tversky, B., & Hemenway, K. (1983). Categories of environmental scenes. Cognitive Psychology, 15(1), 121–149.

  • Vedaldi, A., & Fulkerson, B. (2010, October). An open and portable library of computer vision algorithms: VLFeat. In Proceedings of the international conference on Multimedia (pp. 1469–1472). ACM.

  • Vogel, J., & Schiele, B. (2004). A semantic typicality measure for natural scene categorization. In Pattern Recognition (pp. 195–203). Berlin: Springer.

  • Vogel, J., & Schiele, B. (2007). Semantic modeling of natural scenes for content-based image retrieval. International Journal of Computer Vision, 72(2), 133–157.

  • Xiao, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2012, June). Recognizing scene viewpoint using panoramic place representation. In Computer Vision and Pattern Recognition (CVPR), 2012 IEEE Conference on (pp. 2695–2702). IEEE.

  • Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010, June). Sun database: Large-scale scene recognition from abbey to zoo. In Computer vision and pattern recognition (CVPR), 2010 IEEE conference on (pp. 3485–3492). IEEE.

  • Xiao, J., Owens, A., & Torralba, A. (2013, December). SUN3D: A database of big spaces reconstructed using sfm and object labels. In Computer Vision (ICCV), 2013 IEEE International Conference on (pp. 1625–1632). IEEE.

  • Zhang, Y., Song, S., Tan, P., & Xiao, J. (2014). PanoContext: A whole-room 3D context model for panoramic scene understanding. In European Conference on Computer Vision.


Acknowledgments

We thank Yinda Zhang for help with the scene classification experiments. This work was funded by a Google Research Award to J.X., NSF Grant 1016862 to A.O., NSF CAREER Award 0747120 to A.T., and NSF CAREER Award 1149853 to J.H., as well as by ONR MURI N000141010933, Foxconn, and gifts from Microsoft and Google. K.A.E. was funded by an NSF Graduate Research Fellowship.

Author information


Correspondence to Jianxiong Xiao.

Additional information

Communicated by M. Hebert.


About this article

Cite this article

Xiao, J., Ehinger, K.A., Hays, J. et al. SUN Database: Exploring a Large Collection of Scene Categories. Int J Comput Vis 119, 3–22 (2016). https://doi.org/10.1007/s11263-014-0748-y

