Abstract
This paper describes a system for interpreting a scene by assigning a semantic label to every pixel and inferring the spatial extent of individual object instances together with their occlusion relationships. First, we present a method for labeling each pixel aimed at achieving broad coverage across hundreds of object categories, many of them sparsely sampled. This method combines region-level features with per-exemplar sliding window detectors. Unlike traditional bounding box detectors, per-exemplar detectors perform well on classes with little training data and high intra-class variation, and they allow object masks to be transferred into the test image for pixel-level segmentation. Next, we use per-exemplar detections to generate a set of candidate object masks for a given test image. We then select a subset of objects that explain the image well and have valid overlap relationships and occlusion ordering. This is done by minimizing an integer quadratic program, using either a greedy method or a standard solver. We alternate between using the object predictions to refine the pixel labels and using the pixel labels to improve the object predictions. The proposed system obtains promising results on two challenging subsets of the LabelMe dataset, the largest of which contains 45,676 images and 232 classes.
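To make the instance-selection step concrete, the sketch below shows a greedy minimizer for a small integer quadratic objective over binary mask-selection variables. It is only an illustration: the unary and pairwise cost arrays are hypothetical placeholders standing in for the paper's actual data, overlap, and occlusion-ordering terms (Sect. 4), and the coordinate-flip loop is a generic heuristic rather than the authors' implementation.

```python
import numpy as np

def greedy_select(unary, pairwise, max_iters=1000):
    # Greedily minimize f(x) = unary @ x + x @ pairwise @ x over x in {0,1}^n,
    # where x[i] = 1 means candidate object mask i is kept.
    #   unary[i]       : cost of selecting mask i (negative = beneficial)
    #   pairwise[i, j] : penalty for keeping masks i and j together
    #                    (e.g. an inconsistent overlap); assumed symmetric.
    n = len(unary)
    x = np.zeros(n)

    def f(x):
        return unary @ x + x @ pairwise @ x

    current = f(x)
    for _ in range(max_iters):
        # Evaluate the change in the objective from flipping each variable.
        deltas = []
        for i in range(n):
            x[i] = 1 - x[i]
            deltas.append(f(x) - current)
            x[i] = 1 - x[i]
        best = int(np.argmin(deltas))
        if deltas[best] >= 0:          # no flip improves the objective: stop
            break
        x[best] = 1 - x[best]
        current += deltas[best]
    return x

# Toy usage: three candidate masks; the first two overlap inconsistently,
# so only one of them should survive together with the third mask.
unary = np.array([-2.0, -1.5, -1.0])
pairwise = np.zeros((3, 3))
pairwise[0, 1] = pairwise[1, 0] = 3.0
print(greedy_select(unary, pairwise))   # -> [1. 0. 1.]
```

A standard solver such as CPLEX (IBM 2013) would instead optimize the same binary objective exactly, at higher computational cost.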
Notes
Technically, headlights are attached to cars, but we do not make a distinction between attachment and occlusion in this work.
Fig. 2 Overview of our instance inference approach (see Sect. 4 for details). We use our region- and detector-based image parser (Fig. 1) to generate semantic labels for each pixel (a) and a set of candidate object masks (not shown). Next, we select a subset of these masks to cover the image (b). We alternate between refining the pixel labels and the object predictions until we obtain the final pixel labeling (c) and object predictions (d). In this image, our initial pixel labeling contains several “car” blobs, some of them representing multiple cars, but the object predictions separate these blobs into individual car instances. We also infer an occlusion ordering (e), which places the road behind the cars and puts the three nearly overlapping cars on the left side in the correct depth order. Note that our instance-level inference formulation does not require the image to be completely covered. Thus, while our pixel labeling erroneously infers two large “building” areas in the mid-section of the image, these labels are not confident enough, so no corresponding “building” object instances get selected
To determine depth ordering from polygon annotations, we use the LMsortlayers function from the LabelMe toolbox, which takes a collection of polygons and returns their depth ordering.
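The LabelMe toolbox is MATLAB code, and the actual LMsortlayers routine reasons over the polygon geometry itself (cf. Russell and Torralba 2009). Purely as an illustration of the idea, the Python sketch below orders overlapping polygons with a naive ground-contact heuristic (the object whose bottom edge is lower in the image is assumed to be closer to the camera); the function and its heuristic are a stand-in, not the toolbox implementation.

```python
from shapely.geometry import Polygon

def toy_depth_order(polygons):
    # polygons: list of vertex lists [(x, y), ...] in image coordinates
    # (y grows downward). Returns (front-to-back index order, overlapping pairs).
    shapes = [Polygon(p) for p in polygons]
    bottoms = [max(y for _, y in p) for p in polygons]

    # Ground-contact heuristic: a lower bottom edge means the object is
    # assumed closer to the camera, hence in front of anything it overlaps.
    order = sorted(range(len(polygons)), key=lambda i: -bottoms[i])

    # Occlusion ordering only matters for polygons that actually overlap.
    overlapping = [(i, j) for i in range(len(shapes))
                   for j in range(i + 1, len(shapes))
                   if shapes[i].intersects(shapes[j])]
    return order, overlapping

# Toy usage: two partially overlapping cars; the nearer one reaches lower.
car_a = [(10, 40), (60, 40), (60, 95), (10, 95)]
car_b = [(40, 35), (90, 35), (90, 80), (40, 80)]
print(toy_depth_order([car_a, car_b]))   # -> ([0, 1], [(0, 1)])
```

Note that this naive rule breaks down for ground regions such as the road, which is exactly the kind of case the occlusion ordering in Fig. 2e handles correctly (the road is placed behind the cars).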
References
Adelson, E. H. (2001). On seeing stuff: The perception of materials by humans and machines. In Human Vision and Electronic Imaging, pp. 1–12.
Boykov, Y., & Kolmogorov, V. (2003). Computing geodesics and minimal surfaces via graph cuts. In International Conference on Computer Vision (ICCV), Nice, France.
Boykov, Y., & Kolmogorov, V. (2004). An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision. PAMI, 26(9), 1124–1137.
Brostow, G. J., Shotton, J., Fauqueur, J., & Cipolla, R. (2008). Segmentation and recognition using structure from motion point clouds. In European Conference on Computer Vision (ECCV), Marseille, France.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), San Diego, CA.
Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., & Yagnik, J. (2013). Fast, accurate detection of 100,000 object classes on a single machine. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR.
Eigen, D., & Fergus, R. (2012). Nonparametric image parsing using adaptive neighbor sets. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2008). The PASCAL visual object classes challenge 2008 (VOC2008) results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html
Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2012). Scene parsing with multiscale feature learning, purity trees, and optimal covers. In International Conference on Machine Learning (ICML), Edinburgh, Scotland.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. PAMI, 32(9), 1627–1645.
Floros, G., Rematas, K., & Leibe, B. (2011). Multi-class image labeling with top-down segmentation and generalized robust \({P}^{N}\) potentials. In Proceedings of the British Machine Vision Conference (BMVC), Dundee, UK.
Gould, S., Fulton, R., & Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In International Conference on Computer Vision (ICCV), Kyoto, Japan.
Guo, R., & Hoiem, D. (2012). Beyond the line of sight: labeling the underlying surfaces. In European Conference on Computer Vision (ECCV), Amsterdam.
Hariharan, B., Malik, J., & Ramanan, D. (2012). Discriminative decorrelation for clustering and classification. In The European Conference on Computer Vision (ECCV), Amsterdam.
Heitz, G., & Koller, D. (2008). Learning spatial context: Using stuff to find things. In European Conference on Computer Vision (ECCV), Marseille, France, pp. 30–43.
IBM. (2013). CPLEX Optimizer. http://www.ibm.com/software/commerce/optimization/cplex-optimizer/.
Isola, P., & Liu, C. (2013). Scene collaging: Analysis and synthesis of natural images with semantic layers. In IEEE International Conference on Computer Vision (ICCV), Sydney, Australia.
Kim, B., Sun, M., Kohli, P., & Savarese, S. (2012). Relating things and stuff by high-order potential modeling. In ECCV Workshop on Higher-Order Models and Global Constraints in Computer Vision.
Kim, J., & Grauman, K. (2012). Shape sharing for object segmentation. In European Conference on Computer Vision (ECCV), Amsterdam.
Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? PAMI, 26(2), 147–159.
Krahenbuhl, P., & Koltun, V. (2011). Efficient inference in fully connected CRFs with Gaussian edge potentials. In Annual Conference on Neural Information Processing Systems (NIPS).
Ladický, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. H. (2010). What, where & how many? Combining object detectors and CRFs. In European Conference on Computer Vision (ECCV), Heraklion, Greece.
Liu, C., Yuen, J., & Torralba, A. (2011). Nonparametric scene parsing via label transfer. PAMI, 33(12), 2368–2382.
Malisiewicz, T., Gupta, A., & Efros, A. A. (2011). Ensemble of exemplar-SVMs for object detection and beyond. In 13th International Conference on Computer Vision (ICCV), Barcelona, Spain.
Myeong, H. J., Chang, Y., & Lee, K. M. (2012). Learning object relationships via graph-based context model. In Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.
Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In Annual Conference on Neural Information Processing Systems (NIPS), Vancouver.
Rother, C., Kolmogorov, V., & Blake, A. (2004). "GrabCut": Interactive foreground extraction using iterated graph cuts. In Special Interest Group on Computer Graphics and Interactive Techniques (SIGGRAPH), Los Angeles, CA.
Russell, B. C., & Torralba, A. (2009). Building a database of 3d scenes from user annotations. In IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), Miami, FL.
Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2008). LabelMe: A database and web-based tool for image annotation. IJCV, 77(1–3), 157–173.
Shotton, J., Winn, J. M., Rother, C., & Criminisi, A. (2009). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. IJCV, 81(1), 2–23.
Sturgess, P., Alahari, K., Ladický, L., & Torr, P. H. S. (2009). Combining appearance and structure from motion features for road scene understanding. In British Machine Vision Conference (BMVC), London, UK.
Tighe, J., & Lazebnik, S. (2013). Finding things: Image parsing with regions and per-exemplar detectors. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Portland, OR.
Tighe, J., & Lazebnik, S. (2013). SuperParsing: Scalable nonparametric image parsing with superpixels. IJCV, 101(2), 329–349.
Tighe, J., Niethammer, M., & Lazebnik, S. (2014). Scene parsing with object instances and occlusion ordering. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Columbus, OH.
Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. C. (2005). Image parsing: Unifying segmentation, detection, and recognition. IJCV, 63(2), 113–140.
Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., & Torralba, A. (2010). SUN database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), San Francisco, CA.
Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. (2012). Layered object models for image segmentation. PAMI, 34(9), 1731–1743.
Yao, J., Fidler, S., & Urtasun, R. (2012). Describing the scene as a whole: Joint object detection, scene classification and semantic segmentation. In Conference on Computer Vision and Pattern Recognition (CVPR), Providence, RI.
Zhang, C., Wang, L., & Yang, R. (2010). Semantic segmentation of urban scenes using dense depth maps. In European Conference on Computer Vision (ECCV), Heraklion, Greece.
Acknowledgments
This research was supported in part by NSF grants IIS 1228082 and CIF 1302438, the DARPA Computer Science Study Group (D12AP00305), a Microsoft Research Faculty Fellowship, the Sloan Foundation, and Xerox. We thank Arun Mallya for helping to adapt the LDA detector code of Hariharan et al. (2012).
Additional information
Communicated by Derek Hoiem, James Hays, Jianxiong Xiao, and Aditya Khosla.
Cite this article
Tighe, J., Niethammer, M. & Lazebnik, S. Scene Parsing with Object Instance Inference Using Regions and Per-exemplar Detectors. Int J Comput Vis 112, 150–171 (2015). https://doi.org/10.1007/s11263-014-0778-5