TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context

Abstract

This paper details a new approach for learning a discriminative model of object classes, incorporating texture, layout, and context information efficiently. The learned model is used for automatic visual understanding and semantic segmentation of photographs. Our discriminative model exploits texture-layout filters, novel features based on textons, which jointly model patterns of texture and their spatial layout. Unary classification and feature selection are achieved using shared boosting to give an efficient classifier that can be applied to a large number of classes. Accurate image segmentation is achieved by incorporating the unary classifier in a conditional random field, which (i) captures the spatial interactions between class labels of neighboring pixels, and (ii) improves the segmentation of specific object instances. Efficient training of the model on large datasets is achieved by exploiting both random feature selection and piecewise training methods.
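To make the texture-layout features concrete, the following is a minimal sketch (not the authors' implementation) of how a texture-layout filter response could be computed from a precomputed texton map, using one integral image per texton channel so that rectangle sums cost constant time. The function names, the rectangle-offset convention, and the assumption that textons come from k-means-quantized filter-bank responses are illustrative.

```python
import numpy as np

def texton_integral_images(texton_map, num_textons):
    """Build one integral image per texton channel.

    texton_map : (H, W) integer array of texton indices, assumed to be
    precomputed (e.g. filter-bank responses quantized with k-means).
    """
    h, w = texton_map.shape
    integrals = np.zeros((num_textons, h + 1, w + 1), dtype=np.int64)
    for t in range(num_textons):
        mask = (texton_map == t).astype(np.int64)
        # Standard padded integral image: cumulative sum over rows then columns.
        integrals[t, 1:, 1:] = mask.cumsum(axis=0).cumsum(axis=1)
    return integrals

def texture_layout_response(integrals, y, x, rect, texton):
    """Proportion of pixels carrying `texton` inside a rectangle offset
    from pixel (y, x); rect = (dy0, dx0, dy1, dx1), with dy0 < dy1, dx0 < dx1."""
    t_img = integrals[texton]
    h, w = t_img.shape[0] - 1, t_img.shape[1] - 1
    # Clip the offset rectangle to the image bounds.
    y0, y1 = np.clip(y + rect[0], 0, h), np.clip(y + rect[2], 0, h)
    x0, x1 = np.clip(x + rect[1], 0, w), np.clip(x + rect[3], 0, w)
    area = max((y1 - y0) * (x1 - x0), 1)
    count = t_img[y1, x1] - t_img[y0, x1] - t_img[y1, x0] + t_img[y0, x0]
    return count / area
```

In the paper, responses of this kind feed weak learners that are selected and shared across classes by joint boosting; the sketch above covers only the feature computation, not the boosting or CRF stages.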

High classification and segmentation accuracy is demonstrated on four varied databases: (i) the MSRC 21-class database containing photographs of real objects viewed under general lighting conditions, poses and viewpoints, (ii) the 7-class Corel subset and (iii) the 7-class Sowerby database used in He et al. (Proceedings of IEEE Conference on Computer Vision and Pattern Recognition, vol. 2, pp. 695–702, June 2004), and (iv) a set of video sequences of television shows. The proposed algorithm gives competitive and visually pleasing results for objects that are highly textured (grass, trees, etc.), highly structured (cars, faces, bicycles, airplanes, etc.), and even articulated (body, cow, etc.).

References

  • Amit, Y., Geman, D., & Wilder, K. (1997). Joint induction of shape features and tree classifiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19, 1300–1305.

  • Baluja, S., & Rowley, H. A. (2005). Boosting sex identification performance. In AAAI (pp. 1508–1513).

  • Beal, M. J. (2003). Variational algorithms for approximate Bayesian inference. Ph.D. thesis, Gatsby Computational Neuroscience Unit, University College London.

  • Beis, J. S., & Lowe, D. G. (1997). Shape indexing using approximate nearest-neighbour search in high-dimensional spaces. In Proceedings of IEEE conference on computer vision and pattern recognition (pp. 1000–1006). June 1997.

  • Belongie, S., Malik, J., & Puzicha, J. (2002). Shape matching and object recognition using shape contexts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(4), 509–522.

  • Berg, A. C., Berg, T. L., & Malik, J. (2005). Shape matching and object recognition using low distortion correspondences. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 26–33). June 2005.

  • Bishop, C. M. (2006). Pattern recognition and machine learning. New York: Springer.

  • Blake, A., Rother, C., Brown, M., Perez, P., & Torr, P. H. S. (2004). Interactive image segmentation using an adaptive GMMRF model. In T. Pajdla & J. Matas (Eds.), LNCS : Vol. 3021. Proceedings of European conference on computer vision (pp. 428–441). Prague, Czech Republic, May 2004. New York: Springer.

  • Borenstein, E., Sharon, E., & Ullman, S. (2004). Combining top-down and bottom-up segmentations. In IEEE workshop on perceptual organization in computer vision (Vol. 4, p. 46).

  • Boykov, Y., & Jolly, M.-P. (2001). Interactive graph cuts for optimal boundary and region segmentation of objects in N-D images. In Proceedings of international conference on computer vision (Vol. 1, pp. 105–112). Vancouver, Canada, July 2001.

  • Criminisi, A., Perez, P., & Toyama, K. (2004). Region filling and object removal by exemplar-based inpainting. IEEE Transactions on Image Processing, 13(9), 1200–1212.

  • Dempster, A., Laird, N., & Rubin, D. (1977). Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B, 39(1), 1–38.

  • Dollár, P., Tu, Z., & Belongie, S. (2006). Supervised learning of edges and object boundaries. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 1964–1971).

  • Duygulu, P., Barnard, K., de Freitas, N., & Forsyth, D. (2002). Object recognition as machine translation: Learning a lexicon for a fixed image vocabulary. In A. Heyden, G. Sparr, & P. Johansen (Eds.), LNCS : Vol. 2353. Proceedings of European conference on computer vision (pp. 97–112). May 2002. New York: Springer.

  • Elkan, C. (2003). Using the triangle inequality to accelerate k-means. In Proceedings of international conference on machine learning (pp. 147–153).

  • Fei-Fei, L., Fergus, R., & Perona, P. (2004). Learning generative visual models from few training examples: an incremental Bayesian approach tested on 101 object categories. In Proceedings of CVPR 2004, workshop on generative-model based vision.

  • Fergus, R., Perona, P., & Zisserman, A. (2003). Object class recognition by unsupervised scale-invariant learning. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 264–271). June 2003.

  • Friedman, J., Hastie, T., & Tibshirani, R. (2000). Additive logistic regression: a statistical view of boosting. Annals of Statistics, 28(2), 337–407.

  • He, X., Zemel, R. S., & Carreira-Perpiñán, M.Á. (2004). Multiscale conditional random fields for image labeling. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 695–702). June 2004.

  • He, X., Zemel, R. S., & Ray, D. (2006). Learning and incorporating top-down cues in image segmentation. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), LNCS : Vol. 3951. Proceeding of European conference on computer vision (pp. 338–351). May 2006. New York: Springer.

  • Johnson, M., Brostow, G., Shotton, J., Arandjelovic, O., Kwatra, V., & Cipolla, R. (2006). Semantic photo synthesis. Computer Graphics Forum, 25(3), 407–413.

  • Jones, D. G., & Malik, J. (1992). A computational framework for determining stereo correspondence from a set of linear spatial filters. In Proceedings of European conference on computer vision (pp. 395–410).

  • Julesz, B. (1981). Textons, the elements of texture perception, and their interactions. Nature, 290(5802), 91–97.

  • Kohli, P., & Torr, P. H. S. (2005). Efficiently solving dynamic Markov random fields using graph cuts. In Proceedings of international conference on computer vision (Vol. 2, pp. 922–929), Beijing, China, October 2005.

  • Kolmogorov, V., & Zabih, R. (2004). What energy functions can be minimized via graph cuts? IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2), 147–159.

  • Konishi, S., & Yuille, A. L. (2000). Statistical cues for domain specific image segmentation with performance analysis. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 125–132). June 2000.

  • Kumar, M. P., Torr, P. H. S., & Zisserman, A. (2005). OBJ CUT. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 18–25). June 2005.

  • Kumar, S., & Hebert, M. (2003). Discriminative random fields: A discriminative framework for contextual interaction in classification. In Proceedings of international conference on computer vision (Vol. 2, pp. 1150–1157). October 2003.

  • Lafferty, J., McCallum, A., & Pereira, F. (2001). Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of international conference on machine learning (pp. 282–289).

  • Lazebnik, S., Schmid, C., & Ponce, J. (2006). Beyond bags of features: spatial pyramid matching for recognizing natural scene categories. In Proceedings of IEEE conference on computer vision and pattern recognition.

  • Leibe, B., & Schiele, B. (2003). Interleaved object categorization and segmentation. In Proceedings of British machine vision conference (Vol. II, pp. 264–271).

  • Lepetit, V., Lagger, P., & Fua, P. (2005). Randomized trees for real-time keypoint recognition. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 2, pp. 775–781). June 2005.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Malik, J., Belongie, S., Leung, T., & Shi, J. (2001). Contour and texture analysis for image segmentation. International Journal of Computer Vision, 43(1), 7–27.

  • Marszałek, M., & Schmid, C. (2007). Semantic hierarchies for visual object recognition. In Proceedings of IEEE conference on computer vision and pattern recognition. June 2007.

  • Mikolajczyk, K., & Schmid, C. (2002). An affine invariant interest point detector. In A. Heyden, G. Sparr, & P. Johansen (Eds.), LNCS : Vol. 2350. Proceedings of European conference on computer vision (pp. 128–142). May 2002. New York: Springer.

  • Pearl, J. (1988). Probabilistic reasoning in intelligent systems. San Mateo: Morgan Kaufmann.

  • Porikli, F. M. (2005). Integral histogram: A fast way to extract histograms in cartesian spaces. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 829–836), June 2005.

  • Ren, X., Fowlkes, C., & Malik, J. (2006). Figure/ground assignment in natural images. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), Proceedings of European conference on computer vision (Vol. 2, pp. 614–627). Graz, Austria, May 2006. New York: Springer.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut—interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3), 309–314.

  • Rother, C., Bordeaux, L., Hamadi, Y., & Blake, A. (2006). AutoCollage. ACM Transactions on Graphics, 25(3), 847–852.

  • Russell, B. C., Torralba, A., Murphy, K. P., & Freeman, W. T. (2005). LabelMe: database and web-based tool for image annotation (Technical Report 25). MIT AI Lab, September 2005.

  • Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2006). TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In A. Leonardis, H. Bischof, & A. Pinz (Eds.), LNCS : Vol. 3951. Proceedings of European conference on computer vision (pp. 1–15). May 2006. New York: Springer.

  • Sutton, C., & McCallum, A. (2005). Piecewise training of undirected models. In Proceedings of conference on uncertainty in artificial intelligence.

  • Torralba, A., Murphy, K. P., & Freeman, W. T. (2007). Sharing visual features for multiclass and multiview object detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(5), 854–869.

  • Tu, Z., Chen, X., Yuille, A. L., & Zhu, S. C. (2003). Image parsing: unifying segmentation, detection, and recognition. In Proceedings of international conference on computer vision (Vol. 1, pp. 18–25). Nice, France, October 2003.

  • Varma, M., & Zisserman, A. (2005). A statistical approach to texture classification from single images. International Journal of Computer Vision, 62(1–2), 61–81.

  • Viola, P., & Jones, M. J. (2001). Rapid object detection using a boosted cascade of simple features. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 511–518). December 2001.

  • Winn, J., & Jojic, N. (2005). LOCUS: Learning object classes with unsupervised segmentation. In Proceedings of international conference on computer vision (Vol. 1, pp. 756–763). Beijing, China, October 2005.

  • Winn, J., & Shotton, J. (2006). The layout consistent random field for recognizing and segmenting partially occluded objects. In Proceedings of IEEE conference on computer vision and pattern recognition (Vol. 1, pp. 37–44). June 2006.

  • Winn, J., Criminisi, A., & Minka, T. (2005). Categorization by learned universal visual dictionary. In Proceedings of international conference on computer vision (Vol. 2, pp. 1800–1807). Beijing, China, October 2005.

  • Yedidia, J. S., Freeman, W. T., & Weiss, Y. (2003). Understanding belief propagation and its generalizations. In Exploring artificial intelligence in the new millennium. San Mateo: Morgan Kaufmann.

  • Yin, P., Criminisi, A., Winn, J., & Essa, I. (2007). Tree based classifiers for bilayer video segmentation. In Proceedings of IEEE conference on computer vision and pattern recognition.

Author information

Corresponding author

Correspondence to Jamie Shotton.

Additional information

J. Shotton is now working at Toshiba Corporate Research & Development Center, Kawasaki, Japan.

About this article

Cite this article

Shotton, J., Winn, J., Rother, C. et al. TextonBoost for Image Understanding: Multi-Class Object Recognition and Segmentation by Jointly Modeling Texture, Layout, and Context. Int J Comput Vis 81, 2–23 (2009). https://doi.org/10.1007/s11263-007-0109-1
