Probabilistic Joint Image Segmentation and Labeling by Figure-Ground Composition

International Journal of Computer Vision

Abstract

We propose a layered statistical model for image segmentation and labeling obtained by combining independently extracted, possibly overlapping sets of figure-ground (FG) segmentations. The process of constructing consistent image segmentations, called tilings, is cast as optimization over sets of maximal cliques sampled from a graph connecting all non-overlapping figure-ground segment hypotheses. Potential functions over cliques combine unary Gestalt-based figure qualities with pairwise compatibilities among spatially neighboring segments, constrained by T-junctions and the boundary interface statistics of real scenes. Building on the segmentation layer, we further derive a joint image segmentation and labeling model (JSL) which, given a bag of FGs, constructs a joint probability distribution over both the compatible image interpretations (tilings) composed from those segments and their labelings into categories. Drawing samples from the joint distribution can be interpreted as first sampling tilings and then sampling labelings conditioned on the chosen tiling. We learn the segmentation and labeling parameters jointly by maximum likelihood, using a novel estimation procedure we refer to as incremental saddle-point approximation. The partition function over tilings and labelings is approximated increasingly accurately by including incorrect configurations that candidate models rate as probable during learning. State-of-the-art results are reported on the Berkeley, Stanford and Pascal VOC datasets, where an improvement of 28% was achieved for the segmentation task alone (tiling), and an accuracy of 47.8% was obtained on the VOC12 test set for semantic labeling (JSL).
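The tiling layer described in the abstract can be illustrated with a toy sketch under simplifying assumptions. Segments are represented as hypothetical sets of pixel indices, two segments are connected when they share no pixel, and every maximal clique of that graph is a tiling. Unlike the sampling procedure the abstract describes, this sketch enumerates all maximal cliques exhaustively with the Bron-Kerbosch algorithm (feasible only for tiny pools), scores each tiling with made-up unary and pairwise values, and normalizes the scores into a Gibbs distribution. Labels are then drawn per segment, conditioned on the sampled tiling, from hypothetical class scores; treating segment labels as independent given the tiling is a simplification of this sketch, not the authors' JSL formulation. All function names and numbers here are illustrative assumptions.

```python
import itertools
import math
import random


def compatibility_graph(segments):
    """Adjacency sets connecting every pair of segments that share no pixel."""
    adj = {i: set() for i in range(len(segments))}
    for i, j in itertools.combinations(range(len(segments)), 2):
        if not (segments[i] & segments[j]):
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; each maximal clique is one tiling."""
    def expand(r, p, x):
        if not p and not x:
            yield frozenset(r)
            return
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}
    yield from expand(set(), set(adj), set())


def tiling_score(tiling, unary, pairwise):
    """Log-potential: unary figure qualities plus pairwise terms for the
    pairs listed in `pairwise` (unlisted pairs contribute 0)."""
    score = sum(unary[i] for i in tiling)
    for i, j in itertools.combinations(sorted(tiling), 2):
        score += pairwise.get((i, j), 0.0)
    return score


def tiling_distribution(segments, unary, pairwise):
    """Exact Gibbs distribution p(T) proportional to exp(score(T)) over all tilings."""
    tilings = list(maximal_cliques(compatibility_graph(segments)))
    scores = [tiling_score(t, unary, pairwise) for t in tilings]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    z = sum(weights)  # partition function, scaled by exp(-top)
    return {t: w / z for t, w in zip(tilings, weights)}


def sample_joint(dist, class_scores, rng):
    """Draw a tiling T ~ p(T), then one label per segment ~ p(label | T).
    Simplification: segment labels are independent softmaxes given T."""
    tilings = list(dist)
    tiling = rng.choices(tilings, weights=[dist[t] for t in tilings])[0]
    labels = {}
    for i in sorted(tiling):
        names = list(class_scores[i])
        top = max(class_scores[i].values())
        weights = [math.exp(class_scores[i][c] - top) for c in names]
        labels[i] = rng.choices(names, weights=weights)[0]
    return tiling, labels


# Hypothetical toy pool on a 6-pixel image (segments may overlap).
segments = [frozenset({0, 1, 2}), frozenset({3, 4, 5}),
            frozenset({0, 1, 2, 3}), frozenset({4, 5})]
unary = [1.0, 0.5, 0.8, 0.2]             # made-up figure qualities
pairwise = {(0, 1): 0.3, (2, 3): -0.1}   # made-up neighbor compatibilities
class_scores = {i: {"object": 1.0 * (i % 2), "background": 0.5}
                for i in range(len(segments))}

dist = tiling_distribution(segments, unary, pairwise)
# Tilings: {0, 1}, {0, 3}, {2, 3}; {0, 1} has the highest score (1.8).
tiling, labels = sample_joint(dist, class_scores, random.Random(0))
print(sorted(tiling), labels)
```

Exhaustive enumeration is what makes the distribution exact here; as Note 9 reports for the Enum baseline, it becomes intractable as the segment pool grows, and the model described above works with sampled maximal cliques instead.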

Figs. 1-8 (images not included)

Notes

  1. The segments are defined by the pixels they contain and have fixed positions in the image; they cannot be moved like puzzle pieces. Moreover, while disallowing overlap increases exposure to imperfect boundary alignments between the segments selected in any single tiling, it dramatically reduces the solution space and avoids the problem of assigning pixels that lie in segment intersections.

  2. We call a segmentation assembled from non-overlapping figure-ground segments a tiling, and the tiling together with the set of corresponding labels for its segments a labeling (rather than a labeled tiling). Assigning a label to a segment also assigns this label to all the pixels of the segment.

  3. This approximation is similar in spirit to the one in Sect. 2.2. By requiring that at least one tiling be retained for each segment, we aim for a uniform spread of the sampled tilings, which at the same time correspond to modes of the probability distribution.

  4. In our implementations, the actual running times were 180.3 h for the PMA and 1.3 h for the incremental saddle-point approximation, measured on an Intel Xeon workstation.

  5. An alternative strategy to approximate the partition function by using samples from the target distribution is contrastive divergence (Hinton 2002). Samples are obtained by running an MCMC chain for a limited number of steps. The obtained estimate is biased, but has been observed to perform well in practice.

  6. Following the method we compare to, only the annotated regions are considered when evaluating FG-Tiling.

  7. The tiling parameters have been learned for BSDS on the training set, for Stanford over \(5\) folds, and for VOC2009 on the training set, respectively, using the methodology described in Sect. 2.2.

  8. We also selected the scale parameter that optimized the First score on each dataset.

  9. The 1 min slot given to Enum (1min) is about 7.5\(\times \) the average run-time of FG-Tiling on the BSDS test set. Without the time constraint, Enum had not finished enumerating cliques after 48 h on an image for which a pool of \(|\mathcal {S}|=120\) figure-ground segmentations was used.

  10. Recall that the VOC score is defined as the average per-class overlap between the pixels labeled with each class and the corresponding ground-truth annotation.
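The VOC score recalled in Note 10 can be sketched as follows, assuming the standard PASCAL VOC reading of "overlap" as per-class intersection over union, TP / (TP + FP + FN), with counts accumulated over all images, ground-truth pixels labeled 255 ("void") ignored, and the per-class values averaged. Labels are flat integer sequences here; the function name `voc_mean_iou` and the choice to skip classes that appear in neither prediction nor ground truth are assumptions of this sketch, not part of the evaluation protocol described in the article.

```python
from collections import Counter

VOID = 255  # VOC convention: boundary/difficult pixels are labeled 255 and ignored


def voc_mean_iou(pairs, num_classes):
    """Mean per-class IoU = TP / (TP + FP + FN), with counts accumulated
    over all (prediction, ground_truth) pairs of flat label sequences."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for pred, gt in pairs:
        for p, g in zip(pred, gt):
            if g == VOID:
                continue  # void pixels count neither for nor against
            if p == g:
                tp[g] += 1
            else:
                fp[p] += 1  # predicted class p claims a pixel it should not
                fn[g] += 1  # true class g misses this pixel
    ious = []
    for c in range(num_classes):
        union = tp[c] + fp[c] + fn[c]
        if union:  # assumption: skip classes absent from both prediction and GT
            ious.append(tp[c] / union)
    return sum(ious) / len(ious) if ious else 0.0


# Toy example with two classes (0 = background, 1 = object) and one void pixel.
pred = [0, 0, 1, 1, 1, 0]
gt = [0, 0, 0, 1, VOID, 1]
score = voc_mean_iou([(pred, gt)], num_classes=2)
# class 0: 2 / (2 + 1 + 1) = 0.5; class 1: 1 / (1 + 1 + 1) = 1/3; mean = 5/12
print(round(score, 4))
```

Accumulating counts over the whole set before dividing, rather than averaging per-image scores, keeps small images from dominating the per-class ratios.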

References

  • Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2009). From contours to regions: An empirical evaluation. In: IEEE International Conference on Computer Vision and Pattern Recognition.

  • Arbelaez, P., Hariharan, B., Gu, C., Gupta, S., Bourdev, L. D., & Malik, J. (2012). Semantic segmentation using regions and parts. In: IEEE International Conference on Computer Vision and Pattern Recognition.

  • Bagon, S., Boiman, O., & Irani, M. (2008). What is a good image segment? A unified approach to segment extraction. In: European Conference on Computer Vision.

  • Barbu, A., & Zhu, S. C. (2005). Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1239–1253.

  • Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D. M., & Jordan, M. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.

  • Bomze, I., Budinich, M., Pardalos, P., & Pelillo, M. (1999). Handbook of combinatorial optimization (pp. 1–74). Dordrecht: Kluwer Academic.

  • Bomze, I., Pelillo, M., & Stix, V. (2000). Approximating the maximum weight clique using replicator dynamics. IEEE Transactions on Neural Networks, 11(6), 1228–1241.

  • Brendel, W., & Todorovic, S. (2010). Segmentation as maximum-weight independent set. In: Advances in Neural Information Processing Systems.

  • Carreira, J., & Sminchisescu, C. (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), 1312–1328.

  • Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012a). Semantic segmentation with second-order pooling. In: European Conference on Computer Vision.

  • Carreira, J., Li, F., & Sminchisescu, C. (2012b). Object recognition by sequential figure-ground ranking. International Journal of Computer Vision, 98(3), 243–262.

  • Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619.

  • Cour, T., Gogin, N., & Shi, J. (2005). Learning spectral graph segmentation. In: Artificial Intelligence and Statistics.

  • Csurka, G., & Perronnin, F. (2010). An efficient approach to semantic segmentation. International Journal of Computer Vision, 88, 1–15.

  • Dann, C., Gehler, P. V., Roth, S., & Nowozin, S. (2012). Pottics: The Potts topic model for semantic image segmentation. In: Proceedings of DAGM/OAGM Symposium.

  • Endres, I., & Hoiem, D. (2010). Category independent object proposals. In: European Conference on Computer Vision.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2012). The PASCAL visual object classes challenge (VOC) results. http://www.pascal-network.org/challenges/VOC/.

  • Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1915–1929.

  • Felzenszwalb, P., & Huttenlocher, D. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 167–181.

  • Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Fulkerson, B., Vedaldi, A., & Soatto, S. (2009). Class segmentation and object localization with superpixel neighborhoods. In: IEEE International Conference on Computer Vision.

  • Ghose, T., & Palmer, S. (2005). Surface convexity and extremal edges in depth and figure-ground perception. Journal of Vision, 5(8), 970–970.

  • Gonfaus, J. M., Boix, X., van de Weijer, J., Bagdanov, A. D., Serrat, J., & Gonzalez, J. (2010). Harmony potentials for joint classification and segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition.

  • Gould, S., Rodgers, J., Cohen, D., Elidan, G., & Koller, D. (2008). Multi-class segmentation with relative location prior. International Journal of Computer Vision, 80(3), 300–316.

  • Gould, S., Fulton, R., & Koller, D. (2009a). Decomposing a scene into geometric and semantically consistent regions. In: IEEE International Conference on Computer Vision.

  • Gould, S., Gao, T., & Koller, D. (2009b). Region-based segmentation and object detection. In: Advances in Neural Information Processing Systems.

  • He, X., Zemel, R. S., & Carreira-Perpinan, M. (2004). Multiscale conditional random fields for image labeling. In: IEEE International Conference on Computer Vision and Pattern Recognition.

  • Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.

  • Hoiem, D., Efros, A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision, 75(1), 151–172.

  • Huggins, P., Chen, H., Belhumeur, P., & Zucker, S. (2001). Finding folds: On the appearance and identification of occlusion. In: IEEE International Conference on Computer Vision and Pattern Recognition.

  • Ion, A., Carreira, J., & Sminchisescu, C. (2011a). Image segmentation by figure-ground composition into maximal cliques. In: IEEE International Conference on Computer Vision.

  • Ion, A., Carreira, J., & Sminchisescu, C. (2011b). Probabilistic joint image segmentation and labeling. In: Advances in Neural Information Processing Systems.

  • Kohli, P., Ladicky, L., & Torr, P. (2008). Robust higher order potentials for enforcing label consistency. In: IEEE International Conference on Computer Vision and Pattern Recognition.

  • Kolmogorov, V. (2006). Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1568–1583.

  • Kumar, M. P., & Koller, D. (2010). Efficiently selecting regions for scene understanding. In: IEEE International Conference on Computer Vision and Pattern Recognition.

  • Kumar, S., August, J., & Hebert, M. (2005). Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. In: Energy Minimization Methods in Computer Vision and Pattern Recognition.

  • Ladicky, L., Russell, C., Kohli, P., & Torr, P. H. S. (2009). Associative hierarchical CRFs for object class image segmentation. In: IEEE International Conference on Computer Vision.

  • Ladicky, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. (2010). What, where & how many? Combining object detectors and CRFs. In: European Conference on Computer Vision.

  • Leichter, I., & Lindenbaum, M. (2009). Boundary ownership by lifting to 2.1D. In: IEEE International Conference on Computer Vision.

  • Li, F., Ionescu, C., & Sminchisescu, C. (2010). Random Fourier approximations for skewed multiplicative histogram kernel. In: Proceedings of DAGM Symposium.

  • Lim, J., Arbelaez, P., Gu, C., & Malik, J. (2009). Context by region ancestry. In: IEEE International Conference on Computer Vision.

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Malisiewicz, T., & Efros, A. (2007). Improving spatial support for objects via multiple segmentations. In: British Machine Vision Conference.

  • Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE International Conference on Computer Vision.

  • Nowozin, S., Gehler, P., & Lampert, C. (2010). On parameter learning in CRF-based approaches to object class image segmentation. In: European Conference on Computer Vision.

  • Pantofaru, C., Schmid, C., & Hebert, M. (2008). Object recognition by integrating multiple image segmentations. In: European Conference on Computer Vision.

  • Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems.

  • Ren, X., & Malik, J. (2003). Learning a classification model for segmentation. In: IEEE International Conference on Computer Vision.

  • Ren, X., Fowlkes, C., & Malik, J. (2006). Figure/ground assignment in natural images. In: European Conference on Computer Vision.

  • van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1582–1596.

  • Sarawagi, S., & Cohen, W. W. (2004). Semi-Markov conditional random fields for information extraction. In: Advances in Neural Information Processing Systems.

  • Sharon, E., Galun, M., Sharon, D., Basri, R., & Brandt, A. (2006). Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104), 810–813.

  • Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.

  • Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2009). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81, 2–23.

  • Tu, Z., Chen, X., Yuille, A., & Zhu, S. C. (2003). Image parsing: Unifying segmentation, detection, and recognition. In: IEEE International Conference on Computer Vision.

  • Xia, W., Song, Z., Feng, J., Cheong, L. F., & Yan, S. (2012). Segmentation over detection by coupled global and local sparse representations. In: European Conference on Computer Vision.

  • Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. C. (2012). Layered object models for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1731–1743.

Acknowledgments

This work was supported, in part, by CNCS-UEFISCDI under PCE-2011-3-0438 and CT-ERC-2012-1, and by FCT under PTDC/EEA-CRO/122812/2010. The authors thank the anonymous reviewers for their useful comments and suggestions.

Author information

Corresponding author

Correspondence to Cristian Sminchisescu.

About this article

Cite this article

Ion, A., Carreira, J. & Sminchisescu, C. Probabilistic Joint Image Segmentation and Labeling by Figure-Ground Composition. Int J Comput Vis 107, 40–57 (2014). https://doi.org/10.1007/s11263-013-0663-7

  • DOI: https://doi.org/10.1007/s11263-013-0663-7
