Abstract
We propose a layered statistical model for image segmentation and labeling obtained by combining independently extracted, possibly overlapping sets of figure-ground (FG) segmentations. The process of constructing consistent image segmentations, called tilings, is cast as optimization over sets of maximal cliques sampled from a graph connecting all non-overlapping figure-ground segment hypotheses. Potential functions over cliques combine unary, Gestalt-based figure qualities and pairwise compatibilities among spatially neighboring segments, constrained by T-junctions and the boundary interface statistics of real scenes. Building on the segmentation layer, we further derive a joint image segmentation and labeling model (JSL) which, given a bag of FGs, constructs a joint probability distribution over both the compatible image interpretations (tilings) composed from those segments and their labeling into categories. Drawing samples from the joint distribution can be interpreted as first sampling tilings, then sampling labelings conditioned on the chosen tiling. We learn the segmentation and labeling parameters jointly, based on maximum likelihood with a novel estimation procedure we refer to as incremental saddle-point approximation. The partition function over tilings and labelings is approximated with increasing accuracy by including incorrect configurations that are rated as probable by candidate models during learning. State-of-the-art results are reported on the Berkeley, Stanford and Pascal VOC datasets, where an improvement of 28% was achieved for the segmentation task only (tiling), and an accuracy of 47.8% was obtained on the test set of VOC12 for semantic labeling (JSL).
Notes
The segments are defined by the pixels they contain and have fixed positions in the image; they cannot be moved like puzzle pieces. Moreover, while disallowing overlap increases the exposure to imperfect boundary alignments between segments selected in any single tiling, it leads to a dramatic reduction in the solution space and does not raise additional issues with assigning pixels lying on segment intersections.
We call a segmentation assembled from non-overlapping figure-ground segments a tiling, and the tiling together with the set of corresponding labels for its segments a labeling (rather than a labeled tiling). Assigning a label to a segment also assigns this label to all the pixels of the segment.
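The definitions above can be illustrated with a minimal sketch, assuming segments are given as boolean pixel masks. The function names, the toy 2×2 masks and the plain Bron-Kerbosch enumeration are illustrative assumptions only; the paper samples maximal cliques rather than enumerating them exhaustively.

```python
import numpy as np

def compatibility_graph(masks):
    """Connect segments i and j iff their masks share no pixel."""
    n = len(masks)
    adj = {i: set() for i in range(n)}
    for i in range(n):
        for j in range(i + 1, n):
            if not np.logical_and(masks[i], masks[j]).any():
                adj[i].add(j)
                adj[j].add(i)
    return adj

def maximal_cliques(adj):
    """Enumerate maximal cliques with basic Bron-Kerbosch (no pivoting)."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            cliques.append(sorted(r))  # r cannot be extended: maximal
            return
        for v in list(p):
            expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    expand(set(), set(adj), set())
    return sorted(cliques)

def label_tiling(masks, tiling, labels, shape, background=0):
    """Assign each segment's label to all pixels of that segment."""
    out = np.full(shape, background, dtype=int)
    for seg, lab in zip(tiling, labels):
        out[masks[seg]] = lab
    return out

# Toy pool of four segments on a 2x2 image (hypothetical data).
masks = [
    np.array([[1, 0], [1, 0]], dtype=bool),  # left column
    np.array([[0, 1], [0, 1]], dtype=bool),  # right column
    np.array([[1, 1], [0, 0]], dtype=bool),  # top row
    np.array([[0, 0], [0, 1]], dtype=bool),  # bottom-right pixel
]
tilings = maximal_cliques(compatibility_graph(masks))
labeling = label_tiling(masks, tilings[0], labels=[5, 7], shape=(2, 2))
```

In this toy pool a tiling need not cover every pixel: the clique made of the left column and the bottom-right pixel leaves one pixel unassigned, which the sketch fills with a background label.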
This approximation is similar in spirit to the one in Sect. 2.2. By enforcing at least one tiling to be retained for each segment, we aim for a uniform spread of the sampled tilings, which at the same time correspond to modes of the probability distribution.
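One way to read the per-segment retention constraint is sketched below. This is an assumption about the mechanism, not the authors' procedure: for every segment, the highest-scoring sampled tiling that contains it is kept. The name `retain_per_segment` and the toy scores are hypothetical.

```python
def retain_per_segment(tilings, scores):
    """Return indices of tilings kept so that every segment appearing in
    some tiling is covered by its best-scoring tiling."""
    best = {}  # segment id -> index of best tiling containing it
    for idx, tiling in enumerate(tilings):
        for seg in tiling:
            if seg not in best or scores[idx] > scores[best[seg]]:
                best[seg] = idx
    return sorted(set(best.values()))

# Hypothetical sampled tilings (sets of segment ids) and their scores.
sampled = [{0, 1}, {1, 2}, {0, 2}, {2, 3}]
scores = [0.9, 0.2, 0.8, 0.5]
kept = retain_per_segment(sampled, scores)
```

Under this reading, a low-scoring tiling survives only when it is the best available choice for at least one of its segments, which keeps the retained set spread across the segment pool.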
For our implementations, the actual running times were 180.3 h for the PMA and 1.3 h for the incremental saddle-point approximation. The computer used was an Intel Xeon workstation.
An alternative strategy to approximate the partition function by using samples from the target distribution is contrastive divergence (Hinton 2002). Samples are obtained by running an MCMC chain for a limited number of steps. The obtained estimate is biased, but has been observed to perform well in practice.
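For readers unfamiliar with contrastive divergence, the following is a minimal sketch on a toy model, not on the tiling model of this paper. It assumes a fully visible binary pairwise model \(p(x) \propto \exp(b^\top x + \sum_{i<j} W_{ij} x_i x_j)\) with symmetric, zero-diagonal `W`; the function names are hypothetical. The model expectation in the likelihood gradient is replaced by statistics from a Gibbs chain started at the data and run for only `k` sweeps.

```python
import numpy as np

def gibbs_sweep(x, W, b, rng):
    """One Gibbs sweep over all units, for each binary state (row) in x."""
    x = x.copy()
    for i in range(x.shape[1]):
        # W has zero diagonal, so unit i does not influence its own logit.
        logit = b[i] + x @ W[:, i]
        p_on = 1.0 / (1.0 + np.exp(-logit))
        x[:, i] = (rng.random(x.shape[0]) < p_on).astype(x.dtype)
    return x

def cd_gradient(data, W, b, k, rng):
    """CD-k estimate of the log-likelihood gradient w.r.t. (W, b)."""
    neg = data
    for _ in range(k):  # short chain: the source of the bias
        neg = gibbs_sweep(neg, W, b, rng)
    n = data.shape[0]
    gW = (data.T @ data - neg.T @ neg) / n  # data minus chain statistics
    np.fill_diagonal(gW, 0.0)
    gb = data.mean(axis=0) - neg.mean(axis=0)
    return gW, gb

rng = np.random.default_rng(0)
data = rng.integers(0, 2, size=(20, 4)).astype(float)  # toy binary data
W = np.zeros((4, 4))
b = np.zeros(4)
gW, gb = cd_gradient(data, W, b, k=1, rng=rng)
```

With `k = 0` the chain never leaves the data and the estimate is exactly zero; larger `k` moves the chain closer to the model distribution at a higher sampling cost.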
When evaluating FG-Tiling, only the annotated regions are considered, as was done by the method we compare against.
The tiling parameters have been learned for BSDS on the training set, for Stanford over \(5\) folds, and for VOC2009 on the training set, respectively, using the methodology described in Sect. 2.2.
We also selected the scale parameter that optimized the First score on each dataset.
The 1 min slot given to Enum (1min) is about 7.5\(\times \) the average run-time of FG-Tiling on the BSDS test set, which corresponds to roughly 8 s per image for FG-Tiling. Without the time constraint, Enum did not finish enumerating cliques after 48 h on an image where a pool of \(|\mathcal {S}|=120\) figure-ground segmentations was used.
Recall that the VOC score is defined as the average per-class overlap between pixels labeled in each class and the respective ground truth annotation.
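A minimal sketch of this score follows, assuming that overlap means intersection over union and that pixels marked with a void label (255 by the usual VOC convention) are excluded. The function name `voc_score` is hypothetical, and the sketch scores a single label map, whereas the benchmark accumulates counts over the whole test set before averaging over classes.

```python
import numpy as np

def voc_score(pred, gt, num_classes, ignore_label=255):
    """Mean over classes of intersection-over-union between predicted
    and ground-truth label maps, skipping void pixels."""
    valid = gt != ignore_label
    ious = []
    for c in range(num_classes):
        p = (pred == c) & valid
        g = (gt == c) & valid
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps: not counted
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else float("nan")

# Toy 2x2 label maps with two classes (hypothetical data).
pred = np.array([[0, 0], [1, 1]])
gt = np.array([[0, 1], [1, 1]])
score = voc_score(pred, gt, num_classes=2)  # (1/2 + 2/3) / 2 = 7/12
```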
References
Arbelaez, P., Maire, M., Fowlkes, C., & Malik, J. (2009). From contours to regions: An empirical evaluation. In: IEEE International Conference on Computer Vision and Pattern Recognition.
Arbelaez, P., Hariharan, B., Gu, C., Gupta, S., Bourdev, L. D., & Malik, J. (2012). Semantic segmentation using regions and parts. In: IEEE International Conference on Computer Vision and Pattern Recognition.
Bagon, S., Boiman, O., & Irani, M. (2008). What is a good image segment? A unified approach to segment extraction. In: European Conference on Computer Vision.
Barbu, A., & Zhu, S. C. (2005). Generalizing Swendsen-Wang to sampling arbitrary posterior probabilities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8), 1239–1253.
Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D. M., & Jordan, M. (2003). Matching words and pictures. Journal of Machine Learning Research, 3, 1107–1135.
Bomze, I., Budinich, M., Pardalos, P., & Pelillo, M. (1999). Handbook of combinatorial optimization (pp. 1–74). Dordrecht: Kluwer Academic.
Bomze, I., Pelillo, M., & Stix, V. (2000). Approximating the maximum weight clique using replicator dynamics. IEEE Transactions on Neural Networks, 11(6), 1228–1241.
Brendel, W., & Todorovic, S. (2010). Segmentation as maximum-weight independent set. In: Advances in Neural Information Processing Systems.
Carreira, J., & Sminchisescu, C. (2012). CPMC: Automatic object segmentation using constrained parametric min-cuts. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(7), 1312–1328.
Carreira, J., Caseiro, R., Batista, J., & Sminchisescu, C. (2012a). Semantic segmentation with second-order pooling. In: European Conference on Computer Vision.
Carreira, J., Li, F., & Sminchisescu, C. (2012b). Object recognition by sequential figure-ground ranking. International Journal of Computer Vision, 98(3), 243–262.
Comaniciu, D., & Meer, P. (2002). Mean shift: A robust approach toward feature space analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, 24(5), 603–619.
Cour, T., Gogin, N., & Shi, J. (2005). Learning spectral graph segmentation. In: Artificial Intelligence and Statistics.
Csurka, G., & Perronnin, F. (2010). An efficient approach to semantic segmentation. International Journal of Computer Vision, 88, 1–15.
Dann, C., Gehler, P. V., Roth, S., & Nowozin, S. (2012). Pottics—the potts topic model for semantic image segmentation. In: Proceedings of DAGM/OAGM Symposium.
Endres, I., & Hoiem, D. (2010). Category independent object proposals. In: European Conference on Computer Vision.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2010). The PASCAL visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2), 303–338.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2012). The PASCAL visual object classes challenge (VOC) results. http://www.pascal-network.org/challenges/VOC/.
Farabet, C., Couprie, C., Najman, L., & LeCun, Y. (2013). Learning hierarchical features for scene labeling. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35, 1915–1929.
Felzenszwalb, P., & Huttenlocher, D. (2004). Efficient graph-based image segmentation. International Journal of Computer Vision, 59(2), 167–181.
Felzenszwalb, P. F., Girshick, R. B., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part-based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Fulkerson, B., Vedaldi, A., & Soatto, S. (2009). Class segmentation and object localization with superpixel neighborhoods. In: IEEE International Conference on Computer Vision.
Ghose, T., & Palmer, S. (2005). Surface convexity and extremal edges in depth and figure-ground perception. Journal of Vision, 5(8), 970–970.
Gonfaus, J. M., Boix, X., van de Weijer, J., Bagdanov, A. D., Serrat, J., & Gonzalez, J. (2010). Harmony potentials for joint classification and segmentation. In: IEEE International Conference on Computer Vision and Pattern Recognition.
Gould, S., Rodgers, J., Cohen, D., Elidan, G., & Koller, D. (2008). Multi-class segmentation with relative location prior. International Journal of Computer Vision, 80(3), 300–316.
Gould, S., Fulton, R., & Koller, D. (2009a). Decomposing a scene into geometric and semantically consistent regions. In: IEEE International Conference on Computer Vision.
Gould, S., Gao, T., & Koller, D. (2009b). Region-based segmentation and object detection. In: Advances in Neural Information Processing Systems.
He, X., Zemel, R. S., & Carreira-Perpinan, M. (2004). Multiscale conditional random fields for image labeling. In: IEEE International Conference on Computer Vision and Pattern Recognition.
Hinton, G. E. (2002). Training products of experts by minimizing contrastive divergence. Neural Computation, 14(8), 1771–1800.
Hoiem, D., Efros, A., & Hebert, M. (2007). Recovering surface layout from an image. International Journal of Computer Vision, 75(1), 151–172.
Huggins, P., Chen, H., Belhumeur, P., & Zucker, S. (2001). Finding folds: On the appearance and identification of occlusion. In: IEEE International Conference on Computer Vision and Pattern Recognition.
Ion, A., Carreira, J., & Sminchisescu, C. (2011a). Image segmentation by figure-ground composition into maximal cliques. In: IEEE International Conference on Computer Vision.
Ion, A., Carreira, J., & Sminchisescu, C. (2011b). Probabilistic joint image segmentation and labeling. In: Advances in Neural Information Processing Systems.
Kohli, P., Ladicky, L., & Torr, P. (2008). Robust higher order potentials for enforcing label consistency. In: IEEE International Conference on Computer Vision and Pattern Recognition.
Kolmogorov, V. (2006). Convergent tree-reweighted message passing for energy minimization. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(10), 1568–1583.
Kumar, M. P., & Koller, D. (2010). Efficiently selecting regions for scene understanding. In: IEEE International Conference on Computer Vision and Pattern Recognition.
Kumar, S., August, J., & Hebert, M. (2005). Exploiting inference for approximate parameter learning in discriminative fields: An empirical study. In: Energy Minimization Methods in Computer Vision and Pattern Recognition.
Ladicky, L., Russell, C., Kohli, P., & Torr, P. H. S. (2009). Associative hierarchical CRFs for object class image segmentation. In: IEEE International Conference on Computer Vision.
Ladicky, L., Sturgess, P., Alahari, K., Russell, C., & Torr, P. (2010). What, where & how many? Combining object detectors and CRFs. In: European Conference on Computer Vision.
Leichter, I., & Lindenbaum, M. (2009). Boundary ownership by lifting to 2.1D. In: IEEE International Conference on Computer Vision.
Li, F., Ionescu, C., & Sminchisescu, C. (2010). Random Fourier approximations for skewed multiplicative histogram kernel. In: Proceedings of DAGM Symposium.
Lim, J., Arbelaez, P., Gu, C., & Malik, J. (2009). Context by region ancestry. In: IEEE International Conference on Computer Vision.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Malisiewicz, T., & Efros, A. (2007). Improving spatial support for objects via multiple segmentations. In: British Machine Vision Conference.
Martin, D., Fowlkes, C., Tal, D., & Malik, J. (2001). A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In: IEEE International Conference on Computer Vision.
Nowozin, S., Gehler, P., & Lampert, C. (2010). On parameter learning in CRF-based approaches to object class image segmentation. In: European Conference on Computer Vision.
Pantofaru, C., Schmid, C., & Hebert, M. (2008). Object recognition by integrating multiple image segmentations. In: European Conference on Computer Vision.
Rahimi, A., & Recht, B. (2007). Random features for large-scale kernel machines. In: Advances in Neural Information Processing Systems.
Ren, X., & Malik, J. (2003). Learning a classification model for segmentation. In: IEEE International Conference on Computer Vision.
Ren, X., Fowlkes, C., & Malik, J. (2006). Figure/ground assignment in natural images. In: European Conference on Computer Vision.
van de Sande, K. E. A., Gevers, T., & Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1582–1596.
Sarawagi, S., & Cohen, W. W. (2004). Semi-markov conditional random fields for information extraction. In: Advances in Neural Information Processing Systems.
Sharon, E., Galun, M., Sharon, D., Basri, R., & Brandt, A. (2006). Hierarchy and adaptivity in segmenting visual scenes. Nature, 442(7104), 719–846.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Shotton, J., Winn, J., Rother, C., & Criminisi, A. (2009). TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context. International Journal of Computer Vision, 81, 2–23.
Tu, Z., Chen, X., Yuille, A., & Zhu, S. C. (2003). Image parsing: unifying segmentation, detection, and recognition. In: IEEE International Conference on Computer Vision.
Xia, W., Song, Z., Feng, J., Cheong, L. F., & Yan, S. (2012). Segmentation over detection by coupled global and local sparse representations. In: European Conference on Computer Vision.
Yang, Y., Hallman, S., Ramanan, D., & Fowlkes, C. C. (2012). Layered object models for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9), 1731–1743.
Acknowledgments
This work was supported, in part, by CNCS-UEFICSDI, under PCE-2011-3-0438, and CT-ERC-2012-1, and by FCT under PTDC/EEA-CRO/122812/2010. The authors thank the anonymous reviewers for their useful comments and suggestions.
Cite this article
Ion, A., Carreira, J. & Sminchisescu, C. Probabilistic Joint Image Segmentation and Labeling by Figure-Ground Composition. Int J Comput Vis 107, 40–57 (2014). https://doi.org/10.1007/s11263-013-0663-7
DOI: https://doi.org/10.1007/s11263-013-0663-7