Segmentation of objects in a detection window by Nonparametric Inhomogeneous CRFs

https://doi.org/10.1016/j.cviu.2011.07.006

Abstract

This paper presents a method for segmenting objects of a specific class in a given detection window. The task is to label each pixel as belonging to the foreground or the background. We pose the problem as finding the maximum a posteriori (MAP) estimate in a modified form of the Conditional Random Field model that we call a Nonparametric Inhomogeneous CRF (NICRF). An NICRF, like a conventional CRF, has nodes representing pixels and pairwise links connecting neighboring pixels; however, both the unary and pairwise energy terms are inhomogeneous in the sense of being dependent on pixel positions, to account for prior information about the known object class. It differs from earlier methods in that position information takes the form of a unique term function for each individual pixel, rather than the same parametric function with varying parameters. Unary terms are given by a boosted classifier, learned from novel Adaptive Edgelet Features (AEFs), that infers the probability of a pixel being foreground; pairwise terms are learned as joint probabilities of neighboring pixel labels as a function of contrast, with a monotonicity constraint to reduce overfitting. We expand the neighborhood used for pairwise terms and add inhomogeneous weighting factors for the different pairwise terms. We use the Loopy Belief Propagation (LBP) algorithm for MAP estimation. A local search process is proposed to deal with inaccurate detection windows. We evaluate our approach on examples of pedestrians and cars and demonstrate significant improvements compared to earlier methods.

Highlights

► Segmenting objects of a specific class in a given detection window. ► Nonparametric Inhomogeneous Conditional Random Field (NICRF) is introduced. ► Both unary and pairwise terms are different functions for different pixels. ► A local search process is proposed to deal with inaccurate detection windows. ► Significant improvements on examples of pedestrians and cars.

Introduction

Automatic detection and segmentation of objects are fundamental tasks in computer vision with many applications. Recently, there has been significant progress in methods for detecting specific objects, which provide detection windows around the objects of interest; however, the object may not necessarily be centered in the box (i.e., not aligned precisely), nor does the box have to bound the object tightly; some examples are shown in Fig. 1. More accurate delineation, or segmentation, of the object in a detection window would be useful for obtaining better object descriptions that could help tasks such as pose estimation [1], object tracking and object identification [2]. In these applications, it is important to maintain high pixel accuracy as well as to preserve important parts such as human limbs.

Segmentation of objects has been of interest in computer vision from the very early days, and the term “segmentation” has been used for many different tasks. Generally, the purpose of segmentation is to find regions in an image that correspond to objects or their parts. Segmentation methods can be classified by whether they use object model information. Traditional general segmentation approaches do not use object model information and tend to segment an image into regions with homogeneous properties, such as color [3], [4], texture [5], symmetry [6], and self-description [7]. While these methods may lead to good image segmentation, the regions do not necessarily correspond to objects (or their parts). On the other hand, object segmentation methods often focus on a specific class of objects, e.g., faces, pedestrians, and cars [8], [9], [10]. Combining the two kinds of approaches may achieve better results [11], [12], [13]. Spatial correlations between pixels of objects or their parts often contain valuable information for modeling objects. Cao and Fei-Fei [14] proposed a spatially coherent latent topic model to capture latent correlations between different parts of an image; Jojic et al. [15] modeled spatial correlations between different structures in an image. These approaches are able to segment images into more than two regions, but often require high-resolution images where object parts or fragments are easily distinguishable; therefore, they may not be applicable to lower-resolution data, which are common in applications such as surveillance or web data processing.

There have also been impressive developments in techniques of interactive or semi-automatic segmentation [7], [16], [17], [18], [19]. However, these methods are not applicable to processing large amounts of data where full automation is required, such as tracking in video streams.

In this paper, we focus on segmenting objects of a specific class given a detection window; thus our task is to label each pixel in the box as belonging to the object or the background. Some methods have been developed for this task in recent work [20], [21]; they formulate the labeling problem as one of classification and use features that are similar to those used for detection. They show good results on several examples, including those with limited resolution. However, they do not exploit the spatial coherence of labels of neighboring pixels.

The segmentation problem can be formulated as the task of finding the most probable labels for all pixels in a given image. Our approach includes use of both unary pixel properties (these include features computed from neighboring pixels) and pairwise relations between pixel labels. These relationships are embedded in a Conditional Random Field (CRF) model. To learn better unary terms, we develop a new class of features called Adaptive Edgelet Features (AEFs), which grow automatically to fit the object contours. CRF models offer a suitable formalism for rigorous combination of information between the pixels for considering both unary and pairwise terms. Though there are no constraints on definition of both terms, most previous work adopts uniformly parametric functions, without taking advantage of the prior knowledge of the object to be segmented. We exploit the generality of CRFs, resulting in a model that we call Nonparametric Inhomogeneous CRF (NICRF), where energy terms are nonparametric functions of pixel positions in the detection window (i.e., inhomogeneous), so that it naturally incorporates the constraints relevant to object segmentation in a detection window.

In a CRF, the labeling problem is transformed into an energy minimization problem; the energy is usually defined as the sum of a series of unary terms and pairwise terms, which indicate individual label preferences and spatial coherence respectively. Defining appropriate energy functions, representative of the task, is critical to obtaining good results, as has also been observed by Meltzer et al. [22]. In such work on image segmentation, most methods actually use the term Markov Random Field (MRF), but CRF is more appropriate since only conditional probabilities are computed. The energy terms are often independent of the positions of the pixels (i.e., homogeneous) [18], [23], [24], [19]. Some dependence on pixel position is common in most segmentation work that uses shape information; however, earlier work, e.g., [25], [26], [27], [28], typically learns some parameters of the term functions or the combination coefficients of different terms, while the term functions themselves are fixed and used uniformly. To the best of our knowledge, this paper is the first to describe learning different functions for different pixels, for both the unary and the pairwise terms. Note that our NICRF model learns different parts of a CRF model than [24]: in [24], both term functions are fixed but the weight for each term is learned, whereas in an NICRF, the term functions are learned inhomogeneously but the weights are predefined. The two approaches are thus complementary.
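The decomposition of the energy into unary and pairwise terms can be made concrete with a minimal sketch (illustrative only, not the paper's actual term functions; the table representations for `unary` and `pairwise` are assumptions for the example):

```python
import numpy as np

def crf_energy(labels, unary, pairwise):
    """Energy of a binary labeling: per-pixel unary terms plus pairwise
    terms over neighboring pixel pairs. `unary` maps a pixel index to a
    length-2 cost vector; `pairwise` maps an edge (i, j) to a 2x2 cost
    matrix. In an inhomogeneous model these tables differ per pixel and
    per edge, rather than coming from one shared parametric function."""
    e = sum(unary[i][labels[i]] for i in range(len(labels)))
    e += sum(m[labels[i], labels[j]] for (i, j), m in pairwise.items())
    return e
```

Minimizing this energy over all label vectors is equivalent to the MAP estimate in the corresponding CRF.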

The rationale for introducing this inhomogeneity is the observation that the features useful for labeling pixels in one part of the object (say, the head area of a person) may be quite different from those in other parts (say, the feet) because of differences in shape; we argue that using the same term functions with parameters that merely vary by position does not fully utilize the prior knowledge of the object, and such terms do not model objects accurately.

Our unary terms are based on learning from auto-grown Adaptive Edgelet Features (AEFs). The pairwise term is set to be joint probabilities of neighboring pixel labels as a function of the contrast; we learn nonparametric models for them and introduce a novel monotonicity constraint to avoid over-fitting to limited training data. We also expand the neighborhood of pixel pairs used in pairwise terms to be beyond the usual 4-connected pixels, and introduce inhomogeneous weighting factors for pairwise terms according to the distance of pixel pairs.
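One standard way to impose such a monotonicity constraint on a binned estimate is the pool-adjacent-violators algorithm (isotonic regression); the sketch below is an assumption for illustration, as the paper does not specify its exact fitting procedure:

```python
def pav_nonincreasing(y):
    """Pool-adjacent-violators: project a sequence onto the closest
    non-increasing sequence in the least-squares sense. Illustrates
    enforcing that the estimated probability of a neighboring pixel pair
    sharing a label decreases with contrast, smoothing out noise from
    limited training data."""
    blocks = []  # each block holds [sum, count] of merged entries
    for v in y:
        blocks.append([float(v), 1])
        # merge while a later block's mean exceeds the previous one's
        while len(blocks) > 1 and blocks[-2][0] / blocks[-2][1] < blocks[-1][0] / blocks[-1][1]:
            s, c = blocks.pop()
            blocks[-1][0] += s
            blocks[-1][1] += c
    out = []
    for s, c in blocks:
        out.extend([s / c] * c)
    return out
```

The projection replaces any locally increasing run of bin estimates with their average, so the fitted curve can only decrease as contrast grows.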

Once the CRF model is built, the problem becomes inference of the labels for all pixels that maximize the overall likelihood. There are several standard methods for efficient inference in CRFs, such as graph cuts [16] and belief propagation [29]. Since our energy function is not sub-modular, it is difficult to compute the global optimum; in particular, the graph-cut method does not guarantee it. Instead, we use the Loopy Belief Propagation (LBP) method [30] for simplicity. LBP maximizes the marginal probabilities of labels; we find that the method converges in just a few iterations (shown in Section 5).
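A minimal sum-product LBP loop for binary labels can be sketched as follows (an illustration of the generic algorithm, not the paper's schedule or potentials; the array layouts are assumptions):

```python
import numpy as np

def loopy_bp(unary, edges, pairwise, iters=20):
    """Sum-product loopy belief propagation for binary labels.
    `unary`: (n, 2) non-negative node potentials; `edges`: list of (i, j)
    pairs; `pairwise`: dict mapping each edge to a 2x2 potential matrix
    indexed [x_i, x_j]. Returns normalized per-node beliefs."""
    nbrs = {i: [] for i in range(len(unary))}
    for i, j in edges:
        nbrs[i].append(j)
        nbrs[j].append(i)
    msgs = {}
    for i, j in edges:
        msgs[(i, j)] = np.ones(2)  # message i -> j
        msgs[(j, i)] = np.ones(2)  # message j -> i
    for _ in range(iters):
        new = {}
        for (i, j) in msgs:
            b = unary[i].copy()
            for k in nbrs[i]:
                if k != j:
                    b = b * msgs[(k, i)]
            pot = pairwise[(i, j)] if (i, j) in pairwise else pairwise[(j, i)].T
            m = pot.T @ b  # m[x_j] = sum_{x_i} pot[x_i, x_j] * b[x_i]
            new[(i, j)] = m / m.sum()
        msgs = new
    beliefs = unary.copy().astype(float)
    for (k, i), m in msgs.items():
        beliefs[i] *= m
    return beliefs / beliefs.sum(axis=1, keepdims=True)
```

Taking the argmax of each node's belief gives the estimated label; on a loop-free subgraph the beliefs are exact marginals.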

The energy terms in the NICRF model allow some variations in the observed location and shape of the object (such as due to articulations and view point changes). However, sometimes the objects in the detected windows may be located far outside the normal range. To account for such variations, we introduce a local search process by perturbing the detection window to a few positions around the detected location and choosing the result with the highest likelihood.
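The local search step can be sketched as follows; `segment_fn` and the offset grid are hypothetical stand-ins for illustration, not the paper's actual segmentation routine or search range:

```python
def local_search(image, box, segment_fn, offsets=(-4, 0, 4)):
    """Perturb the detection window around its reported position and keep
    the segmentation with the highest likelihood. `segment_fn` is a
    hypothetical callable taking (image, window) and returning
    (labels, log_likelihood); `box` is (x, y, w, h)."""
    x, y, w, h = box
    best_labels, best_ll = None, float("-inf")
    for dx in offsets:
        for dy in offsets:
            labels, ll = segment_fn(image, (x + dx, y + dy, w, h))
            if ll > best_ll:
                best_labels, best_ll = labels, ll
    return best_labels, best_ll
```

Because the NICRF yields a likelihood for each candidate window, the perturbed positions can be compared directly without any extra scoring model.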

We evaluate our approach on two multi-view data sets of pedestrians and cars and compare with several previous methods. Experiments show that our approach results in significant improvements, and demonstrate the effectiveness of our learning methods for both the unary and pairwise terms, as well as the importance of the weighted balancing factors and local search approach.

The rest of this paper is organized as follows: the formulation of the NICRF model is given in Section 2; Section 3 describes learning approaches for the unary and pairwise terms of the NICRF; the segmentation process by NICRF inference is presented in Section 4; experimental results are shown in Section 5, followed by conclusions in Section 6.

Section snippets

Nonparametric Inhomogeneous CRF

Given an input image patch I normalized to a reference size w × h, the objective of segmentation is to infer a hidden state variable X = (x1, …, xm), m = w × h, in which xi = 1 indicates that the ith pixel is foreground and xi = 0 indicates background. The segmentation task is formulated as an MAP problem to compute

X* = argmax_X P(X|I) = argmax_X (1/Z) exp(−Ψ(X|I))

where Z is the partition function, and Ψ is a suitable “potential function”. Assuming that the joint distributions for more than two labels do not have
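For intuition, the MAP objective can be illustrated by exhaustive search over a tiny number of pixels (illustration only; the paper uses Loopy Belief Propagation for real-sized windows, where enumeration is infeasible):

```python
import itertools

def map_labeling(energy_fn, m):
    """Exhaustive MAP for a tiny pixel count m: since
    P(X|I) = (1/Z) exp(-Psi(X|I)) and Z does not depend on X, maximizing
    the posterior is equivalent to minimizing the energy Psi, and the
    partition function never needs to be computed."""
    return min(itertools.product((0, 1), repeat=m), key=energy_fn)
```

Note that the 2^m candidate labelings make this approach viable only for toy examples, which is precisely why approximate inference is needed.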

Learning energy terms in NICRF

In this section, we introduce approaches for learning unary and pairwise terms. A new feature named Adaptive Edgelet Feature (AEF) is proposed to learn better unary terms, and label joint probabilities with monotonicity constraints are learned as pairwise terms.

Object segmentation by NICRF inference

This section describes how to segment objects of a specific class by MAP inference in the NICRF model using Loopy Belief Propagation. A local search is proposed to deal with inaccurate detection windows.

Experiments

We applied our method to two object categories: pedestrians and cars. For pedestrians, we collected over 3000 image samples from the MIT pedestrian set [33] and the Internet, and split them 4:1 into training and testing sets. This data set includes pedestrians with different viewpoints and articulations. Unlike [20], which separates the frontal/rear views from the side view, we use all the samples together for segmentation. The detection samples were obtained by the detector

Conclusion

We have developed a more general use of CRFs, called the Nonparametric Inhomogeneous Conditional Random Field (NICRF), to segment objects in a detection window, and developed methods to learn inhomogeneous unary and pairwise terms from ground truth data with nonparametric models. We find that the CRF is a good framework for object segmentation, and adopting nonparametric inhomogeneous term functions improves performance considerably for a specific class of objects. We observe that the auto-grown

Acknowledgments

This paper is based upon work supported in part by Office of Naval Research under grant number N00014-10-1-0517.

References (40)

  • M. Andriluka, S. Roth, B. Schiele, Pictorial structures revisited: people detection and articulated pose estimation,...
  • C.-H. Kuo, C. Huang, R. Nevatia, Multi-target tracking by on-line learned discriminative appearance models, in: CVPR,...
  • J. Shi et al., Normalized cuts and image segmentation, PAMI (2000)
  • D. Comaniciu et al., Mean shift: a robust approach toward feature space analysis, PAMI (2002)
  • J. Chen et al., Adaptive perceptual color-texture image segmentation, IEEE Transactions on Image Processing (2005)
  • Y. Sun, B. Bhanu, Symmetry integrated region-based image segmentation, in: CVPR,...
  • S. Bagon, O. Boiman, M. Irani, What is a good image segment? A unified approach to segment extraction, in: ECCV,...
  • J. Winn, N. Jojic, LOCUS: learning object classes with unsupervised segmentation, in: ICCV,...
  • Z. Lin, L.S. Davis, A pose-invariant descriptor for human detection and segmentation, in: ECCV,...
  • C. Beleznai, H. Bischof, Fast human detection in crowded scenes by contour integration and local shape estimation, in:...
  • E. Borenstein, E. Sharon, S. Ullman, Combining top-down and bottom-up segmentation, in: CVPR,...
  • M. Kumar, P. Torr, A. Zisserman, OBJCUT, in: CVPR,...
  • A. Levin et al., Learning to combine bottom-up and top-down segmentation, IJCV (2009)
  • L. Cao, L. Fei-Fei, Spatially coherent latent topic model for concurrent segmentation and classification of objects and...
  • N. Jojic, A. Perina, M. Cristani, V. Murino, B. Frey, Stel component analysis: modeling spatial correlations in image...
  • Y. Boykov et al., An experimental comparison of min-cut/max-flow algorithms for energy minimization in vision, PAMI (2004)
  • S. Vicente, V. Kolmogorov, C. Rother, Graph cut based image segmentation with connectivity priors, in: CVPR,...
  • C. Rother, V. Kolmogorov, A. Blake, “GrabCut”: interactive foreground extraction using iterated graph cuts, in: ACM...
  • V. Lempitsky, P. Kohli, C. Rother, T. Sharp, Image segmentation with a bounding box prior, in: ICCV,...
  • B. Wu, R. Nevatia, Simultaneous object detection and segmentation by boosting local shape feature based classifier, in:...