Edge affinity for pose-contour matching

https://doi.org/10.1016/j.cviu.2006.06.008

Abstract

We present an approach for whole-body pose-contour matching. Contour matching in natural images in the absence of foreground–background segmentation is difficult. Usually an asymmetric approach is adopted, where a contour is said to match well if it aligns with a subset of the image’s gradients. This leads to problems as the contour can match with a portion of an object’s outline and ignore the remainder. We present a model for using edge continuity to address this issue. Pairs of edge elements in the image are linked with affinities if they are likely to belong to the same object. A contour that matches with a set of image gradients is constrained to also match with other gradients having high affinities with the chosen ones. Experimental results show that this improves matching performance.

Introduction

This paper explores the use of edge continuity for improving contour matching in natural images. The domain of application is human pose matching and gesture recognition. Given a set of human poses in the form of contour points and an image containing a person in one of the poses with some deformation, we seek to compute likelihoods for each pose to have occurred in the image.

Contour matching is used extensively in computer vision for human pose detection and recognition tasks. When applied to action or gesture recognition, it is used to compute pose observation likelihoods, which are then modeled using Hidden Markov Models (HMMs) [1], Markov Chain Monte Carlo (MCMC) [2], etc. Contour matching has also been used for object detection, e.g., [3], [4]. There are three stages to contour matching:

  • (1) Edge features of the objects in the images are detected.

  • (2) A pose-contour is imposed on the image for matching.

  • (3) The score for the match is generated by computing the distance between the image’s edge features and the imposed pose-contour.

Many studies, including ours, use gradient-based operators such as the Canny edge detector, Gaussian derivatives, etc., for detecting edge features. Reliably detecting object boundaries under general illumination conditions is difficult. Recent research on boundary detection has focused on using region segmentation as a pre-processing step for generating “super-pixels”: relatively small groups of pixels that have homogeneous features and are highly likely to belong to the same object. Boundaries of the super-pixels are used for matching object boundaries. For example, Mori et al. use normalized-cuts (n-cuts) to obtain super-pixels and then analyze their configurations to detect baseball players [5]. Sharon et al. use a multigrid approach for obtaining segment boundaries [6].
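For stage (1), a gradient-magnitude feature map can be computed with Gaussian derivative filters, one of the operator families mentioned above. The sketch below is our own illustration, not the paper's code; `gradient_magnitude` is a name we introduce.

```python
# Stage (1): gradient-magnitude feature map via Gaussian derivative filters
# (one common choice; the Canny detector is another). This is our own
# illustration, not the paper's code.
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_magnitude(img: np.ndarray, sigma: float = 1.0) -> np.ndarray:
    gx = gaussian_filter(img, sigma, order=(0, 1))  # derivative along columns
    gy = gaussian_filter(img, sigma, order=(1, 0))  # derivative along rows
    return np.hypot(gx, gy)

# A vertical step edge produces a strong response at the step.
img = np.zeros((16, 16))
img[:, 8:] = 1.0
g = gradient_magnitude(img)
print(g[:, 7:9].mean() > g[:, :4].mean())  # True
```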

The pose-contour to be matched with the test image could either be collected during training or generated using a model. Whole-body contours have been used for human pose-matching in [7], [8], [9], [10], [1], etc. Zhang et al. use a Bayes-nets based articulated model for pedestrian detection [11]. Ronfard et al. follow a bottom-up part-based approach to detecting people [12]. They train Support Vector Machines (SVMs) on gradients of limbs obtained from training images. In the present study, the pose-contours correspond to the whole body of the subject and are collected during a training phase.

Rosin and West presented a continuous form of chamfer distance which includes the saliency of the edges in the matching [13]. Their method avoids setting a threshold on the gradient magnitudes, which is a difficult issue. Butt and Maragos presented an efficient approach for computing the chamfer distance while minimizing errors due to discretization [14]. Toyama and Blake use sets of exemplar contours and chamfer distance for tracking pedestrians and mouth movements [9]. Mori and Malik introduced the Shape Context technique for matching human pose contours [10]. Olson and Huttenlocher used the Hausdorff distance for object recognition [7]. Leibe et al. present a study comparing contour-based and appearance-based object recognition in [15].

Images of people in natural scenes have significant edge clutter present in the background in addition to the subject’s figure. Ideally, these background edges should be ignored when matching pose-contours. However, reliable background suppression in natural images in the presence of camera and subject motion is difficult. There are three general ways of handling this:

The first approach forgoes the difficult task of background subtraction and compromises with asymmetric matching, which only measures how well a model pose-contour aligns with the image’s gradients. It does not verify whether these matching gradients form a coherent object. Current contour matching schemes either follow this asymmetric approach or assume background-subtracted images, e.g., [1], [3], [4], [7], [14], [9], [10], [11]. Predictably, this leads to problems, as a contour can match well with a subset of the edges of an object and ignore the rest of it. Consider the case shown in Fig. 1. Fig. 1a and e show an image and the edges of the subject. Fig. 1c and g show two pose-contours in the database, extracted from the training images shown in Fig. 1b and f, respectively. Clearly, the contour in Fig. 1c is the correct pose. However, when the poses are matched with the image (Fig. 1d and h) using chamfer matching, the wrong pose obtains a better score. The reason is that it has smaller extent at the arms, which, due to articulation, are the zones of highest matching error. Normalizing the error w.r.t. the length of the boundary does not ameliorate the situation.
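The failure mode described above can be reproduced on a toy edge map (a hypothetical illustration of ours, not data from the paper): a pose covering only the torso incurs no error, while the correct pose integrates articulation error along the arm and therefore scores worse.

```python
# Toy illustration (hypothetical data): asymmetric chamfer matching can
# prefer a contour that covers only part of the subject, because error is
# integrated only along the model contour.
import numpy as np
from scipy.ndimage import distance_transform_edt

# 20x20 edge map of a "subject": a torso column plus an extended arm.
edges = np.zeros((20, 20), dtype=bool)
edges[2:18, 9] = True        # torso outline (vertical line)
edges[5, 10:18] = True       # extended arm (horizontal line)

# Distance map: distance from every pixel to the nearest edge pixel.
dist = distance_transform_edt(~edges)

# Pose A (correct): torso + arm, but the arm is articulated 2 px off.
pose_a = [(r, 9) for r in range(2, 18)] + [(7, c) for c in range(10, 18)]
# Pose B (wrong): torso only -- it ignores the arm entirely.
pose_b = [(r, 9) for r in range(2, 18)]

score_a = np.mean([dist[p] for p in pose_a])   # integrates the arm error
score_b = np.mean([dist[p] for p in pose_b])   # zero error on the torso
print(score_b < score_a)  # True: the partial pose "wins"
```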

The second approach uses segmentation as a pre-processing step and then analyzes the segment boundaries for matching.

Typically, the continuity constraints are imposed on the segment boundaries: high curvatures are penalized and straight boundaries are promoted. Leung and Malik proposed a pairwise pixel affinity which takes into account the intervening gradients between pixels [16]. n-Cuts was used to obtain the final region segmentation. Ren and Malik presented a segmentation scheme in which super-pixels were computed as a pre-processing step for segmentation [17]. The continuity of super-pixel edges along a segment’s boundary was included as part of the segment’s goodness value. Yu and Shi generalized the n-cuts algorithm to partition both the pixels and the edge elements [18]. The graph nodes corresponding to edge elements are connected by affinities based on continuation. However, obtaining segments that directly correspond to holistic objects is a challenge. Usually, over-segmentation followed by recognition on groups of segments is favored, e.g., [5].

Jermyn and Ishikawa proposed an energy function for segmentation which includes both region and boundary cues [19]. The basic idea is to integrate the function along boundaries of segments and choose the segment with lowest energy. There has been related work on integrating segments using region and boundary cues [20], [21].

A closely related approach is based on detecting limbs as components shaped as rectangles and combining them using graphs or trees. The rectangles are detected using templates with uniform interior color and contrasting color in the periphery [22], [23], [24]. In [25], the components are combined using a cascade. It is not clear how these techniques could prevent errors due to asymmetric matching—the case shown in Fig. 1. These methods can easily ignore the extended arm in Fig. 1a and confine themselves to the torso—leading to an erroneous match.

The third approach—the one followed here—is to avoid performing segmentation while still taking into consideration edge continuity constraints. Given an image and a pose-contour to be matched, we find the set of gradients in the image that are likely to belong to the subject. If the given pose-contour is correct then this set must belong to the foreground. However, this “initial” set might be closely linked with other gradients in the image—which must also belong to the foreground. Edge continuity is used to expand the initial set to include other linked gradients. For the given pose-contour to be a good match to the image, it should match with the expanded set of gradients. The matching is performed using a modified form of chamfer distance. This framework provides a large measure of resistance to spurious matches in the case of highly textured scenes, and to incorrect matches when some poses match only partially with the subject but obtain a high score by avoiding integrating errors in articulated parts of the body (as illustrated in Fig. 1).

A closely related approach for detecting lakes in satellite imagery was proposed by Elder et al. [26]. Here, edge continuity constraints are included in a probabilistic model to detect closed contours in edge maps. The authors also describe a method for learning the edge continuity priors in the context of detecting lakes. In our problem, the goal is to match a given set of contours with an image—this is different from the detection problem addressed in [26].

Thayananthan et al. [27] proposed an improvement to the Shape Context technique by enforcing neighborhood constraints on the matchings between point sets. They require that neighboring points on the pose-contour be mapped to neighboring points on the image. However, it is not clear whether this would guarantee that the mapped gradients also form a holistic object.

Additionally, there have been many recent studies on linking segmentation and object recognition. Cremers et al. introduced a variational framework for combining segmentation and recognition [28]. Yu et al. introduced a generalized version of the normalized-cuts algorithm in which the graph affinities include body-part configuration constraints along with spatial continuity criteria [29]. Borenstein et al. extended the multiscale segmentation algorithm to enable object recognition by using the segments’ saliency as constraints [30], [31]. These approaches employ region-based segmentation and appearance modeling. We complement them by introducing a model for combining edge grouping with contour matching.

Our model for matching a pose-contour to an image combines two measures:

  • (1) The first one measures how well the pose-contour aligns with the gradients in the image. This is computed using an extended form of chamfer matching applied to a continuous gradient magnitude field instead of a discrete edge map. We refer to this as cpi.

  • (2) The second measures how well the subject’s gradients in the image align with the pose-contour. It verifies whether the image gradients underlying the test pose-contour form a holistic object, or are part of a larger object. This measure is computed from the expanded set of gradients obtained from edge continuity. It is referred to as cip.

We propose an edge-affinity model for grouping edge elements in natural images depending upon whether they could belong to the same object. A pair of edge elements have high affinity if their orientations have good continuity and their neighborhoods have similar color statistics. Given an image and a pose-contour to be matched, an initial set of edge elements matching with the pose-contour is obtained. An iterative process is then used to expand this set to include other edge elements having high affinity with its members. The measure, cip, is computed from the degree of mismatch between the estimated outline of the subject and the pose-contour being considered.
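The iterative expansion step might be sketched as follows. This is a simplified illustration with a given affinity matrix and threshold; the paper derives the affinities from contour continuity and neighborhood color statistics, and its exact procedure may differ.

```python
# Hypothetical sketch of the iterative set expansion: starting from edge
# elements matched by the pose-contour, repeatedly add elements whose
# affinity to any current member exceeds a threshold. The affinity matrix
# is assumed given here.
import numpy as np

def expand_edge_set(affinity: np.ndarray, initial: set, thresh: float = 0.5) -> set:
    """Grow `initial` with edge elements strongly linked to its members."""
    members = set(initial)
    changed = True
    while changed:
        changed = False
        for j in range(affinity.shape[0]):
            if j in members:
                continue
            if any(affinity[i, j] > thresh for i in members):
                members.add(j)
                changed = True
    return members

# Elements 0-1 matched initially; 2 is linked to 1, and 3 to 2; 4 is isolated.
A = np.zeros((5, 5))
A[0, 1] = A[1, 0] = 0.9
A[1, 2] = A[2, 1] = 0.8
A[2, 3] = A[3, 2] = 0.7
expanded = expand_edge_set(A, {0, 1})
print(sorted(expanded))  # [0, 1, 2, 3]
```

The isolated element 4 never joins the set, mirroring how background clutter with low affinity to the subject's gradients is left out of the expanded set.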

The pose contours used in the present study were collected as part of a gesture recognition system. The training database consists of 14 gestures performed by 5 subjects (cf. Section 5). The subjects stand upright and the arms are the principal modes of gesticulation. The proposed pose-matching system is tested both with still images and in a gesture recognition application.

We first review work on edge continuity and then describe the edge affinity model. Section 3 describes the algorithm for using the edge affinities to compute cip. The extended form of chamfer matching is described in Section 4.

Section snippets

Edge affinity

Two edge elements in a given image are said to have high affinity if they are likely to be part of an object’s boundary. This depends upon:

  • (1) The “goodness” of the contour that could pass between them, with the contour’s orientation constrained by the orientation of the edge elements.

  • (2) The color statistics in their neighborhoods.

The proposed edge affinity model is presented in stages. First the dependence on the curvature of the contour connecting the two edge elements is described (cf. Eq.
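Although the paper's equations are truncated in this excerpt, the two dependencies above can be illustrated with a simple product of Gaussian falloffs. The functional form and the sigma values are our own assumptions, not the paper's model.

```python
# Hypothetical affinity combining (1) contour-continuity goodness, proxied
# here by a curvature penalty, and (2) neighborhood color similarity.
# The product form and sigmas are our assumptions, not the paper's equations.
import math

def edge_affinity(curvature: float, color_dist: float,
                  sigma_k: float = 0.5, sigma_c: float = 10.0) -> float:
    continuity = math.exp(-curvature ** 2 / (2 * sigma_k ** 2))
    color_sim = math.exp(-color_dist ** 2 / (2 * sigma_c ** 2))
    return continuity * color_sim

# Smooth continuation with similar colors -> affinity close to 1;
# sharp turns or color mismatch drive it toward 0.
print(edge_affinity(0.1, 1.0) > edge_affinity(2.0, 1.0))  # True
```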

Obtaining the activation fields

In most pose tracking and gesture recognition applications, a bootstrap subject-detection phase is used to locate the subject in the field of view. This provides the approximate location and scale of the subject for pose-matching. However, when a pose-contour is placed on an image, it will not coincide exactly with the subject’s gradients in the image. This could be due to variation in subject morphology, apparel, gesticulation style, etc. We allow for Gaussian additive noise in the location of
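One simple way to realize a Gaussian location-noise assumption (a sketch of ours; the paper's formulation is cut off in this excerpt) is to smooth the gradient map, so that a contour placed a few pixels off the true edges still receives support:

```python
# Sketch: blur the gradient-magnitude map so that small localization errors
# between the pose-contour and the subject's edges are tolerated. This is
# our own realization of the Gaussian noise assumption, not the paper's.
import numpy as np
from scipy.ndimage import gaussian_filter

def activation_field(grad_mag: np.ndarray, sigma: float = 2.0) -> np.ndarray:
    return gaussian_filter(grad_mag, sigma)

# A pixel 2 px away from an edge gets nonzero activation after smoothing.
g = np.zeros((11, 11))
g[5, 5] = 1.0
act = activation_field(g)
print(act[5, 7] > 0)  # True
```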

Extended chamfer matching for computing cpi

In classical chamfer matching, an image is first reduced to a map of feature points and a distance map is constructed from this feature map. The pose-contour to be matched is placed on the distance map and the distances are integrated along the contour. If the pose-contour matches well with a subset of features in the image then this integral would be small. The feature maps could be edge maps generated by thresholded gradient magnitudes, etc.
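The classical procedure just described can be sketched as follows (function names are ours), using a Euclidean distance transform:

```python
# Classical chamfer matching: reduce the image to a binary feature map,
# build a distance map, and integrate distances along the imposed contour.
# Our sketch of the standard recipe; names are ours, not the paper's.
import numpy as np
from scipy.ndimage import distance_transform_edt

def chamfer_score(edge_map: np.ndarray, contour_pts: np.ndarray) -> float:
    """Mean distance-map value under the contour (lower = better match).

    edge_map    : boolean H x W map of detected feature points.
    contour_pts : N x 2 integer array of (row, col) contour coordinates.
    """
    dist_map = distance_transform_edt(~edge_map)  # distance to nearest feature
    rows, cols = contour_pts[:, 0], contour_pts[:, 1]
    return float(dist_map[rows, cols].mean())

# Perfect alignment gives a score of 0.
edges = np.zeros((10, 10), dtype=bool)
edges[5, 2:8] = True
contour = np.array([(5, c) for c in range(2, 8)])
print(chamfer_score(edges, contour))  # 0.0
```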

The basic form of chamfer matching is limited

Still images

We tested the pose-matching model with 103 natural images to observe the improvement due to the edge affinity model and cip. The test images had cluttered backgrounds, including brick walls, grass, parking lots, etc. The pose-database consisted of 1847 poses performed by five subjects (a subset of the database is shown in Fig. 11). The poses in the database are registered to one another w.r.t. the heads of the subjects. The test images were generated by four subjects, three of whom were not

Conclusion

We presented a model for combining edge continuity with contour matching, and illustrated its utility in the context of human pose matching. The experiments indicate that the model is able to characterize the inherent grouping of the edges—e.g., Fig. 5, Fig. 7. The tests show that the use of edge affinities leads to significant improvements in matching. This demonstrates the importance of perceptual organization for object recognition.

References (37)

  • J. Luo et al., Perceptual grouping of segmented regions in color images, Pattern Recogn. (2003)
  • A. Elgammal, V.D. Shet, Y. Yacoob, L.S. Davis, Learning dynamics for exemplar-based gesture recognition, in: Proc. IEEE...
  • M. W. Lee, I. Cohen, Proposal maps driven MCMC for estimating human body pose in static images, in: Proc. IEEE Conf....
  • D. Gavrila, Multi-feature hierarchical template matching using distance transforms, in: Proc. Int. Conf. Pattern...
  • A. Mohan et al., Example based object detection in images by components, IEEE Trans. Pattern Anal. Machine Intell. (2001)
  • G. Mori, X. Ren, A.A. Efros, J. Malik, Recovering human body configurations: Combining segmentation and recognition,...
  • E. Sharon, A. Brandt, R. Basri, Segmentation and boundary detection using multiscale intensity measurements, in: Proc....
  • C.F. Olson et al., Automatic target recognition by matching oriented edge pixels, IEEE Trans. Image Proc. (1997)
  • D. Gavrila, V. Philomin, Real-time object detection for “smart” vehicles, in: Proc. IEEE Int. Conf. Computer Vision...
  • K. Toyama, A. Blake, Probabilistic tracking in a metric space, in: Proc. Int. Conf. Computer Vision (ICCV’01), 2001,...
  • G. Mori, J. Malik, Estimating human body configurations using shape context matching, in: Proc. European Conf. Computer...
  • J. Zhang, R. Collins, Y. Liu, Representation and matching of articulated shapes, in: Proc. IEEE Conf. Computer Vision...
  • R. Ronfard, C. Schmid, B. Triggs, Learning to parse pictures of people, in: Proc. European Conf. Computer Vision...
  • P.L. Rosin, G. West, Salience distance transforms, Computer Vision, Graphics and Image Processing—Graphical Models and...
  • M.A. Butt et al., Optimum design of chamfer distance transforms, IEEE Trans. Image Proc. (1998)
  • B. Leibe, B. Schiele, Analyzing appearance and contour based methods for object categorization, in: Proc. IEEE Conf....
  • T.K. Leung, J. Malik, Contour continuity in region based image segmentation, in: Proc. European Conf. Computer Vision...
  • X. Ren, J. Malik, Learning a classification model for segmentation, in: Proc. IEEE Int. Conf. Computer Vision...
We thank the U.S. Government for supporting the research described in this paper.