Editor’s choice article
Canonical locality preserving Latent Variable Model for discriminative pose inference

https://doi.org/10.1016/j.imavis.2012.06.009

Abstract

Discriminative approaches for human pose estimation model the functional mapping, or conditional distribution, between image features and 3D poses. Learning such multi-modal models in high dimensional spaces, however, is challenging with limited training data; often resulting in over-fitting and poor generalization. To address these issues Latent Variable Models (LVMs) have been introduced. Shared LVMs learn a low dimensional representation of common causes that give rise to both the image features and the 3D pose. Discovering the shared manifold structure can, in itself, however, be challenging. In addition, shared LVM models are often non-parametric, requiring the model representation to be a function of the training set size. We present a parametric framework that addresses these shortcomings. In particular, we jointly learn latent spaces for both image features and 3D poses by maximizing the non-linear dependencies in the projected latent space, while preserving local structure in the original space; we then learn a multi-modal conditional density between these two low-dimensional spaces in the form of Gaussian Mixture Regression. With this model we can address the issue of over-fitting and generalization, since the data is denser in the learned latent space, as well as avoid the need for learning a shared manifold for the data. We quantitatively compare the performance of the proposed method to several state-of-the-art alternatives, and show that our method gives a competitive performance.

Highlights

► A parametric framework learns multi-modal models in high dimensional spaces.
► Latent variables are found that preserve local structure while maximizing correlation.
► Multi-modality is handled in the form of Gaussian Mixture Regression (GMR).

Introduction

Monocular pose estimation has been a focus of much research in computer vision due to abundance of applications for marker-less motion capture (MoCap) technologies. Marker-less MoCap spans a large number of application domains including entertainment, sport rehabilitation and training, activity recognition, human computer interaction and clinical analysis. Despite much research, however, monocular pose estimation remains a difficult task; challenges include high-dimensionality of the state space, image clutter, occlusion, lighting and appearance variation, to name a few.

Most prior methods can be classified into two classes of approaches: generative and discriminative. Generative approaches [1], [2] define an image formation model by predicting the appearance of the body x given a hypothesized state of the body (pose) y; an inference framework is then used to infer the posterior, p(y|x) ∝ p(x|y)p(y), over time. Since the inference often takes the form of a non-convex search in a high-dimensional space of body articulations, these methods are computationally expensive and can suffer from local convergence (typically requiring a good initial guess for pose to seed the inference).

Discriminative approaches [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] avoid building an explicit imaging model, and instead opt to learn a regression function, y = f(x), that maps from image features, x, to 3D poses, y, or, probabilistically, a conditional distribution p(y|x) directly. The main goal is to learn a model from labeled training data, {x(i), y(i)}, i = 1, …, N, that provides efficient and effective generalization for new examples at test time. The difficulty with this class of methods is two-fold: (1) the conditional probability of poses given image features, p(y|x), is typically multi-modal: the same image features can be explained by several 3D poses; and (2) learning high dimensional regression functions, or conditional distributions, using limited training data is challenging and often results in over-fitting. Here we focus on discriminative pose estimation.
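To make the discriminative setting concrete, the sketch below fits a simple uni-modal regressor from image features to pose vectors. All names, dimensions, and data in it are illustrative assumptions, not the paper's model; a single regressor of this kind is precisely what breaks down when p(y|x) is multi-modal.

```python
import numpy as np

# Illustrative only: a uni-modal linear regressor y = f(x) fit to paired
# training data {x(i), y(i)}; not the model proposed in this paper.
rng = np.random.default_rng(0)
N, D, P = 500, 100, 36          # examples, feature dim, pose dim (assumed sizes)
X = rng.normal(size=(N, D))     # image features (e.g., silhouette descriptors)
Y = rng.normal(size=(N, P))     # 3D poses (e.g., joint angles)

lam = 1e-2                      # ridge penalty to limit over-fitting
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)   # closed-form ridge fit

x_test = rng.normal(size=(1, D))
y_pred = x_test @ W             # a single pose estimate per feature vector:
                                # cannot represent several plausible poses
```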

To deal with multi-modality, on the parametric side, mixture models were introduced, e.g., Mixture of Regressors [4] or Mixture of Experts [14]. On the non-parametric side, local models that cluster data into convex sets and use uni-modal predictions within each cluster became popular (e.g., Local Gaussian Process Latent Variable Models (Local GPLVM) [16]). In both cases over-fitting and generalization remained an issue, due to the need for large training datasets, as noted in [12] (Fig. 1).

To alleviate the need for large labeled datasets, Latent Variable Models (LVMs) were introduced as an intermediate representation. Kanaujia et al. [11] proposed Spectral LVMs to learn a non-linear latent embedding of the 3D pose data and a separately trained mixture model to map from the image features to plausible latent positions in the sub-space. The relationship between the image features and the latent space, however, was assumed to be linear within each mixture component.

Most traditional LVMs attempt to preserve distances between points in the original high-dimensional space. For example, if two human poses are close in the original space (according to some predefined distance metric), their latent representatives in the low-dimensional space should also be close, and vice versa. Several latent variable techniques have been proposed to preserve the non-linear (or linear) structure of the high-dimensional data in the low-dimensional space. He et al. proposed Locality Preserving Projections (LPP) to find the optimal linear approximation to the eigenfunctions of the Laplace-Beltrami operator on the manifold [20]. Weinberger et al. presented Maximum Variance Unfolding (MVU), which preserves distances by learning a kernel matrix [21]. Song et al. extended this method with Colored Maximum Variance Unfolding, which maximizes the variance aligned with side information (e.g., label information), while preserving the local distance structure of the data [22]. However, all of these methods were introduced in the context of learning a single low-dimensional representation for the data (e.g., either image features or 3D poses, but not both); they contain no notion of an input-output relationship between the features and poses that would facilitate discriminative inference.
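As a rough illustration of the locality-preserving idea behind LPP [20], the sketch below builds a k-nearest-neighbour affinity graph and solves the generalized eigenproblem X^T L X a = λ X^T D X a for the projection directions. The parameter choices (k, heat-kernel width, regularizer) are assumptions for the sketch, not values from [20].

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def lpp(X, n_components=2, k=10, sigma=1.0, reg=1e-6):
    """Minimal Locality Preserving Projections sketch.
    X: (N, D) data matrix; returns a (D, n_components) projection."""
    dists = cdist(X, X)
    W = np.zeros_like(dists)
    # symmetric k-NN graph with heat-kernel weights
    for i in range(len(X)):
        nbrs = np.argsort(dists[i])[1:k + 1]
        W[i, nbrs] = np.exp(-dists[i, nbrs] ** 2 / (2 * sigma ** 2))
    W = np.maximum(W, W.T)
    d = W.sum(axis=1)
    L = np.diag(d) - W                                 # graph Laplacian
    A = X.T @ L @ X                                    # locality penalty to minimize
    B = X.T @ np.diag(d) @ X + reg * np.eye(X.shape[1])
    vals, vecs = eigh(A, B)                            # generalized eigenproblem
    return vecs[:, :n_components]                      # smallest eigenvalues preserve locality
```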

As an alternative, the Shared Gaussian Process Latent Variable Model (Shared GPLVM) was introduced in [12], [17], where the latent embedding was learned to preserve the joint structure of image features and 3D poses simultaneously; the forward non-linear mappings from the latent space to the input and output spaces were learned at the same time. Due to the lack of a backward mapping from the image features (and 3D poses) to the latent space, inference remained expensive, requiring multiple optimizations at a cost of O(N²), where N is the number of training examples. Shared Kernel Information Embeddings (sKIE) [18] provided closed-form mappings to and from the latent space, reducing the training and inference complexity by an order of magnitude. Shared GPLVM and sKIE are compelling, but are inherently non-parametric, with the model complexity being a function of the training set size; while this made them effective with small datasets, it prevented their use with larger ones. Alternatively, Supervised Local Subspace Learning (SL²) [23] can learn a non-linear mapping from the feature space to the pose space directly, without first learning a joint latent space. SL² re-samples the pose space and learns a mixture of local subspaces, one for each re-sampled point. This makes SL² robust to non-uniform distributions of the feature space. Unfortunately, in our case the pose space is high-dimensional, and it would be computationally expensive to uniformly sample the input space (i.e., the pose space).

Dimensionality reduction for regression (DRR) techniques, instead of learning a joint embedding, opt for learning a low-dimensional manifold embedding of the input data such that it preserves most, if not all, of the information necessary for regression to the desired output. One way to formulate the DRR task is using the notion of sufficiency in dimension reduction (SDR), which finds the subspace bases (or basis functions) such that, given the projected input, the outputs are independent of the original covariates. Manifold Kernel Dimensional Reduction (mKDR), presented in [24], is one such example; however, it involves a non-convex optimization and can therefore suffer from local minima. Alternatively, Covariance Operator Inverse Regression [25] generalizes Inverse Regression (IR) to non-linear input/output spaces without explicit target slicing, but it assumes that the inverse regression is a smooth function.

We extend our work in [19], and propose the Canonical Local Preserving Latent Variable Model (CLPLVM). Our formulation also extends [26], where traditional Canonical Correlation Analysis (CCA) was generalized to discover the low-dimensional manifold structure by maintaining the local information in multiple data sets. Similar to [26], we construct a cost function to find two sets of latent variables that preserve the local structure of the input image features and of the output 3D poses, respectively, in their original high-dimensional spaces, while simultaneously maximizing the correlation between related input and output latent variables.

Unlike [26], we also learn a multi-modal joint density model between the latent image features and the latent 3D poses, in the form of a Gaussian Mixture Model (GMM). The GMM allows us to deal with multi-modality in the data and to derive explicit conditional distributions for inference, in the form of Gaussian Mixture Regression (GMR).

Section snippets

Canonical Correlation Analysis (CCA)

CCA is a technique to extract common features from a pair of multivariate data sets. CCA, first proposed by Hotelling in 1936 [27], identifies relationships between two sets of variables by finding the linear combination of the variables in the first set (e.g., image features) X = {x_1, x_2, …, x_N} ∈ ℝ^(D×N) (see notation1
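For reference, a minimal numpy implementation of linear CCA in the sense just described: whiten each set and take the SVD of the cross-covariance. The small regularization term is an assumption added for numerical stability, not part of Hotelling's formulation.

```python
import numpy as np

def cca(X, Y, d, reg=1e-6):
    """Minimal linear CCA sketch.
    X: (N, Dx) image features, Y: (N, Dy) poses; returns projections Wx, Wy
    whose d canonical directions maximize correlation between X @ Wx and Y @ Wy."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    N = X.shape[0]
    Cxx = Xc.T @ Xc / N + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / N + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / N

    def inv_sqrt(C):                      # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :d]                    # canonical directions for X
    Wy = Ky @ Vt[:d].T                    # canonical directions for Y
    return Wx, Wy, s[:d]                  # s holds the canonical correlations
```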

Canonical Local Preserving Latent Variable Model

CCA models the relation between the observation variables and the target variables in linear latent spaces, while KCCA extends this relationship to non-linear latent spaces. As the intrinsic dimensionality of the data is typically much lower, regression between the latent spaces found by CCA and KCCA tends to perform better than regression in the original spaces of image features and 3D poses. This is especially the case when the training data is limited, as will be illustrated in Section 5. However,

Gaussian Mixture Regression

Given a latent observation vector, z_x ∈ ℝ^(d_x), and the corresponding latent 3D pose, z_y ∈ ℝ^(d_y), we assume the joint latent sample, (z_x, z_y), follows a Gaussian mixture distribution with K mixture components,

p(z_x, z_y) = Σ_{k=1}^{K} π_k p(z_x, z_y; μ_k, Λ_k)

where p(z_x, z_y; μ_k, Λ_k) is the multivariate Gaussian density function. The parameters of the model include the prior weights, π_k, the means, μ_k = [μ_{k,z_x}, μ_{k,z_y}]^T, and the covariances, Λ_k = [Λ_{k,z_x}, Λ_{k,z_x z_y}; Λ_{k,z_y z_x}, Λ_{k,z_y}], of each Gaussian component.

The joint density can be expressed as the sum
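As an illustration of this step, the sketch below fits a GMM to joint latent samples and forms the conditional mixture p(z_y | z_x) via the standard Gaussian conditioning formulas; the use of scikit-learn's GaussianMixture, the number of components, and the function names are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(Zx, Zy, K=5):
    """Fit a full-covariance GMM to joint latent samples (z_x, z_y)."""
    return GaussianMixture(n_components=K, covariance_type='full').fit(
        np.hstack([Zx, Zy]))

def gmr_conditional(gmm, zx, dx):
    """Gaussian Mixture Regression sketch: parameters of p(z_y | z_x = zx).
    Returns per-component weights, means, and covariances of the conditional."""
    weights, means, covs = [], [], []
    for pi_k, mu_k, L_k in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        mu_x, mu_y = mu_k[:dx], mu_k[dx:]
        Lxx, Lxy = L_k[:dx, :dx], L_k[:dx, dx:]
        Lyx, Lyy = L_k[dx:, :dx], L_k[dx:, dx:]
        gain = Lyx @ np.linalg.inv(Lxx)
        means.append(mu_y + gain @ (zx - mu_x))           # conditional mean
        covs.append(Lyy - gain @ Lxy)                     # conditional covariance
        weights.append(pi_k * multivariate_normal.pdf(zx, mu_x, Lxx))
    weights = np.asarray(weights)
    return weights / weights.sum(), means, covs           # responsibilities, modes
```

A single pose estimate can then be taken as the mean of the highest-weight component, or several hypotheses can be kept, one per component, which is how the mixture form accommodates multi-modality.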

Experiment

We tested the performance of our method on three datasets: (1) the Poser dataset, consisting of synthetic sequences produced by the Poser software [34]; (2) the CMU dataset, comprising real motion capture data and video publicly available from [35]; and (3) the standard dataset, with provided error metrics, made available by Agarwal and Triggs [3].

Conclusions and future work

In this paper, we presented a parametric discriminative framework for 3D pose inference. Our model has a number of appealing properties; mainly, it can: (1) model the complex structure of the image feature and pose manifolds, (2) keep local structure in the latent space, (3) deal with multi-modalities in the data, and (4) alleviate the need for learning costly shared non-linear non-parametric manifold models. We show that our performance is comparable or superior to parametric and non-parametric

References (37)

  • T. Sun et al., Locality preserving CCA with applications to data visualization and pose estimation, Image Vision Comput. (2007)
  • T. Melzer et al., Appearance models based on kernel canonical correlation analysis, Pattern Recognition
  • H. Sidenbladh et al., Stochastic tracking of 3d human figures using 2d image motion, IEEE Proc. Eur. Conf. Comput. Vis. (2000)
  • C. Sminchisescu et al., Covariance scaled sampling for monocular 3d body tracking, IEEE Proc. Comput. Vis. Pattern Recognit. (2001)
  • A. Agarwal et al., 3d human pose from silhouettes by relevance vector regression, IEEE Proc. Comput. Vis. Pattern Recognit. (2004)
  • A. Agarwal et al., Monocular human motion capture with a mixture of regressors, IEEE Workshop Comput. Vis. Pattern Recognit. (2005)
  • A. Bissacco et al., Fast human pose estimation using appearance and motion via multi-dimensional boosting regression, IEEE Proc. Comput. Vis. Pattern Recognit. (2007)
  • L. Bo et al., Structured output-associative regression, IEEE Proc. Comput. Vis. Pattern Recognit. (2009)
  • A.M. Elgammal et al., Inferring 3d body pose from silhouettes using activity manifold learning, IEEE Proc. Comput. Vis. Pattern Recognit. (2004)
  • A. Fathi et al., Human pose estimation using motion exemplars, IEEE Proc. Int. Conf. Comput. Vis. (2007)
  • F. Guo et al., Learning and inference of 3d human poses from Gaussian mixture modeled silhouettes, IEEE Proc. Int. Conf. Pattern Recognit. (2006)
  • T. Jaeggli et al., Learning generative models for multi-activity body pose estimation, Int. J. Comput. Vis. (2009)
  • A. Kanaujia et al., Spectral latent variable models for perceptual inference, IEEE Proc. Int. Conf. Comput. Vis. (2007)
  • R. Navaratnam et al., The joint manifold model for semi-supervised multi-valued regression, IEEE Proc. Int. Conf. Comput. Vis. (2007)
  • G. Shakhnarovich et al., Fast pose estimation with parameter-sensitive hashing, IEEE Proc. Int. Conf. Comput. Vis. (2003)
  • C. Sminchisescu et al., Discriminative density propagation for 3d human motion estimation, IEEE Proc. Comput. Vis. Pattern Recognit. (2005)
  • C. Sminchisescu et al., Learning joint top-down and bottom-up processes for 3d visual inference, IEEE Proc. Comput. Vis. Pattern Recognit. (2006)
  • R. Urtasun et al., Sparse probabilistic regression for activity-independent human pose inference, IEEE Proc. Comput. Vis. Pattern Recognit. (2008)

Editor's Choice Articles are invited and handled by a select rotating 12-member Editorial Board committee. This paper has been recommended for acceptance by Vladimir Pavlovic.
