Editor’s choice article
Canonical locality preserving Latent Variable Model for discriminative pose inference

https://doi.org/10.1016/j.imavis.2012.06.009

Abstract

Discriminative approaches for human pose estimation model the functional mapping, or conditional distribution, between image features and 3D poses. Learning such multi-modal models in high dimensional spaces, however, is challenging with limited training data; often resulting in over-fitting and poor generalization. To address these issues Latent Variable Models (LVMs) have been introduced. Shared LVMs learn a low dimensional representation of common causes that give rise to both the image features and the 3D pose. Discovering the shared manifold structure can, in itself, however, be challenging. In addition, shared LVM models are often non-parametric, requiring the model representation to be a function of the training set size. We present a parametric framework that addresses these shortcomings. In particular, we jointly learn latent spaces for both image features and 3D poses by maximizing the non-linear dependencies in the projected latent space, while preserving local structure in the original space; we then learn a multi-modal conditional density between these two low-dimensional spaces in the form of Gaussian Mixture Regression. With this model we can address the issue of over-fitting and generalization, since the data is denser in the learned latent space, as well as avoid the need for learning a shared manifold for the data. We quantitatively compare the performance of the proposed method to several state-of-the-art alternatives, and show that our method gives a competitive performance.

Highlights

► A parametric framework learns multi-modal models in high dimensional spaces.
► Latent variables are found that preserve local structure while maximizing correlation.
► Multi-modality is handled in the form of Gaussian Mixture Regression (GMR).

Introduction

Monocular pose estimation has been a focus of much research in computer vision due to abundance of applications for marker-less motion capture (MoCap) technologies. Marker-less MoCap spans a large number of application domains including entertainment, sport rehabilitation and training, activity recognition, human computer interaction and clinical analysis. Despite much research, however, monocular pose estimation remains a difficult task; challenges include high-dimensionality of the state space, image clutter, occlusion, lighting and appearance variation, to name a few.

Most prior methods can be classified into two classes of approaches: generative and discriminative. Generative approaches [1], [2] define an image formation model by predicting the appearance of the body x given a hypothesized state of the body (pose) y; an inference framework is then used to infer the posterior, p(y|x) ∝ p(x|y)p(y), over time. Since the inference often takes the form of a non-convex search in a high-dimensional space of body articulations, these methods are computationally expensive and can suffer from local convergence (typically requiring a good initial guess for pose to seed the inference).

Discriminative approaches [3], [4], [5], [6], [7], [8], [9], [10], [11], [12], [13], [14], [15], [16] avoid building an explicit imaging model, and instead opt to learn a regression function, y = f(x), that maps from image features, x, to 3D poses, y, or, probabilistically, a conditional distribution p(y|x) directly. The main goal is to learn a model from labeled training data, {x(i), y(i)}, i = 1, …, N, that provides efficient and effective generalization for new examples at test time. The difficulty with this class of methods is two-fold: (1) the conditional probability of poses given image features, p(y|x), is typically multi-modal: the same image features can be explained by several 3D poses; and (2) learning high dimensional regression functions, or conditional distributions, using limited training data is challenging and often results in over-fitting. Here we focus on discriminative pose estimation.
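To make the discriminative setting concrete, the sketch below fits a simple uni-modal regressor from image features to pose vectors. All names, dimensions, and data in it are illustrative assumptions, not the paper's model; a single regressor of this kind is precisely what breaks down when p(y|x) is multi-modal.

```python
import numpy as np

# Illustrative only: a uni-modal linear regressor y = f(x) fit to paired
# training data {x(i), y(i)}; not the model proposed in this paper.
rng = np.random.default_rng(0)
N, D, P = 500, 100, 36          # examples, feature dim, pose dim (assumed sizes)
X = rng.normal(size=(N, D))     # image features (e.g., silhouette descriptors)
Y = rng.normal(size=(N, P))     # 3D poses (e.g., joint angles)

lam = 1e-2                      # ridge penalty to limit over-fitting
W = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ Y)   # closed-form ridge fit

x_test = rng.normal(size=(1, D))
y_pred = x_test @ W             # a single pose estimate per feature vector:
                                # cannot represent several plausible poses
```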

To deal with multi-modality, on the parametric side, mixture models were introduced, e.g., Mixture of Regressors [4] or Mixture of Experts [14]. On the non-parametric side, local models that cluster data into convex sets and use uni-modal predictions within each cluster became popular (e.g., Local Gaussian Process Latent Variable Models (Local GPLVM) [16]). In both cases over-fitting and generalization remained an issue, due to the need for large training datasets, as noted in [12] (Fig. 1).

To alleviate the need for large labeled datasets, Latent Variable Models (LVMs) were introduced as an intermediate representation. Kanaujia et al. [11] proposed Spectral LVMs to learn a non-linear latent embedding of the 3D pose data and a separately trained mixture model to map from the image features to plausible latent positions in the sub-space. The relationship between the image features and the latent space, however, was assumed to be linear within each mixture component.

Most traditional LVMs attempt to preserve distances between points in the original high-dimensional space. For example, if two human poses are close in the original space (according to some predefined distance metric), their latent representatives in the low-dimensional space should also be close, and vice versa. Several latent variable techniques have been proposed to preserve the non-linear (or linear) structure of the high-dimensional data in the low-dimensional space. He et al. proposed Locality Preserving Projections (LPP) to find the optimal linear approximation to the eigenfunctions of the Laplace-Beltrami operator on the manifold [20]. Weinberger et al. presented Maximum Variance Unfolding (MVU), which preserves distances by learning a kernel matrix [21]. Song et al. extended this method with Colored Maximum Variance Unfolding, which maximizes the variance aligned with side information (e.g., label information), while preserving the local distance structure of the data [22]. However, all of these methods were introduced in the context of learning a single low-dimensional representation for the data (e.g., either image features or 3D poses, but not both); they contain no notion of an input-output relationship between the features and poses that would facilitate discriminative inference.
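As a rough illustration of the locality-preserving idea behind LPP [20], the sketch below builds a k-nearest-neighbour affinity graph and solves the generalized eigenproblem X^T L X a = λ X^T D X a for the projection directions. The parameter choices (k, heat-kernel width, regularizer) are assumptions for the sketch, not values from [20].

```python
import numpy as np
from scipy.spatial.distance import cdist
from scipy.linalg import eigh

def lpp(X, n_components=2, k=10, sigma=1.0, reg=1e-6):
    """Minimal Locality Preserving Projections sketch.
    X: (N, D) data matrix; returns a (D, n_components) projection."""
    dists = cdist(X, X)
    W = np.zeros_like(dists)
    # symmetric k-NN graph with heat-kernel weights
    for i in range(len(X)):
        nbrs = np.argsort(dists[i])[1:k + 1]
        W[i, nbrs] = np.exp(-dists[i, nbrs] ** 2 / (2 * sigma ** 2))
    W = np.maximum(W, W.T)
    d = W.sum(axis=1)
    L = np.diag(d) - W                                 # graph Laplacian
    A = X.T @ L @ X                                    # locality penalty to minimize
    B = X.T @ np.diag(d) @ X + reg * np.eye(X.shape[1])
    vals, vecs = eigh(A, B)                            # generalized eigenproblem
    return vecs[:, :n_components]                      # smallest eigenvalues preserve locality
```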

As an alternative, the Shared Gaussian Process Latent Variable Model (Shared GPLVM) was introduced in [12], [17], where the latent embedding was learned to preserve the joint structure of image features and 3D poses simultaneously; the forward non-linear mappings from the latent space to the input and output spaces were learned at the same time. Due to the lack of a backward mapping from the image features (and 3D poses) to the latent space, inference remained expensive, requiring multiple optimizations at a cost of O(N²), where N is the number of training examples. Shared Kernel Information Embeddings (sKIE) [18] provided closed-form mappings to and from the latent space, reducing the training and inference complexity by an order of magnitude. Shared GPLVM and sKIE are compelling, but are inherently non-parametric, with the model complexity being a function of the training set size; while this made them effective with small datasets, it prevented their use with larger ones. Alternatively, Supervised Local Subspace Learning (SL²) [23] can learn a non-linear mapping from the feature space to the pose space directly, without first learning a joint latent space. SL² re-samples the pose space and learns a mixture of local subspaces, one for each re-sampled point. This makes SL² robust to non-uniform distributions of the feature space. Unfortunately, in our case the pose space is high-dimensional, and it would be computationally expensive to uniformly sample the input space (i.e., the pose space).

Dimensionality reduction for regression (DRR) techniques, instead of learning a joint embedding, opt for learning a low-dimensional manifold embedding of the input data such that it preserves most, if not all, of the information necessary for regression to the desired output. One way to formulate the DRR task is using the notion of sufficiency in dimension reduction (SDR), which finds the subspace bases (or basis functions) such that, given the projected input, the outputs are independent of the original covariates. Manifold Kernel Dimensional Reduction (mKDR), presented in [24], is one such example; however, it involves a non-convex optimization and can therefore suffer from local minima. Alternatively, Covariance Operator Inverse Regression [25] generalizes Inverse Regression (IR) to non-linear input/output spaces without explicit target slicing, but it assumes that the inverse regression is a smooth function.

We extend our work in [19], and propose the Canonical Local Preserving Latent Variable Model (CLPLVM). Our formulation also extends [26], where traditional Canonical Correlation Analysis (CCA) was generalized to discover the low-dimensional manifold structure by maintaining the local information in multiple data sets. Similar to [26], we construct a cost function to find two sets of latent variables that preserve the local structure of the input image features and of the output 3D poses, respectively, in their original high-dimensional spaces, while simultaneously maximizing the correlation between related input and output latent variables.

Unlike [26], we also learn a multi-modal joint density model between the latent image features and the latent 3D poses, in the form of a Gaussian Mixture Model (GMM). The GMM allows us to deal with multi-modality in the data and to derive explicit conditional distributions for inference, in the form of Gaussian Mixture Regression (GMR).

Section snippets

Canonical Correlation Analysis (CCA)

CCA is a technique to extract common features from a pair of multivariate data sets. CCA, first proposed by Hotelling in 1936 [27], identifies relationships between two sets of variables by finding the linear combination of the variables in the first set (e.g., image features) X = {x_1, x_2, …, x_N} ∈ ℝ^(D×N) (see notation1
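For reference, a minimal numpy implementation of linear CCA in the sense just described: whiten each set and take the SVD of the cross-covariance. The small regularization term is an assumption added for numerical stability, not part of Hotelling's formulation.

```python
import numpy as np

def cca(X, Y, d, reg=1e-6):
    """Minimal linear CCA sketch.
    X: (N, Dx) image features, Y: (N, Dy) poses; returns projections Wx, Wy
    whose d canonical directions maximize correlation between X @ Wx and Y @ Wy."""
    Xc, Yc = X - X.mean(0), Y - Y.mean(0)
    N = X.shape[0]
    Cxx = Xc.T @ Xc / N + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / N + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / N

    def inv_sqrt(C):                      # symmetric inverse square root
        w, V = np.linalg.eigh(C)
        return V @ np.diag(1.0 / np.sqrt(w)) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :d]                    # canonical directions for X
    Wy = Ky @ Vt[:d].T                    # canonical directions for Y
    return Wx, Wy, s[:d]                  # s holds the canonical correlations
```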

Canonical Local Preserving Latent Variable Model

CCA models the relation between the observation variables and the target variables in linear latent spaces, while KCCA extends this relationship to non-linear latent spaces. As the intrinsic dimensionality of the data is typically much lower, regression between the latent spaces found by CCA and KCCA tends to perform better than regression in the original spaces of image features and 3D poses. This is especially the case when the training data is limited, as will be illustrated in Section 5. However,

Gaussian Mixture Regression

Given a latent observation vector, z_x ∈ ℝ^(d_x), and the corresponding latent 3D pose, z_y ∈ ℝ^(d_y), we assume the joint latent sample, (z_x, z_y), follows a Gaussian mixture distribution with K mixture components,

p(z_x, z_y) = Σ_{k=1}^{K} π_k p(z_x, z_y; μ_k, Λ_k)

where p(z_x, z_y; μ_k, Λ_k) is the multivariate Gaussian density function. The parameters of the model include the prior weights, π_k, the means, μ_k = [μ_{k,z_x}, μ_{k,z_y}]^T, and the covariances, Λ_k = [Λ_{k,z_x}, Λ_{k,z_x z_y}; Λ_{k,z_y z_x}, Λ_{k,z_y}], of each Gaussian component.

The joint density can be expressed as the sum
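As an illustration of this step, the sketch below fits a GMM to joint latent samples and forms the conditional mixture p(z_y | z_x) via the standard Gaussian conditioning formulas; the use of scikit-learn's GaussianMixture, the number of components, and the function names are assumptions for the sketch, not the paper's implementation.

```python
import numpy as np
from scipy.stats import multivariate_normal
from sklearn.mixture import GaussianMixture

def fit_joint_gmm(Zx, Zy, K=5):
    """Fit a full-covariance GMM to joint latent samples (z_x, z_y)."""
    return GaussianMixture(n_components=K, covariance_type='full').fit(
        np.hstack([Zx, Zy]))

def gmr_conditional(gmm, zx, dx):
    """Gaussian Mixture Regression sketch: parameters of p(z_y | z_x = zx).
    Returns per-component weights, means, and covariances of the conditional."""
    weights, means, covs = [], [], []
    for pi_k, mu_k, L_k in zip(gmm.weights_, gmm.means_, gmm.covariances_):
        mu_x, mu_y = mu_k[:dx], mu_k[dx:]
        Lxx, Lxy = L_k[:dx, :dx], L_k[:dx, dx:]
        Lyx, Lyy = L_k[dx:, :dx], L_k[dx:, dx:]
        gain = Lyx @ np.linalg.inv(Lxx)
        means.append(mu_y + gain @ (zx - mu_x))           # conditional mean
        covs.append(Lyy - gain @ Lxy)                     # conditional covariance
        weights.append(pi_k * multivariate_normal.pdf(zx, mu_x, Lxx))
    weights = np.asarray(weights)
    return weights / weights.sum(), means, covs           # responsibilities, modes
```

A single pose estimate can then be taken as the mean of the highest-weight component, or several hypotheses can be kept, one per component, which is how the mixture form accommodates multi-modality.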

Experiment

We tested the performance of our method on three datasets: (1) the Poser dataset, consisting of synthetic sequences produced by the Poser software [34]; (2) the CMU dataset, comprising real motion capture data and video publicly available from [35]; and (3) the standard dataset, with provided error metrics, made available by Agarwal and Triggs [3].

Conclusions and future work

In this paper, we presented a parametric discriminative framework for 3D pose inference. Our model has a number of appealing properties; mainly, it can: (1) model the complex structure of the image feature and pose manifolds, (2) keep local structure in the latent space, (3) deal with multi-modalities in the data, and (4) alleviate the need for learning costly shared non-linear non-parametric manifold models. We show that our performance is comparable or superior to parametric and non-parametric

References (37)

  • T. Sun et al., Locality preserving CCA with applications to data visualization and pose estimation, Image Vision Comput. (2007)
  • T. Melzer et al., Appearance models based on kernel canonical correlation analysis, Pattern Recognition
  • H. Sidenbladh et al., Stochastic tracking of 3d human figures using 2d image motion, IEEE Proc. Eur. Conf. Comput. Vis. (2000)
  • C. Sminchisescu et al., Covariance scaled sampling for monocular 3d body tracking, IEEE Proc. Comput. Vis. Pattern Recognit. (2001)
  • A. Agarwal et al., 3d human pose from silhouettes by relevance vector regression, IEEE Proc. Comput. Vis. Pattern Recognit. (2004)
  • A. Agarwal et al., Monocular human motion capture with a mixture of regressors, IEEE Workshop Comput. Vis. Pattern Recognit. (2005)
  • A. Bissacco et al., Fast human pose estimation using appearance and motion via multi-dimensional boosting regression, IEEE Proc. Comput. Vis. Pattern Recognit. (2007)
  • L. Bo et al., Structured output-associative regression, IEEE Proc. Comput. Vis. Pattern Recognit. (2009)
  • A.M. Elgammal et al., Inferring 3d body pose from silhouettes using activity manifold learning, IEEE Proc. Comput. Vis. Pattern Recognit. (2004)
  • A. Fathi et al., Human pose estimation using motion exemplars, IEEE Proc. Int. Conf. Comput. Vis. (2007)
  • F. Guo et al., Learning and inference of 3d human poses from Gaussian mixture modeled silhouettes, IEEE Proc. Int. Conf. Pattern Recognit. (2006)
  • T. Jaeggli et al., Learning generative models for multi-activity body pose estimation, Int. J. Comput. Vis. (2009)
  • A. Kanaujia et al., Spectral latent variable models for perceptual inference, IEEE Proc. Int. Conf. Comput. Vis. (2007)
  • R. Navaratnam et al., The joint manifold model for semi-supervised multi-valued regression, IEEE Proc. Int. Conf. Comput. Vis. (2007)
  • G. Shakhnarovich et al., Fast pose estimation with parameter-sensitive hashing, IEEE Proc. Int. Conf. Comput. Vis. (2003)
  • C. Sminchisescu et al., Discriminative density propagation for 3d human motion estimation, IEEE Proc. Comput. Vis. Pattern Recognit. (2005)
  • C. Sminchisescu et al., Learning joint top-down and bottom-up processes for 3d visual inference, IEEE Proc. Comput. Vis. Pattern Recognit. (2006)
  • R. Urtasun et al., Sparse probabilistic regression for activity-independent human pose inference, IEEE Proc. Comput. Vis. Pattern Recognit. (2008)

Editor's Choice Articles are invited and handled by a select rotating 12-member Editorial Board committee. This paper has been recommended for acceptance by Vladimir Pavlovic.
