Combining Generative and Discriminative Models in a Framework for Articulated Pose Estimation

Rosales, Rómer; Sclaroff, Stan

doi:10.1007/s11263-006-5165-4

Published: 01 March 2006

Volume 67, pages 251–276, (2006)

International Journal of Computer Vision

Abstract

We develop a method for the estimation of articulated pose, such as that of the human body or the human hand, from a single (monocular) image. Pose estimation is formulated as a statistical inference problem, where the goal is to find a posterior probability distribution over poses as well as a maximum a posteriori (MAP) estimate. The method combines two modeling approaches, one discriminative and the other generative. The discriminative model consists of a set of mapping functions that are constructed automatically from a labeled training set of body poses and their respective image features. The discriminative formulation allows for modeling ambiguous, one-to-many mappings (through the use of multi-modal distributions) that may yield multiple valid articulated pose hypotheses from a single image. The generative model is defined in terms of a computer graphics rendering of poses. While the generative model offers an accurate way to relate observed (image features) and hidden (body pose) random variables, it is difficult to use it directly in pose estimation, since inference is computationally intractable. In contrast, inference with the discriminative model is tractable, but considerably less accurate for the problem of interest. A combined discriminative/generative formulation is derived that leverages the complementary strengths of both models in a principled framework for articulated pose inference. Two efficient MAP pose estimation algorithms are derived from this formulation; the first is deterministic and the second non-deterministic. Performance of the framework is quantitatively evaluated in estimating articulated pose of both the human hand and human body.
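The interplay described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation. It assumes a scalar pose, a hand-written `render` function standing in for the graphics renderer, two hypothetical mapping functions in place of mappings learned from training data, a Gaussian feature likelihood with an arbitrary `sigma`, and a flat prior over the hypotheses. Each mapping proposes one pose hypothesis from the observed features, and the generative model picks the hypothesis whose rendering best explains the observation.

```python
import numpy as np

def render(pose):
    """Toy stand-in for a graphics renderer: pose -> feature vector (assumed form)."""
    return np.array([pose ** 2, 0.2 * pose])

# Hypothetical mapping functions. The first feature is ambiguous (pose and -pose
# give the same value), so each mapping covers one branch of the inverse.
def mapping_pos(features):
    return np.sqrt(max(features[0], 0.0))

def mapping_neg(features):
    return -np.sqrt(max(features[0], 0.0))

MAPPINGS = (mapping_pos, mapping_neg)

def log_likelihood(features, pose, sigma=0.1):
    """Gaussian log-likelihood (up to a constant) of features given a rendered pose."""
    residual = features - render(pose)
    return -0.5 * float(residual @ residual) / sigma ** 2

def map_estimate(features):
    """Propose one hypothesis per mapping, then score each with the generative model."""
    hypotheses = [m(features) for m in MAPPINGS]
    scores = [log_likelihood(features, h) for h in hypotheses]
    best = int(np.argmax(scores))  # flat prior: MAP equals the best likelihood
    return hypotheses[best], hypotheses, scores

observed = render(-1.5)  # features produced by a "true" pose of -1.5
pose_hat, hypotheses, scores = map_estimate(observed)
print(pose_hat, hypotheses, scores)
```

The sketch only evaluates a handful of candidates, which is why it stays tractable: the generative model is used to rank hypotheses rather than to search the full pose space.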



Author information

Authors and Affiliations

Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, 02139, USA
Rómer Rosales
Image and Video Computing Group, Dept. of Computer Science, Boston University, Boston, MA, 02215, USA
Stan Sclaroff

Corresponding author

Correspondence to Rómer Rosales.

Additional information

Most of this work was done while the first author was with Boston University.

About this article

Cite this article

Rosales, R., Sclaroff, S. Combining Generative and Discriminative Models in a Framework for Articulated Pose Estimation. Int J Comput Vision 67, 251–276 (2006). https://doi.org/10.1007/s11263-006-5165-4

Received: 07 April 2004
Revised: 18 April 2005
Accepted: 25 July 2005
Published: 01 March 2006
Issue Date: May 2006
DOI: https://doi.org/10.1007/s11263-006-5165-4
