Discriminative Appearance Models for Pictorial Structures

International Journal of Computer Vision

Abstract

In this paper we consider people detection and articulated pose estimation, two closely related and challenging problems in computer vision. Conceptually, both of these problems can be addressed within the pictorial structures framework (Felzenszwalb and Huttenlocher in Int. J. Comput. Vis. 61(1):55–79, 2005; Fischler and Elschlager in IEEE Trans. Comput. C-22(1):67–92, 1973), even though previous approaches have not shown such generality. A principal difficulty for such a general approach is to model the appearance of body parts. The model has to be discriminative enough to enable reliable detection in cluttered scenes and general enough to capture highly variable appearance. Therefore, as the first important component of our approach, we propose a discriminative appearance model based on densely sampled local descriptors and AdaBoost classifiers. Secondly, we interpret the normalized margin of each classifier as a likelihood in a generative model and compute marginal posteriors for each part using belief propagation. Thirdly, non-Gaussian relationships between parts are represented as Gaussians in the coordinate system of the joint between the parts. Finally, to cope with shortcomings of tree-based pictorial structures models, we augment our model with repulsive factors that discourage overcounting of image evidence. We demonstrate that the combination of these components within the pictorial structures framework results in a generic model that yields state-of-the-art performance for several datasets on a variety of tasks: people detection, upper body pose estimation, and full body pose estimation.
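
The abstract does not spell out how a classifier margin becomes a likelihood, so the following is only a minimal sketch of one plausible reading: the weighted vote of the AdaBoost weak learners is divided by the total weight, and the result is clipped from below so it stays positive. The function name `normalized_margin`, the `floor` constant, and the clipping step itself are assumptions made for illustration, not details taken from this page.

```python
from typing import Sequence


def normalized_margin(alphas: Sequence[float],
                      votes: Sequence[int],
                      floor: float = 1e-4) -> float:
    """Turn AdaBoost weak-learner votes into a pseudo-likelihood.

    alphas: non-negative weights of the weak learners.
    votes:  weak-learner outputs, each +1 or -1.
    floor:  hypothetical small positive constant that keeps the value
            usable as a likelihood when the margin is negative.
    """
    if len(alphas) != len(votes):
        raise ValueError("alphas and votes must have the same length")
    total = sum(alphas)
    if total <= 0:
        raise ValueError("the weak-learner weights must sum to a positive value")
    # Weighted vote divided by the total weight lies in [-1, 1].
    margin = sum(a * h for a, h in zip(alphas, votes)) / total
    # Clip from below so the result can be treated as an unnormalized likelihood.
    return max(margin, floor)
```

Under these assumptions, a part hypothesis on which every weak learner agrees scores 1.0, while one that the classifier rejects is mapped to the small floor value rather than to a negative number.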

References

  • Andriluka, M., Roth, S., & Schiele, B. (2008). People-tracking-by-detection and people-detection-by-tracking. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: people detection and articulated pose estimation. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Andriluka, M., Roth, S., & Schiele, B. (2010). Monocular 3D pose estimation and tracking by detection. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Belongie, S., Malik, J., & Puzicha, J. (2001). Shape context: a new descriptor for shape matching and object recognition. In Adv. in neur. inf. proc. sys. (NIPS).

  • Bergtholdt, M., Kappes, J., Schmidt, S., & Schnörr, C. (2009). A study of parts-based object class detection using complete graphs. International Journal of Computer Vision, 87(1–2), 93–117.

  • Bourdev, L., & Malik, J. (2009). Poselets: body part detectors trained using 3D human pose annotations. In IEEE int. conf. on comp. vis. (ICCV).

  • Buehler, P., Everingham, M., Huttenlocher, D. P., & Zisserman, A. (2008). Long term arm and hand tracking for continuous sign language TV broadcasts. In Brit. mach. vis. conf. (BMVC).

  • Crandall, D., Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Spatial priors for part-based recognition using statistical models. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Eichner, M., & Ferrari, V. (2009). Better appearance models for pictorial structures. In Brit. mach. vis. conf. (BMVC).

  • Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2008). The PASCAL visual object classes challenge 2008 (VOC2008) results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html.

  • Felzenszwalb, P. F., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.

  • Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.

  • Felzenszwalb, P. F., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Ferrari, V., Marin, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Ferrari, V., Marin, M., & Zisserman, A. (2009a). 2D human pose estimation in TV shows. In D. Cremers, B. Rosenhahn, A. L. Yuille, & F. R. Schmidt (Eds.), Lect. notes in comp. sci.: Vol. 5604. Statistical and geometrical approaches to visual motion analysis (pp. 128–147). Berlin: Springer.

  • Ferrari, V., Marin, M., & Zisserman, A. (2009b). Pose search: Retrieving people using their pose. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Fischler, M. A., & Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1), 67–92.

  • Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.

  • Gall, J., & Lempitsky, V. (2009). Class-specific hough forests for object detection. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Guan, P., Weiss, A., Balan, A., & Black, M. J. (2009). Estimating human shape and pose from a single image. In IEEE int. conf. on comp. vis. (ICCV).

  • Ionescu, C., Bo, L., & Sminchisescu, C. (2009). Structural SVM for visual localization and continuous state estimation. In IEEE int. conf. on comp. vis. (ICCV).

  • Jiang, H. (2009). Human pose estimation using consistent max-covering. In IEEE int. conf. on comp. vis. (ICCV).

  • Jiang, H., & Martin, D. R. (2008). Global pose estimation using non-tree models. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Jie, L., Caputo, B., & Ferrari, V. (2009). Who’s doing what: joint modeling of names and verbs for simultaneous face and pose annotation. In Adv. in neur. inf. proc. sys. (NIPS).

  • Johnson, S., & Everingham, M. (2009). Combining discriminative appearance and segmentation cues for articulated human pose estimation. In 2nd IEEE international workshop on machine learning for vision-based motion analysis.

  • Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.

  • Kumar, P., Zisserman, A., & Torr, P. H. S. (2009). Efficient discriminative learning of parts-based models. In IEEE int. conf. on comp. vis. (ICCV).

  • Lan, X., & Huttenlocher, D. P. (2005). Beyond trees: common-factor models for 2D human pose recovery. In IEEE int. conf. on comp. vis. (ICCV).

  • Lee, H.-J., & Chen, Z. (1985). Determination of 3D human body postures from a single view. Computer Vision, Graphics, and Image Processing, 30, 148–168.

  • Lee, M. W., & Cohen, I. (2004). Proposal maps driven MCMC for estimating human body pose in static images. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Leibe, B., Seemann, E., & Schiele, B. (2005). Pedestrian detection in crowded scenes. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.

  • Mikolajczyk, K., Leibe, B., & Schiele, B. (2006). Multiple object class detection with a generative model. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1615–1630.

  • Mooij, J. M. (2009). libDAI 0.2.2: a free/open source C++ library for discrete approximate inference. http://www.libdai.org/.

  • Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference (2nd ed.). San Francisco: Morgan Kaufmann.

  • Ramanan, D. (2007). Learning to parse images of articulated objects. In Adv. in neur. inf. proc. sys. (NIPS).

  • Ramanan, D., & Sminchisescu, C. (2006). Training deformable models for localization. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Ren, X., Berg, A. C., & Malik, J. (2005). Recovering human body configurations using pairwise constraints between parts. In IEEE int. conf. on comp. vis. (ICCV).

  • Ronfard, R., Schmid, C., & Triggs, B. (2002). Learning to parse pictures of people. In Eur. conf. on comp. vis. (ECCV).

  • Roth, S., & Black, M. J. (2009). Fields of experts. International Journal of Computer Vision, 82(2), 205–229.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). “Grabcut”: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23, 309–314.

  • Sapp, B., Jordan, C., & Taskar, B. (2010). Adaptive pose priors for pictorial structures. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Sigal, L., & Black, M. J. (2006). Measure locally, reason globally: occlusion-sensitive articulated pose estimation. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Sigal, L., & Black, M. J. (2006). Predicting 3D people from 2D pictures. In AMDO.

  • Sudderth, E. B., Mandel, M. I., Freeman, W. T., & Willsky, A. S. (2005). Distributed occlusion reasoning for tracking with nonparametric belief propagation. In Adv. in neur. inf. proc. sys. (NIPS).

  • Taylor, C. J. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Computer Vision and Image Understanding, 80, 349–363.

  • Tran, D., & Forsyth, D. (2008). Configuration estimates improve pedestrian finding. In Adv. in neur. inf. proc. sys. (NIPS).

  • Tu, Z., Chen, X., Yuille, A. L., & Zhu, S.-C. (2005). Image parsing: unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2), 113–140.

  • Urtasun, R., Fleet, D. J., & Fua, P. (2006). 3D people tracking with Gaussian process dynamical models. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Viola, P., Jones, M., & Snow, D. (2003). Detecting pedestrians using patterns of motion and appearance. In IEEE int. conf. on comp. vis. (ICCV).

  • Wang, Y., & Mori, G. (2008). Multiple tree models for occlusion and spatial constraints in human pose estimation. In Eur. conf. on comp. vis. (ECCV).

  • Yao, B., & Fei-Fei, L. (2010). Modeling mutual context of object and human pose in human-object interaction activities. In Eur. conf. on comp. vis. (ECCV).

  • Zhang, J., Luo, J., Collins, R., & Liu, Y. (2006). Body localization in still images using hierarchical models and hybrid search. In IEEE conf. on comp. vis. and pat. recog. (CVPR).

  • Zhang, X., Li, C., Tong, X., Hu, W., Maybank, S., & Zhang, Y. (2009). Efficient human pose estimation via parsing a tree structure based human model. In IEEE int. conf. on comp. vis. (ICCV).

Author information

Corresponding author

Correspondence to Mykhaylo Andriluka.

About this article

Cite this article

Andriluka, M., Roth, S. & Schiele, B. Discriminative Appearance Models for Pictorial Structures. Int J Comput Vis 99, 259–280 (2012). https://doi.org/10.1007/s11263-011-0498-z

  • DOI: https://doi.org/10.1007/s11263-011-0498-z
