Abstract
In this paper we consider people detection and articulated pose estimation, two closely related and challenging problems in computer vision. Conceptually, both problems can be addressed within the pictorial structures framework (Felzenszwalb and Huttenlocher in Int. J. Comput. Vis. 61(1):55–79, 2005; Fischler and Elschlager in IEEE Trans. Comput. C-22(1):67–92, 1973), even though previous approaches have not shown such generality. A principal difficulty for such a general approach is modeling the appearance of body parts: the model has to be discriminative enough to enable reliable detection in cluttered scenes and general enough to capture highly variable appearance. Therefore, as the first important component of our approach, we propose a discriminative appearance model based on densely sampled local descriptors and AdaBoost classifiers. Secondly, we interpret the normalized margin of each classifier as a likelihood in a generative model and compute marginal posteriors for each part using belief propagation. Thirdly, non-Gaussian relationships between parts are represented as Gaussians in the coordinate system of the joint between the parts. Finally, to cope with shortcomings of tree-based pictorial structures models, we augment our model with repulsive factors that discourage overcounting of image evidence. We demonstrate that the combination of these components within the pictorial structures framework results in a generic model that yields state-of-the-art performance on several datasets and a variety of tasks: people detection, upper body pose estimation, and full body pose estimation.
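As an illustration of the second component, the following minimal sketch shows one plausible way to turn the output of a boosted part classifier into a normalized margin that can be used as a pseudo-likelihood. The function name `normalized_margin`, the small positive `floor` constant, and the toy inputs are assumptions made for this example and are not taken from the abstract; the floor is only one possible way to keep the scores positive so that they can serve as unary factors in sum-product belief propagation.

```python
import numpy as np

def normalized_margin(weak_outputs, alphas, floor=1e-4):
    """Normalized AdaBoost margin, clipped from below (illustrative sketch).

    weak_outputs: (N, T) array of weak-classifier decisions in {-1, +1},
                  one row per candidate part location.
    alphas:       (T,) array of positive AdaBoost weights.
    floor:        hypothetical small constant that keeps every score positive.
    """
    weak_outputs = np.asarray(weak_outputs, dtype=float)
    alphas = np.asarray(alphas, dtype=float)
    # Weighted vote divided by the total weight lies in [-1, 1].
    margin = weak_outputs @ alphas / alphas.sum()
    # Clip so that negative margins still yield a valid positive score.
    return np.maximum(margin, floor)

# Toy example: three candidate locations, three weak classifiers.
outputs = np.array([[ 1,  1,  1],
                    [ 1, -1,  1],
                    [-1, -1, -1]])
alphas = np.array([0.5, 0.3, 0.2])
scores = normalized_margin(outputs, alphas)
```

In this toy example the first location receives the maximum score because all weak classifiers agree, while the third is clipped to the floor value instead of receiving a negative score.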
References
Andriluka, M., Roth, S., & Schiele, B. (2008). People-tracking-by-detection and people-detection-by-tracking. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Andriluka, M., Roth, S., & Schiele, B. (2009). Pictorial structures revisited: people detection and articulated pose estimation. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Andriluka, M., Roth, S., & Schiele, B. (2010). Monocular 3D pose estimation and tracking by detection. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Belongie, S., Malik, J., & Puzicha, J. (2001). Shape context: a new descriptor for shape matching and object recognition. In Adv. in neur. inf. proc. sys. (NIPS).
Bergtholdt, M., Kappes, J., Schmidt, S., & Schnörr, C. (2009). A study of parts-based object class detection using complete graphs. International Journal of Computer Vision, 87(1–2), 93–117.
Bourdev, L., & Malik, J. (2009). Poselets: body part detectors trained using 3D human pose annotations. In IEEE int. conf. on comp. vis. (ICCV).
Buehler, P., Everingham, M., Huttenlocher, D. P., & Zisserman, A. (2008). Long term arm and hand tracking for continuous sign language TV broadcasts. In Brit. mach. vis. conf. (BMVC).
Crandall, D., Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Spatial priors for part-based recognition using statistical models. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Eichner, M., & Ferrari, V. (2009). Better appearance models for pictorial structures. In Brit. mach. vis. conf. (BMVC).
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., & Zisserman, A. (2008). The PASCAL visual object classes challenge 2008 (VOC2008) results. http://www.pascal-network.org/challenges/VOC/voc2008/workshop/index.html.
Felzenszwalb, P. F., Girshick, R., McAllester, D., & Ramanan, D. (2010). Object detection with discriminatively trained part based models. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9), 1627–1645.
Felzenszwalb, P. F., & Huttenlocher, D. P. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.
Felzenszwalb, P. F., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Ferrari, V., Marin, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Ferrari, V., Marin, M., & Zisserman, A. (2009a). 2D human pose estimation in TV shows. In D. Cremers, B. Rosenhahn, A. L. Yuille, & F. R. Schmidt (Eds.), Lect. notes in comp. sci.: Vol. 5604. Statistical and geometrical approaches to visual motion analysis (pp. 128–147). Berlin: Springer.
Ferrari, V., Marin, M., & Zisserman, A. (2009b). Pose search: Retrieving people using their pose. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Fischler, M. A., & Elschlager, R. A. (1973). The representation and matching of pictorial structures. IEEE Transactions on Computers, C-22(1), 67–92.
Freund, Y., & Schapire, R. (1997). A decision-theoretic generalization of on-line learning and an application to boosting. Journal of Computer and System Sciences, 55(1), 119–139.
Gall, J., & Lempitsky, V. (2009). Class-specific hough forests for object detection. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Guan, P., Weiss, A., Balan, A., & Black, M. J. (2009). Estimating human shape and pose from a single image. In IEEE int. conf. on comp. vis. (ICCV).
Ionescu, C., Bo, L., & Sminchisescu, C. (2009). Structural SVM for visual localization and continuous state estimation. In IEEE int. conf. on comp. vis. (ICCV).
Jiang, H. (2009). Human pose estimation using consistent max-covering. In IEEE int. conf. on comp. vis. (ICCV).
Jiang, H., & Martin, D. R. (2008). Global pose estimation using non-tree models. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Jie, L., Caputo, B., & Ferrari, V. (2009). Who’s doing what: joint modeling of names and verbs for simultaneous face and pose annotation. In Adv. in neur. inf. proc. sys. (NIPS).
Johnson, S., & Everingham, M. (2009). Combining discriminative appearance and segmentation cues for articulated human pose estimation. In 2nd IEEE international workshop on machine learning for vision-based motion analysis.
Kschischang, F. R., Frey, B. J., & Loeliger, H.-A. (2001). Factor graphs and the sum-product algorithm. IEEE Transactions on Information Theory, 47(2), 498–519.
Kumar, P., Zisserman, A., & Torr, P. H. S. (2009). Efficient discriminative learning of parts-based models. In IEEE int. conf. on comp. vis. (ICCV).
Lan, X., & Huttenlocher, D. P. (2005). Beyond trees: common-factor models for 2D human pose recovery. In IEEE int. conf. on comp. vis. (ICCV).
Lee, H.-J., & Chen, Z. (1985). Determination of 3D human body postures from a single view. Computer Vision, Graphics, and Image Processing, 30, 148–168.
Lee, M. W., & Cohen, I. (2004). Proposal maps driven MCMC for estimating human body pose in static images. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Leibe, B., Seemann, E., & Schiele, B. (2005). Pedestrian detection in crowded scenes. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision, 60(2), 91–110.
Mikolajczyk, K., Leibe, B., & Schiele, B. (2006). Multiple object class detection with a generative model. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Mikolajczyk, K., & Schmid, C. (2005). A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(10), 1615–1630.
Mooij, J. M. (2009). libDAI 0.2.2: a free/open source C++ library for discrete approximate inference. http://www.libdai.org/.
Pearl, J. (1988). Probabilistic reasoning in intelligent systems: networks of plausible inference (2nd ed.). San Francisco: Morgan Kaufmann.
Ramanan, D. (2007). Learning to parse images of articulated objects. In Adv. in neur. inf. proc. sys. (NIPS).
Ramanan, D., & Sminchisescu, C. (2006). Training deformable models for localization. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Ren, X., Berg, A. C., & Malik, J. (2005). Recovering human body configurations using pairwise constraints between parts. In IEEE int. conf. on comp. vis. (ICCV).
Ronfard, R., Schmid, C., & Triggs, B. (2002). Learning to parse pictures of people. In Eur. conf. on comp. vis. (ECCV).
Roth, S., & Black, M. J. (2009). Fields of experts. International Journal of Computer Vision, 82(2), 205–229.
Rother, C., Kolmogorov, V., & Blake, A. (2004). “Grabcut”: interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23, 309–314.
Sapp, B., Jordan, C., & Taskar, B. (2010). Adaptive pose priors for pictorial structures. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Sigal, L., & Black, M. J. (2006). Measure locally, reason globally: occlusion-sensitive articulated pose estimation. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Sigal, L., & Black, M. J. (2006). Predicting 3D people from 2D pictures. In AMDO.
Sudderth, E. B., Mandel, M. I., Freeman, W. T., & Willsky, A. S. (2005). Distributed occlusion reasoning for tracking with nonparametric belief propagation. In Adv. in neur. inf. proc. sys. (NIPS).
Taylor, C. J. (2000). Reconstruction of articulated objects from point correspondences in a single uncalibrated image. Computer Vision and Image Understanding, 80, 349–363.
Tran, D., & Forsyth, D. (2008). Configuration estimates improve pedestrian finding. In Adv. in neur. inf. proc. sys. (NIPS).
Tu, Z., Chen, X., Yuille, A. L., & Zhu, S.-C. (2005). Image parsing: unifying segmentation, detection, and recognition. International Journal of Computer Vision, 63(2), 113–140.
Urtasun, R., Fleet, D. J., & Fua, P. (2006). 3D people tracking with Gaussian process dynamical models. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Viola, P., Jones, M., & Snow, D. (2003). Detecting pedestrians using patterns of motion and appearance. In IEEE int. conf. on comp. vis. (ICCV).
Wang, Y., & Mori, G. (2008). Multiple tree models for occlusion and spatial constraints in human pose estimation. In Eur. conf. on comp. vis. (ECCV).
Yao, B., & Fei-Fei, L. (2010). Modeling mutual context of object and human pose in human-object interaction activities. In Eur. conf. on comp. vis. (ECCV).
Zhang, J., Luo, J., Collins, R., & Liu, Y. (2006). Body localization in still images using hierarchical models and hybrid search. In IEEE conf. on comp. vis. and pat. recog. (CVPR).
Zhang, X., Li, C., Tong, X., Hu, W., Maybank, S., & Zhang, Y. (2009). Efficient human pose estimation via parsing a tree structure based human model. In IEEE int. conf. on comp. vis. (ICCV).
Cite this article
Andriluka, M., Roth, S. & Schiele, B. Discriminative Appearance Models for Pictorial Structures. Int J Comput Vis 99, 259–280 (2012). https://doi.org/10.1007/s11263-011-0498-z
DOI: https://doi.org/10.1007/s11263-011-0498-z