
Automatic and Efficient Human Pose Estimation for Sign Language Videos

International Journal of Computer Vision

Abstract

We present a fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length. To achieve this, we make contributions in four areas: (i) we show that the overlaid signer can be separated from the background TV broadcast using co-segmentation over all frames with a layered model; (ii) we show that joint positions (shoulders, elbows, wrists) can be predicted per-frame using a random forest regressor given only this segmentation and a colour model; (iii) we show that the random forest can be trained from an existing semi-automatic, but computationally expensive, tracker; and (iv) we introduce an evaluator to assess whether the predicted joint positions are correct for each frame. The method is applied to 20 signing footage videos with changing backgrounds, challenging imaging conditions, and different signers. Our framework outperforms the state-of-the-art long-term tracker by Buehler et al. (International Journal of Computer Vision 95:180–197, 2011), does not require the manual annotation of that work, and, after automatic initialisation, performs tracking in real time. We also achieve superior joint localisation results to those obtained using the pose estimation method of Yang and Ramanan (Proceedings of the IEEE conference on computer vision and pattern recognition, 2011).
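
To make the per-frame prediction of step (ii) and the training-from-a-tracker idea of step (iii) concrete, the sketch below shows one way such a pipeline could be assembled with an off-the-shelf random forest regressor. The feature extraction (extract_frame_features), grid size, joint count, and the synthetic training data are hypothetical placeholders chosen purely for illustration; they are not the authors' implementation, which operates on the co-segmentation and colour model described in the paper.

```python
# Minimal, illustrative sketch (not the authors' implementation): train a random
# forest regressor on frames labelled by an existing (semi-automatic) tracker,
# then predict 2D joint positions per frame from a signer segmentation and a
# colour-model posterior. Feature design and sizes here are assumptions.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

N_JOINTS = 6   # assumed joint set: left/right shoulders, elbows, wrists
GRID = 16      # assumed down-sampling grid for per-frame features

def extract_frame_features(segmentation, colour_posterior, grid=GRID):
    """Hypothetical per-frame features: segmentation mask and colour-model
    posterior down-sampled to a coarse grid and flattened into one vector."""
    step_y = segmentation.shape[0] // grid
    step_x = segmentation.shape[1] // grid
    seg = segmentation[::step_y, ::step_x][:grid, :grid]
    col = colour_posterior[::step_y, ::step_x][:grid, :grid]
    return np.concatenate([seg.ravel(), col.ravel()]).astype(np.float32)

rng = np.random.default_rng(0)

# Training set: one feature vector per frame already labelled by the existing
# tracker, with (x, y) coordinates of each joint as the regression target.
# Random placeholders stand in for real frames and tracker output.
X_train = rng.random((500, 2 * GRID * GRID))
y_train = rng.random((500, 2 * N_JOINTS))   # normalised joint coordinates

forest = RandomForestRegressor(n_estimators=50, max_depth=12, n_jobs=-1)
forest.fit(X_train, y_train)                # multi-output regression

# Test time: each new frame yields one feature vector and one pose estimate.
seg = (rng.random((256, 256)) > 0.5).astype(np.float32)   # placeholder mask
col = rng.random((256, 256)).astype(np.float32)           # placeholder posterior
pose = forest.predict(extract_frame_features(seg, col)[None, :])
print(pose.reshape(N_JOINTS, 2))            # predicted (x, y) per joint
```

Because each frame is processed independently by the forest, prediction is cheap once training is complete, which is consistent with the real-time tracking reported after automatic initialisation.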


Notes

  1. http://www.robots.ox.ac.uk/~vgg/research/sign_language

References

  • Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9(7), 1545–1588.

  • Andriluka, M., Roth, S., & Schiele, B. (2012). Discriminative appearance models for pictorial structures. International Journal of Computer Vision, 99(3), 259–280.

  • Apostoloff, N. E., & Zisserman, A. (2007). Who are you?—real-time person identification. In Proceedings of the British machine vision conference.

  • Benfold, B., & Reid, I. (2008). Colour invariant head pose classification in low resolution video. In Proceedings of the British machine vision conference.

  • Bosch, A., Zisserman, A., & Munoz, X. (2007). Image classification using random forests and ferns. In Proceedings of the international conference on computer vision.

  • Bowden, R., Windridge, D., Kadir, T., Zisserman, A., & Brady, J. M. (2004). A linguistic feature vector for the visual interpretation of sign language. In Proceedings of the European conference on computer vision. Berlin: Springer.

  • Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.

  • Buehler, P., Everingham, M., Huttenlocher, D. P., & Zisserman, A. (2011). Upper body detection and tracking in extended signing sequences. International Journal of Computer Vision, 95(2), 180–197.

  • Buehler, P., Everingham, M., & Zisserman, A. (2009). Learning sign language by watching TV (using weakly aligned subtitles). In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Buehler, P., Everingham, M., & Zisserman, A. (2010). Employing signed TV broadcasts for automated learning of British sign language. In Workshop on representation and processing of sign languages.

  • Chai, Y., Lempitsky, V., & Zisserman, A. (2011). BiCoS: A bi-level co-segmentation method for image classification. In Proceedings of the international conference on computer vision.

  • Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L., & Zisserman, A. (2012). TriCoS: A tri-level class-discriminative co-segmentation method for image classification. In Proceedings of the European conference on computer vision.

  • Charles, J., Pfister, T., Magee, D., Hogg, D., & Zisserman, A. (2013). Domain adaptation for upper body pose tracking in signed TV broadcasts. In Proceedings of the British machine vision conference.

  • Chunli, W., Wen, G., & Jiyong, M. (2002). A real-time large vocabulary recognition system for Chinese Sign Language. In Gesture and sign language in human-computer interaction.

  • Cooper, H., & Bowden, R. (2007). Large lexicon detection of sign language. In Workshop on human computer interaction.

  • Cooper, H., & Bowden, R. (2009). Learning signs from subtitles: A weakly supervised approach to sign language recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Cootes, T., Ionita, M., Lindner, C., & Sauer, P. (2012). Robust and accurate shape model fitting using random forest regression voting. In Proceedings of the European conference on computer vision.

  • Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2), 81–227.

  • Criminisi, A., Shotton, J., Robertson, D., & Konukoglu, E. (2011). Regression forests for efficient anatomy detection and localization in CT studies. In International conference on medical image computing and computer assisted intervention workshop on probabilistic models for medical image analysis.

  • Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Dantone, M., Gall, J., Fanelli, G., & Van Gool, L. (2012). Real-time facial feature detection using conditional regression forests. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Dreuw, P., Deselaers, T., Rybach, D., Keysers, D., & Ney, H. (2006). Tracking using dynamic programming for appearance-based sign language recognition. In Proceedings of the IEEE conference on automatic face and gesture recognition.

  • Dreuw, P., Forster, J., & Ney, H. (2012). Tracking benchmark databases for video-based sign language recognition. In Trends and topics in computer vision (pp. 286–297). Berlin: Springer.

  • Eichner, M., & Ferrari, V. (2009). Better appearance models for pictorial structures. In Proceedings of the British machine vision conference.

  • Eichner, M., Marin-Jimenez, M., Zisserman, A., & Ferrari, V. (2012). 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. International Journal of Computer Vision, 1–25.

  • Fanelli, G., Dantone, M., Gall, J., Fossati, A., & Van Gool, L. (2012). Random forests for real time 3D face analysis. International Journal of Computer Vision, 101(3), 1–22.

  • Fanelli, G., Gall, J., & Van Gool, L. (2011). Real time head pose estimation with random regression forests. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Farhadi, A., & Forsyth, D. (2006). Aligning ASL for statistical translation using a discriminative word model. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Farhadi, A., Forsyth, D., & White, R. (2007). Transfer learning in sign language. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Felzenszwalb, P., Girshick, R., & McAllester, D. (2010). Cascade object detection with deformable part models. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Felzenszwalb, P., & Huttenlocher, D. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.

  • Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Gall, J., & Lempitsky, V. (2009). Class-specific hough forests for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Geremia, E., Clatz, O., Menze, B., Konukoglu, E., Criminisi, A., & Ayache, N. (2011). Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonance images. NeuroImage, 57(2), 378–390.

  • Girshick, R., Shotton, J., Kohli, P., Criminisi, A., & Fitzgibbon, A. (2011). Efficient regression of general-activity human poses from depth images. In Proceedings of the international conference on computer vision.

  • Hochbaum, D., & Singh, V. (2009). An efficient algorithm for co-segmentation. In Proceedings of the international conference on computer vision.

  • Jammalamadaka, N., Zisserman, A., Eichner, M., Ferrari, V., & Jawahar, C. V. (2012). Has my algorithm succeeded? An evaluator for human pose estimators. In Proceedings of the European conference on computer vision.

  • Johnson, S., & Everingham, M. (2009). Combining discriminative appearance and segmentation cues for articulated human pose estimation. In IEEE international workshop on machine learning for vision-based motion analysis.

  • Jojic, N., & Frey, B. (2001). Learning flexible sprites in video layers. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Joulin, A., Bach, F., & Ponce, J. (2010). Discriminative clustering for image co-segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Kadir, T., Bowden, R., Ong, E., & Zisserman, A. (2004). Minimal training, large lexicon, unconstrained sign language recognition. In Proceedings of the British machine vision conference.

  • Kadir, T., Zisserman, A., & Brady, J. M. (2004). An affine invariant salient region detector. In Proceedings of the European conference on computer vision.

  • Kontschieder, P., Bulò, S., Criminisi, A., Kohli, P., Pelillo, M., & Bischof, H. (2012). Context-sensitive decision forests for object detection. In Advances in neural information processing systems.

  • Kumar, M. P., Torr, P. H. S., & Zisserman, A. (2008). Learning layered motion segmentations of video. International Journal of Computer Vision, 76, 301–319.

  • Lepetit, V., & Fua, P. (2006). Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1465–1479.

  • Liu, C., Gong, S., Loy, C., & Lin, X. (2012). Person re-identification: What features are important? In Proceedings of the European conference on computer vision.

  • Marée, R., Geurts, P., Piater, J., & Wehenkel, L. (2005). Random subwindows for robust image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Moeslund, T. (2011). Visual analysis of humans: Looking at people. Berlin: Springer.

  • Nowozin, S., Rother, C., Bagon, S., Sharp, T., Yao, B., & Kohli, P. (2011). Decision tree fields. In Proceedings of the international conference on computer vision.

  • Ong, E., & Bowden, R. (2004). A boosted classifier tree for hand shape detection. In Proceedings of the international conference on automatic face and gesture recognition.

  • Ozuysal, M., Calonder, M., Lepetit, V., & Fua, P. (2010). Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 448–461.

  • Pfister, T., Charles, J., Everingham, M., & Zisserman, A. (2012). Automatic and efficient long term arm and hand tracking for continuous sign language TV broadcasts. In Proceedings of the British machine vision conference.

  • Pfister, T., Charles, J., & Zisserman, A. (2013). Large-scale learning of sign language by watching TV (using co-occurrences). In Proceedings of the British machine vision conference.

  • Ramanan, D. (2006). Learning to parse images of articulated bodies. In Advances in neural information processing systems.

  • Ramanan, D., Forsyth, D. A., & Zisserman, A. (2007). Tracking people by learning their appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 65–81.

  • Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: Interactive foreground extraction using iterated graph cuts. In Proceedings of the ACM SIGGRAPH conference on computer graphics.

  • Rother, C., Minka, T., Blake, A., & Kolmogorov, V. (2006). Cosegmentation of image pairs by histogram matching-incorporating a global constraint into MRFs. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Santner, J., Leistner, C., Saffari, A., Pock, T., & Bischof, H. (2010). PROST: Parallel robust online simple tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Sapp, B., Jordan, C., & Taskar, B. (2010). Adaptive pose priors for pictorial structures. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Sapp, B., Weiss, D., & Taskar, B. (2011). Parsing human motion with stretchable models. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Sharp, T. (2008). Implementing decision trees and forests on a GPU. In Proceedings of the European conference on computer vision.

  • Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., et al. (2011). Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Sivic, J., Zitnick, C. L., & Szeliski, R. (2006). Finding people in repeated shots of the same scene. In Proceedings of the British machine vision conference, Edinburgh.

  • Starner, T., Weaver, J., & Pentland, A. (1998). Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1371–1375.

  • Sun, M., Kohli, P., & Shotton, J. (2012). Conditional regression forests for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Szeliski, R., Avidan, S., & Anandan, P. (2000). Layer extraction from multiple images containing reflections and transparency. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Taylor, J., Shotton, J., Sharp, T., & Fitzgibbon, A. (2012). The vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Tran, D., & Forsyth, D. (2010). Improved human parsing with a full relational model. In Proceedings of the European conference on computer vision.

  • Vogler, C., & Metaxas, D. (1998). ASL recognition based on a coupling between HMMs and 3D motion analysis. In Proceedings of the international conference on computer vision.

  • Yang, Y., & Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Yin, P., Criminisi, A., Winn, J., & Essa, I. (2007). Tree-based classifiers for bilayer video segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition.

  • Zisserman, A., Winn, J., Fitzgibbon, A., van Gool, L., Sivic, J., Williams, C., & Hogg, D. (2012). In memoriam: Mark Everingham. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2081–2082.

Acknowledgments

We are grateful to Lubor Ladicky for discussions, and to Patrick Buehler for his very generous help. Funding is provided by the Engineering and Physical Sciences Research Council (EPSRC) grant Learning to Recognise Dynamic Visual Content from Broadcast Footage.

Author information

Corresponding author

Correspondence to Tomas Pfister.

Additional information

Mark Everingham, who died in 2012, made a significant contribution to this work. For this reason he is included as a posthumous author. An appreciation of his life and work can be found in Zisserman et al. (2012).

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (mpg 5066 KB)

About this article

Cite this article

Charles, J., Pfister, T., Everingham, M. et al. Automatic and Efficient Human Pose Estimation for Sign Language Videos. Int J Comput Vis 110, 70–90 (2014). https://doi.org/10.1007/s11263-013-0672-6
