Abstract
We present a fully automatic arm and hand tracker that detects joint positions over continuous sign language video sequences of more than an hour in length. To achieve this, we make contributions in four areas: (i) we show that the overlaid signer can be separated from the background TV broadcast using co-segmentation over all frames with a layered model; (ii) we show that joint positions (shoulders, elbows, wrists) can be predicted per-frame using a random forest regressor given only this segmentation and a colour model; (iii) we show that the random forest can be trained from an existing semi-automatic, but computationally expensive, tracker; and (iv) we introduce an evaluator to assess whether the predicted joint positions are correct for each frame. The method is applied to 20 signing footage videos with changing backgrounds, challenging imaging conditions, and different signers. Our framework outperforms the state-of-the-art long-term tracker by Buehler et al. (International Journal of Computer Vision 95:180–197, 2011), does not require the manual annotation of that work, and, after automatic initialisation, performs tracking in real time. We also achieve superior joint localisation results to those obtained using the pose estimation method of Yang and Ramanan (Proceedings of the IEEE conference on computer vision and pattern recognition, 2011).
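As a rough illustration of contributions (ii) and (iii), the sketch below trains a small regression forest that maps a per-frame feature vector to a 2D joint position. It is not the authors' implementation: the three toy features, the synthetic labels produced by `toy_frames` (standing in for the output of an expensive tracker), and all function names and parameter values are assumptions made only for this example.

```python
import random

def mean_point(points):
    """Mean of a non-empty list of (x, y) positions."""
    n = len(points)
    return (sum(p[0] for p in points) / n, sum(p[1] for p in points) / n)

def scatter(points):
    """Sum of squared distances of the positions from their mean."""
    mx, my = mean_point(points)
    return sum((p[0] - mx) ** 2 + (p[1] - my) ** 2 for p in points)

def build_tree(X, Y, depth, rng, n_candidates=10, min_leaf=2):
    """Grow one tree; a leaf is (x, y), a split is (feature, threshold, left, right)."""
    if depth == 0 or len(X) < 2 * min_leaf:
        return mean_point(Y)
    best = None
    for _ in range(n_candidates):
        f = rng.randrange(len(X[0]))   # random feature index
        t = rng.choice(X)[f]           # random threshold taken from the data
        left = [i for i, x in enumerate(X) if x[f] < t]
        right = [i for i, x in enumerate(X) if x[f] >= t]
        if len(left) < min_leaf or len(right) < min_leaf:
            continue
        cost = scatter([Y[i] for i in left]) + scatter([Y[i] for i in right])
        if best is None or cost < best[0]:
            best = (cost, f, t, left, right)
    if best is None:
        return mean_point(Y)
    _, f, t, left, right = best
    return (f, t,
            build_tree([X[i] for i in left], [Y[i] for i in left],
                       depth - 1, rng, n_candidates, min_leaf),
            build_tree([X[i] for i in right], [Y[i] for i in right],
                       depth - 1, rng, n_candidates, min_leaf))

def predict_tree(node, x):
    """Descend from the root until a leaf (a 2-tuple) is reached."""
    while len(node) == 4:
        f, t, left, right = node
        node = left if x[f] < t else right
    return node

def train_forest(X, Y, n_trees=20, depth=6, seed=0):
    """Train each tree on a bootstrap sample of the labelled frames."""
    rng = random.Random(seed)
    forest = []
    for _ in range(n_trees):
        idx = [rng.randrange(len(X)) for _ in X]
        forest.append(build_tree([X[i] for i in idx], [Y[i] for i in idx], depth, rng))
    return forest

def predict_forest(forest, x):
    """Average the per-tree position estimates."""
    return mean_point([predict_tree(tree, x) for tree in forest])

def toy_frames(n, rng):
    """Hypothetical data: 3 features per frame and a synthetic wrist position."""
    X, Y = [], []
    for _ in range(n):
        a, b, c = rng.random(), rng.random(), rng.random()
        X.append([a, b, c])                      # c is a pure distractor feature
        Y.append((100 + 50 * a, 200 - 30 * b))   # stand-in for tracker output
    return X, Y

X, Y = toy_frames(400, random.Random(1))
forest = train_forest(X, Y)
low = predict_forest(forest, [0.1, 0.5, 0.5])
high = predict_forest(forest, [0.9, 0.5, 0.5])
print(low, high)
```

Because every leaf stores a mean of training positions and the forest averages its trees, each prediction stays inside the range of positions seen during training. A real system would replace the toy features with descriptors computed from the segmentation and colour model.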
References
Amit, Y., & Geman, D. (1997). Shape quantization and recognition with randomized trees. Neural Computation, 9(7), 1545–1588.
Andriluka, M., Roth, S., & Schiele, B. (2012). Discriminative appearance models for pictorial structures. International Journal of Computer Vision, 99(3), 259–280.
Apostoloff, N. E., & Zisserman, A. (2007). Who are you?—real-time person identification. In Proceedings of the British machine vision conference.
Benfold, B., & Reid, I. (2008). Colour invariant head pose classification in low resolution video. In Proceedings of the British machine vision conference.
Bosch, A., Zisserman, A., & Munoz, X. (2007). Image classification using random forests and ferns. In Proceedings of the international conference on computer vision.
Bowden, R., Windridge, D., Kadir, T., Zisserman, A., & Brady, J. M. (2004). A linguistic feature vector for the visual interpretation of sign language. In Proceedings of the European conference on computer vision. Berlin: Springer.
Breiman, L. (2001). Random forests. Machine Learning, 45(1), 5–32.
Buehler, P., Everingham, M., Huttenlocher, D. P., & Zisserman, A. (2011). Upper body detection and tracking in extended signing sequences. International Journal of Computer Vision, 95(2), 180–197.
Buehler, P., Everingham, M., & Zisserman, A. (2009). Learning sign language by watching TV (using weakly aligned subtitles). In Proceedings of the IEEE conference on computer vision and pattern recognition.
Buehler, P., Everingham, M., & Zisserman, A. (2010). Employing signed TV broadcasts for automated learning of British sign language. In Workshop on representation and processing of sign languages.
Chai, Y., Lempitsky, V., & Zisserman, A. (2011). BiCoS: A bi-level co-segmentation method for image classification. In Proceedings of the international conference on computer vision.
Chai, Y., Rahtu, E., Lempitsky, V., Van Gool, L., & Zisserman, A. (2012). TriCoS: A tri-level class-discriminative co-segmentation method for image classification. In Proceedings of the European conference on computer vision.
Charles, J., Pfister, T., Magee, D., Hogg, D., & Zisserman, A. (2013). Domain adaptation for upper body pose tracking in signed TV broadcasts. In Proceedings of the British machine vision conference.
Chunli, W., Wen, G., & Jiyong, M. (2002). A real-time large vocabulary recognition system for Chinese Sign Language. In Gesture and sign language in HCI.
Cooper, H., & Bowden, R. (2007). Large lexicon detection of sign language. In Workshop on human computer interaction.
Cooper, H., & Bowden, R. (2009). Learning signs from subtitles: A weakly supervised approach to sign language recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Cootes, T., Ionita, M., Lindner, C., & Sauer, P. (2012). Robust and accurate shape model fitting using random forest regression voting. In Proceedings of the European conference on computer vision.
Criminisi, A., Shotton, J., & Konukoglu, E. (2012). Decision forests: A unified framework for classification, regression, density estimation, manifold learning and semi-supervised learning. Foundations and Trends in Computer Graphics and Vision, 7(2), 81–227.
Criminisi, A., Shotton, J., Robertson, D., & Konukoglu, E. (2011). Regression forests for efficient anatomy detection and localization in CT studies. In International conference on medical image computing and computer assisted intervention workshop on probabilistic models for medical image analysis.
Dalal, N., & Triggs, B. (2005). Histograms of oriented gradients for human detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Dantone, M., Gall, J., Fanelli, G., & Van Gool, L. (2012). Real-time facial feature detection using conditional regression forests. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Dreuw, P., Deselaers, T., Rybach, D., Keysers, D., & Ney, H. (2006). Tracking using dynamic programming for appearance-based sign language recognition. In Proceedings of the IEEE conference on automatic face and gesture recognition.
Dreuw, P., Forster, J., & Ney, H. (2012). Tracking benchmark databases for video-based sign language recognition. In Trends and topics in computer vision (pp. 286–297). Berlin: Springer.
Eichner, M., & Ferrari, V. (2009). Better appearance models for pictorial structures. In Proceedings of the British machine vision conference.
Eichner, M., Marin-Jimenez, M., Zisserman, A., & Ferrari, V. (2012). 2D articulated human pose estimation and retrieval in (almost) unconstrained still images. International Journal of Computer Vision, 1–25.
Fanelli, G., Dantone, M., Gall, J., Fossati, A., & Van Gool, L. (2012). Random forests for real time 3D face analysis. International Journal of Computer Vision, 101(3), 1–22.
Fanelli, G., Gall, J., & Van Gool, L. (2011). Real time head pose estimation with random regression forests. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Farhadi, A., & Forsyth, D. (2006). Aligning ASL for statistical translation using a discriminative word model. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Farhadi, A., Forsyth, D., & White, R. (2007). Transfer learning in sign language. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Felzenszwalb, P., Girshick, R., & McAllester, D. (2010). Cascade object detection with deformable part models. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Felzenszwalb, P., & Huttenlocher, D. (2005). Pictorial structures for object recognition. International Journal of Computer Vision, 61(1), 55–79.
Felzenszwalb, P., McAllester, D., & Ramanan, D. (2008). A discriminatively trained, multiscale, deformable part model. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Ferrari, V., Marin-Jimenez, M., & Zisserman, A. (2008). Progressive search space reduction for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Gall, J., & Lempitsky, V. (2009). Class-specific Hough forests for object detection. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Geremia, E., Clatz, O., Menze, B., Konukoglu, E., Criminisi, A., & Ayache, N. (2011). Spatial decision forests for MS lesion segmentation in multi-channel magnetic resonance images. NeuroImage, 57(2), 378–390.
Girshick, R., Shotton, J., Kohli, P., Criminisi, A., & Fitzgibbon, A. (2011). Efficient regression of general-activity human poses from depth images. In Proceedings of the international conference on computer vision.
Hochbaum, D., & Singh, V. (2009). An efficient algorithm for co-segmentation. In Proceedings of the international conference on computer vision.
Jammalamadaka, N., Zisserman, A., Eichner, M., Ferrari, V., & Jawahar, C. V. (2012). Has my algorithm succeeded? An evaluator for human pose estimators. In Proceedings of the European conference on computer vision.
Johnson, S., & Everingham, M. (2009). Combining discriminative appearance and segmentation cues for articulated human pose estimation. In IEEE international workshop on machine learning for vision-based motion analysis.
Jojic, N., & Frey, B. (2001). Learning flexible sprites in video layers. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Joulin, A., Bach, F., & Ponce, J. (2010). Discriminative clustering for image co-segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Kadir, T., Bowden, R., Ong, E., & Zisserman, A. (2004). Minimal training, large lexicon, unconstrained sign language recognition. In Proceedings of the British machine vision conference.
Kadir, T., Zisserman, A., & Brady, J. M. (2004). An affine invariant salient region detector. In Proceedings of the European conference on computer vision.
Kontschieder, P., Bulò, S., Criminisi, A., Kohli, P., Pelillo, M., & Bischof, H. (2012). Context-sensitive decision forests for object detection. In Advances in neural information processing systems.
Kumar, M. P., Torr, P. H. S., & Zisserman, A. (2008). Learning layered motion segmentations of video. International Journal of Computer Vision, 76, 301–319.
Lepetit, V., & Fua, P. (2006). Keypoint recognition using randomized trees. IEEE Transactions on Pattern Analysis and Machine Intelligence, 28(9), 1465–1479.
Liu, C., Gong, S., Loy, C., & Lin, X. (2012). Person re-identification: What features are important? In Proceedings of the European conference on computer vision.
Marée, R., Geurts, P., Piater, J., & Wehenkel, L. (2005). Random subwindows for robust image classification. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Moeslund, T. (2011). Visual analysis of humans: Looking at people. Berlin: Springer.
Nowozin, S., Rother, C., Bagon, S., Sharp, T., Yao, B., & Kohli, P. (2011). Decision tree fields. In Proceedings of the international conference on computer vision.
Ong, E., & Bowden, R. (2004). A boosted classifier tree for hand shape detection. In Proceedings of the international conference on automatic face and gesture recognition.
Ozuysal, M., Calonder, M., Lepetit, V., & Fua, P. (2010). Fast keypoint recognition using random ferns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(3), 448–461.
Pfister, T., Charles, J., Everingham, M., & Zisserman, A. (2012). Automatic and efficient long term arm and hand tracking for continuous sign language TV broadcasts. In Proceedings of the British machine vision conference.
Pfister, T., Charles, J., & Zisserman, A. (2013). Large-scale learning of sign language by watching TV (using co-occurrences). In Proceedings of the British machine vision conference.
Ramanan, D. (2006). Learning to parse images of articulated bodies. In Advances in neural information processing systems.
Ramanan, D., Forsyth, D. A., & Zisserman, A. (2007). Tracking people by learning their appearance. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(1), 65–81.
Rother, C., Kolmogorov, V., & Blake, A. (2004). GrabCut: Interactive foreground extraction using iterated graph cuts. In Proceedings of the ACM SIGGRAPH conference on computer graphics.
Rother, C., Minka, T., Blake, A., & Kolmogorov, V. (2006). Cosegmentation of image pairs by histogram matching - incorporating a global constraint into MRFs. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Santner, J., Leistner, C., Saffari, A., Pock, T., & Bischof, H. (2010). PROST: Parallel robust online simple tracking. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Sapp, B., Jordan, C., & Taskar, B. (2010). Adaptive pose priors for pictorial structures. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Sapp, B., Weiss, D., & Taskar, B. (2011). Parsing human motion with stretchable models. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Sharp, T. (2008). Implementing decision trees and forests on a GPU. In Proceedings of the European conference on computer vision.
Shotton, J., Fitzgibbon, A., Cook, M., Sharp, T., Finocchio, M., Moore, R., et al. (2011). Real-time human pose recognition in parts from single depth images. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Shotton, J., Johnson, M., & Cipolla, R. (2008). Semantic texton forests for image categorization and segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Sivic, J., Zitnick, C. L., & Szeliski, R. (2006). Finding people in repeated shots of the same scene. In Proceedings of the British machine vision conference, Edinburgh.
Starner, T., Weaver, J., & Pentland, A. (1998a). Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1371–1375.
Starner, T., Weaver, J., & Pentland, A. (1998b). Real-time American Sign Language recognition using desk and wearable computer based video. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12), 1371–1375.
Sun, M., Kohli, P., & Shotton, J. (2012). Conditional regression forests for human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Szeliski, R., Avidan, S., & Anandan, P. (2000). Layer extraction from multiple images containing reflections and transparency. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Taylor, J., Shotton, J., Sharp, T., & Fitzgibbon, A. (2012). The Vitruvian manifold: Inferring dense correspondences for one-shot human pose estimation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Tran, D., & Forsyth, D. (2010). Improved human parsing with a full relational model. In Proceedings of the European conference on computer vision.
Vogler, C., & Metaxas, D. (1998). ASL recognition based on a coupling between HMMs and 3D motion analysis. In Proceedings of the international conference on computer vision.
Yang, Y., & Ramanan, D. (2011). Articulated pose estimation with flexible mixtures-of-parts. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Yin, P., Criminisi, A., Winn, J., & Essa, I. (2007). Tree-based classifiers for bilayer video segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Zhu, X., & Ramanan, D. (2012). Face detection, pose estimation, and landmark localization in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition.
Zisserman, A., Winn, J., Fitzgibbon, A., van Gool, L., Sivic, J., Williams, C., & Hogg, D. (2012). In memoriam: Mark Everingham. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(11), 2081–2082.
Acknowledgments
We are grateful to Lubor Ladicky for discussions, and to Patrick Buehler for his very generous help. Funding is provided by the Engineering and Physical Sciences Research Council (EPSRC) grant Learning to Recognise Dynamic Visual Content from Broadcast Footage.
Additional information
Mark Everingham, who died in 2012, made a significant contribution to this work. For this reason he is included as a posthumous author. An appreciation of his life and work can be found in Zisserman et al. (2012).
Electronic supplementary material
Below is the link to the electronic supplementary material.
Supplementary material 1 (mpg 5066 KB)
Cite this article
Charles, J., Pfister, T., Everingham, M. et al. Automatic and Efficient Human Pose Estimation for Sign Language Videos. Int J Comput Vis 110, 70–90 (2014). https://doi.org/10.1007/s11263-013-0672-6