Abstract
The large majority of deep learning-based sign language recognition systems adopt a word-model approach. Here we present a system that works with subunits rather than word models. We propose a pipelined approach to deep learning that uses a factorisation algorithm to derive hand motion features embedded within a low-rank trajectory space. Recurrent neural networks are then trained on these embedded features for subunit recognition, followed by a second-stage neural network for sign recognition. Our evaluation shows that the proposed solution compares well in accuracy against the state of the art, while providing the added benefits of better interpretability and of phonologically-meaningful subunits that can operate across different signers and sign languages.
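To make the pipeline concrete, the sketch below illustrates the three stages under stated assumptions: a truncated SVD stands in for the paper's factorisation algorithm, per-frame hand joint positions are assumed to be already tracked, and all dimensions, layer sizes, class counts, and names (embed_trajectories, SubunitRNN, SignClassifier) are illustrative, not the authors' implementation.

```python
# Minimal sketch of the described pipeline, not the authors' code.
import numpy as np
import torch
import torch.nn as nn

def embed_trajectories(W, rank):
    """Project a trajectory matrix W (frames x coordinates) onto its
    leading `rank` principal trajectories. Truncated SVD is used here
    as a stand-in for the paper's low-rank trajectory factorisation."""
    W0 = W - W.mean(axis=0, keepdims=True)   # centre: keep motion, drop position
    U, s, _ = np.linalg.svd(W0, full_matrices=False)
    return U[:, :rank] * s[:rank]            # per-frame low-rank features

class SubunitRNN(nn.Module):
    """Stage 1: a bidirectional LSTM emitting per-frame subunit scores."""
    def __init__(self, in_dim, n_subunits, hidden=64):
        super().__init__()
        self.rnn = nn.LSTM(in_dim, hidden, batch_first=True,
                           bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_subunits)

    def forward(self, x):                    # x: (batch, frames, in_dim)
        h, _ = self.rnn(x)
        return self.head(h)                  # (batch, frames, n_subunits)

class SignClassifier(nn.Module):
    """Stage 2: a second-stage network mapping subunit scores to a sign
    (a single linear layer over time-pooled scores, for illustration)."""
    def __init__(self, n_subunits, n_signs):
        super().__init__()
        self.fc = nn.Linear(n_subunits, n_signs)

    def forward(self, subunit_logits):       # pool over time, then classify
        return self.fc(subunit_logits.mean(dim=1))

# Toy usage: 120 frames, 2 hands x 21 joints x (x, y) = 84 coordinates.
W = np.cumsum(np.random.default_rng(0).standard_normal((120, 84)), axis=0)
feats = torch.tensor(embed_trajectories(W, rank=5), dtype=torch.float32)
subunits = SubunitRNN(in_dim=5, n_subunits=40)(feats.unsqueeze(0))
sign_logits = SignClassifier(n_subunits=40, n_signs=100)(subunits)
print(sign_logits.shape)                     # torch.Size([1, 100])
```

In the paper's setting the subunit stage is trained first on the embedded trajectory features, and the sign-level network is then trained on its outputs; the toy dimensions above only show how the pieces fit together.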
Notes
1. The value for speech is taken from a website that tracks the current state of the art in speech recognition on a number of standard benchmark datasets: http://github.com/syhw/wer_are_we. The reported value for ASLR is obtained on one of the most challenging 'real-life' signing datasets currently available: http://www-i6.informatik.rwth-aachen.de/~koller/RWTH-PHOENIX/.