Skip to main content
Log in

Aligning accumulative representations for sign language recognition

  • Original Paper
  • Published:
Machine Vision and Applications Aims and scope Submit manuscript

Abstract

Accumulative representations provide a method for representing variable-length videos with constant length features. In this study, we present aligned temporal accumulative features (ATAF), a skeleton heatmap-based feature for efficient representation and modeling of isolated sign language videos. Inspired by the movement-hold model in sign linguistics, we extract keyframes, align them using temporal transformer networks (TTNs) and extract descriptors using convolutional neural networks (CNNs). In the proposed approach, the use of aligned keyframes increases the recognition power of accumulative features as linguistically significant parts of signs are represented uniquely. Since we detect keyframes using hand movement, there can be differences from signer to signer. To overcome this challenge, ATAF has been implemented with both alignment of sampled frames and keyframe alignment approaches, using both finger speed differences and hand joint heatmaps to perform end-to-end alignment during classification. Results demonstrate that the proposed method achieves state-of-the-art recognition performance on the public BosphorusSign22k (BSign22k) dataset in combination with 3D-CNNs.

This is a preview of subscription content, log in via an institution to check access.

Access this article

Price excludes VAT (USA)
Tax calculation will be finalised during checkout.

Instant access to the full article PDF.

Fig. 1
Fig. 2
Fig. 3
Fig. 4
Fig. 5
Fig. 6
Fig. 7
Fig. 8
Fig. 9

Similar content being viewed by others

References

  1. Gökgöz, K.: Negation in turkish sign language: the syntax of nonmanual markers. Sign Language Linguistics 14(1), 49–75 (2011)

    Article  Google Scholar 

  2. Kuehne, H., Jhuang, H., Garrote, E., Poggio, T., Serre, T.: Hmdb: a large video database for human motion recognition. Science 2, 2556–2563 (2011)

    Google Scholar 

  3. Soomro, K., Zamir, A.R., Shah, M.: Ucf101: a dataset of 101 human actions classes from videos in the wild. arXiv:1212.0402 (2012)

  4. Kay, W. et al.: The kinetics human action video dataset. arXiv:1705.06950 (2017)

  5. Carreira, J., Zisserman, A.: Quo vadis, action recognition? a new model and the kinetics dataset. IEEE 2, 4724–4733 (2017)

    Google Scholar 

  6. Jiang, S., et al.: Skeleton Aware Multi-modal Sign Language Recognition. Springer, Berlin (2021)

    Book  Google Scholar 

  7. Choutas, V., Weinzaepfel, P., Revaud, J., Schmid, C.: Potion: pose motion representation for action recognition. Science 3, 7024–7033 (2018)

    Google Scholar 

  8. Tran, D., et al.: A closer look at spatiotemporal convolutions for action recognition. Science 3, 6450–6459 (2018)

    Google Scholar 

  9. Liddell, S.K., Johnson, R.E.: American sign language: the phonological base. Sign Language Stud. 64(1), 195–277 (1989)

    Article  Google Scholar 

  10. Pitsikalis, V., Theodorakis, S., Vogler, C., Maragos, P.: Advances in phonetics-based sub-unit modeling for transcription alignment and sign language recognition. In: IEEE, pp. 1–6 (2011)

  11. Cooper, H., Ong, E.-J., Pugeault, N., Bowden, R.: Sign language recognition using sub-units. J. Mach. Learn. Res. 13(1), 2205–2231 (2012)

    Google Scholar 

  12. Bowden, R., Windridge, D., Kadir, T., Zisserman, A., Brady, M.: A Linguistic Feature Vector for the Visual Interpretation of Sign Language, pp. 390–401. Springer, Berlin (2004)

    MATH  Google Scholar 

  13. Theodorakis, S., Pitsikalis, V., Maragos, P.: Dynamic-static unsupervised sequentiality, statistical subunits and lexicon for sign language recognition. Image Vis. Comput. 32(8), 533–549 (2014)

    Article  Google Scholar 

  14. Tornay, S.: Explainable Phonology-based Approach for Sign Language Recognition and Assessment. Ph.D. thesis, EPFL (2021)

  15. Borg, M., Camilleri, K.P.: Phonologically-Meaningful Subunits for Deep Learning-Based Sign Language Recognition, pp. 199–217. Springer, Berlin (2020)

    Google Scholar 

  16. Camgoz, N.C., Hadfield, S., Koller, O., Bowden, R.: End-to-end hand shape and continuous sign language recognition. Subunets 3, 7 (2017)

    Google Scholar 

  17. Tavella, F., Schlegel, V., Romeo, M., Galata, A., Cangelosi, A.: Wlasl-lex: a dataset for recognising phonological properties in american sign language. arXiv preprint arXiv:2203.06096 (2022)

  18. Caselli, N.K., Sehyr, Z.S., Cohen-Goldberg, A.M., Emmorey, K.: Asl-lex: a lexical database of American sign language. Behav. Res. Methods 49(2), 784–801 (2017)

    Article  Google Scholar 

  19. Gao, Z., Lu, G., Lyu, C., Yan, P.: Key-frame selection for automatic summarization of surveillance videos: a method of multiple change-point detection. Mach. Vis. Appl. 29(7), 1101–1117 (2018)

    Article  Google Scholar 

  20. Xiong, W., Lee, C.-M., Ma, R.-H.: Automatic video data structuring through shot partitioning and key-frame computing. Mach. Vis. Appl. 10(2), 51–65 (1997)

    Article  Google Scholar 

  21. Fanfani, M., Bellavia, F., Colombo, C.: Accurate keyframe selection and keypoint tracking for robust visual odometry. Mach. Vis. Appl. 27(6), 833–844 (2016)

    Article  Google Scholar 

  22. Tang, H., Liu, H., Xiao, W., Sebe, N.: Fast and robust dynamic hand gesture recognition via key frames extraction and feature fusion. Neurocomputing 331, 424–433 (2019)

    Article  Google Scholar 

  23. Mo, H., Yamagishi, F., Ide, I., Satoh, S., Sakauchi, M.: Key shot extraction and indexing in a news video archive. IEICE Tech. Rep. 105(118), 55–59 (2005)

    Google Scholar 

  24. Xu, W., Miao, Z., Yu, J., Ji, Q.: Action recognition and localization with spatial and temporal contexts. Neurocomputing 333, 351–363 (2019)

    Article  Google Scholar 

  25. Yang, R., Sarkar, S.: Detecting coarticulation in sign language using conditional random fields. Science 2, 108–112 (2006)

    Google Scholar 

  26. Zhao, Z., Elgammal, A.M.: Information Theoretic Key Frame Selection for Action Recognition, pp. 1–10. Springer, Berlin (2008)

    Google Scholar 

  27. Carlsson, S., Sullivan, J.: Action Recognition by Shape Matching to Key Frames, vol. 1. Citeseer, London (2001)

    Google Scholar 

  28. Lu, G., Zhou, Y., Li, X., Yan, P.: Unsupervised, efficient and scalable key-frame selection for automatic summarization of surveillance videos. Multimedia Tools Appl. 76(5), 6309–6331 (2017)

    Article  Google Scholar 

  29. Rodriguez, A., Laio, A.: Clustering by fast search and find of density peaks. Science 344(6191), 1492–1496 (2014)

    Article  Google Scholar 

  30. Elakkiya, R., Selvamani, K.: Extricating manual and non-manual features for subunit level medical sign modelling in automatic sign language classification and recognition. J. Med. Syst. 41(11), 1–13 (2017)

    Google Scholar 

  31. De Coster, M., Van Herreweghe, M., Dambre, J.: Sign language recognition with transformer networks. In: European Language Resources Association (ELRA), pp. 6018–6024 (2020)

  32. Huang, S., Mao, C., Tao, J., Ye, Z.: A novel Chinese sign language recognition method based on keyframe-centered clips. IEEE Signal Process. Lett. 25(3), 442–446 (2018)

    Article  Google Scholar 

  33. Pan, W., Zhang, X., Ye, Z.: Attention-based sign language recognition network utilizing keyframe sampling and skeletal features. IEEE Access 8, 215592–215602 (2020)

    Article  Google Scholar 

  34. Albanie, S., et al.: Bsl-1k: Scaling Up Co-articulated Sign Language Recognition Using Mouthing Cues, pp. 35–53. Springer, Berlin (2020)

    Google Scholar 

  35. Berndt, D.J., Clifford, J.: Using Dynamic Time Warping to Find Patterns in Time Series, vol. 10, pp. 359–370. Springer, Seattle (1994)

    Google Scholar 

  36. Cuturi, M., Blondel, M.: Soft-dtw: a differentiable loss function for time-series. arXiv preprint arXiv:1703.01541 (2017)

  37. Petitjean, F., Ketterlin, A., Gançarski, P.: A global averaging method for dynamic time warping, with applications to clustering. Pattern Recogn. 44(3), 678–693 (2011)

    Article  MATH  Google Scholar 

  38. Zhou, F., Torre, F.: Canonical time warping for alignment of human behavior. Adv. Neural. Inf. Process. Syst. 22, 2286–2294 (2009)

    Google Scholar 

  39. Trigeorgis, G., Nicolaou, M.A., Zafeiriou, S., Schuller, B.W.: Deep canonical time warping. Science 2, 5110–5118 (2016)

    Google Scholar 

  40. Chang, C.-Y., Huang, D.-A., Sui, Y., Fei-Fei, L., Niebles, J.C.: D3tw: discriminative differentiable dynamic time warping for weakly supervised action alignment and segmentation. Science 6, 3546–3555 (2019)

    Google Scholar 

  41. Korbar, B., Tran, D., Torresani, L.: Scsampler: sampling salient clips from video for efficient action recognition. Science 6, 6232–6242 (2019)

    Google Scholar 

  42. Lohit, S., Wang, Q., Turaga, P.K.: Temporal transformer networks: Joint learning of invariant and discriminative time warping. CoRR abs/1906.05947. http://arxiv.org/abs/1906.05947 (2019)

  43. Oh, J., Wang, J. , Wiens, J.: Learning to exploit invariances in clinical time-series data using sequence transformer networks. arXiv preprint arXiv:1808.06725 (2018)

  44. Starner, T., Pentland, A.: Real-Time American Sign Language Recognition from Video Using Hidden Markov Models, pp. 227–243. Springer, Berlin (1997)

    Google Scholar 

  45. Özdemir, O., Camgöz, N.C., Akarun, L.: Isolated sign language recognition using improved dense trajectories. In: IEEE, pp. 1961–1964 (2016)

  46. Camgöz, N.C. et al.: Bosphorussign: a Turkish sign language recognition corpus in health and finance domains (2016)

  47. Ding, L., Martinez, A.M.: Modelling and recognition of the linguistic components in American sign language. Image Vis. Comput. 27(12), 1826–1844 (2009)

    Article  Google Scholar 

  48. Theodorakis, S., Pitsikalis, V., Maragos, P.: Dynamic-static unsupervised sequentiality, statistical subunits and lexicon for sign language recognition. Image Vis. Comput. 32(8), 533–549 (2014)

    Article  Google Scholar 

  49. Ong, E.-J., Koller, O., Pugeault, N., Bowden, R.: Sign spotting using hierarchical sequential patterns with temporal intervals, pp. 1923–1930 (2014)

  50. Belgacem, S., Chatelain, C., Paquet, T.: Gesture sequence recognition with one shot learned crf/hmm hybrid model. Image Vis. Comput. 61, 12–21 (2017)

    Article  Google Scholar 

  51. Rastgoo, R., Kiani, K., Escalera, S.: Sign language recognition: a deep survey. Expert Syst. Appl. 164, 113794 (2021)

    Article  Google Scholar 

  52. Vaezi Joze, H.R., Koller, O.: MS-ASL: a large-scale data set and benchmark for understanding American Sign Language (2018)

  53. Li, D., Rodriguez, C., Yu, X., Li, H.: Word-level deep sign language recognition from video: a new large-scale dataset and methods comparison, pp. 1459–1469 (2020)

  54. Chai, X., Wang, H., Chen, X.: The devisign large vocabulary of chinese sign language database and baseline evaluations. Technical report VIPL-TR-14-SLR-001. Key Lab of Intelligent Information Processing of Chinese Academy of Sciences (CAS), Institute of Computing Technology, CAS (2014)

  55. Neidle, C., Thangali, A., Sclaroff, S.: Challenges in development of the American sign language lexicon video dataset (asllvd) corpus (Citeseer, 2012)

  56. Albanie, S. et al.: BSL-1K: scaling up co-articulated sign language recognition using mouthing cues (2020)

  57. He, J., Liu, Z., Zhang, J.: Chinese sign language recognition based on trajectory and hand shape features. In: IEEE, pp. 1–4 (2016)

  58. Özdemir, O., Kındıroğlu, A.A., Camgöz, N.C., Akarun, L.: Bosphorussign22k sign language recognition dataset. arXiv preprint arXiv:2004.01283 (2020)

  59. Forster, J., Schmidt, C., Koller, O., Bellgardt, M., Ney, H.: Extensions of the sign language recognition and translation corpus rwth-phoenix-weather, pp. 1911–1916 (2014)

  60. Zhang, J., Zhou, W., Xie, C., Pu, J., Li, H.: Chinese sign language recognition with adaptive hmm. In: IEEE, pp. 1–6 (2016)

  61. Pu, J., Zhou, W., Li, H.: Iterative alignment network for continuous sign language recognition, pp. 4165–4174 (2019)

  62. Donahue, J. et al.: Long-term recurrent convolutional networks for visual recognition and description, pp. 2625–2634 (2015)

  63. Tran, D., Bourdev, L., Fergus, R., Torresani, L., Paluri, M.: Learning spatiotemporal features with 3d convolutional networks. In: IEEE, pp. 4489–4497 (2015)

  64. Koller, O., Camgoz, N.C., Ney, H., Bowden, R.: Weakly supervised learning with multi-stream cnn-lstm-hmms to discover sequential parallelism in sign language videos. IEEE Trans. Pattern Anal. Mach. Intell. 6, 788 (2019)

    Google Scholar 

  65. Cao, Z., Hidalgo, G., Simon, T., Wei, S.-E., Sheikh, Y.: Openpose: realtime multi-person 2d pose estimation using part affinity fields. arXiv preprint arXiv:1812.08008 (2018)

  66. Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. arXiv preprint arXiv:1801.07455 (2018)

  67. Cheng, K., Zhang, Y., Cao, C., Shi, L., Cheng, J.: Decoupling GCN with Dropgraph Module for Skeleton-based Action Recognition. Springer, Berlin (2021)

    Google Scholar 

  68. Zhu, W. et al.: Co-occurrence feature learning for skeleton based action recognition using regularized deep lstm networks. arXiv preprint arXiv:1603.07772 (2016)

  69. Joze, H.R.V., & Koller, O.: Ms-asl: a large-scale data set and benchmark for understanding american sign language. arXiv preprint arXiv:1812.01053 (2018)

  70. Asghari-Esfeden, S., Sznaier, M., Camps, O.: Dynamic motion representation for human action recognition, pp. 557–566 (2020)

  71. Simonyan, K., Zisserman, A.: Two-Stream Convolutional Networks for Action Recognition in Videos, pp. 568–576. MIT Press, London (2014)

  72. Sincan, O.M., Keles, H.Y.: Autsl: a large scale multi-modal Turkish sign language dataset and baseline methods. IEEE Access 8, 181340–181355 (2020)

    Article  Google Scholar 

  73. Han, J., Shao, L., Xu, D., Shotton, J.: Enhanced computer vision with microsoft kinect sensor: a review. IEEE Trans. Cybern. 43(5), 1318–1334 (2013)

    Article  Google Scholar 

  74. Paszke, A., et al.: Pytorch: an imperative style, high-performance deep learning library. Adv. Neural Inf. Process. Syst. 32, 8024–8035 (2019)

    Google Scholar 

  75. Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017)

  76. Alp Kindiroglu, A., Ozdemir, O., Akarun, L.: Temporal accumulative features for sign language recognition (2019)

  77. Gökçe, Ç., Özdemir, O., Kındıroğlu, A.A., Akarun, L.: Score-level multi cue fusion for sign language recognition, pp. 294–309, Springer (2020)

  78. Moryossef, A. et al.: Evaluating the immediate applicability of pose estimation for sign language recognition, pp. 3434–3440 (2021)

Download references

Acknowledgements

This work was funded by the Turkish ministry of development under the TAM Project #2007K120610, TUBITAK Project #117E059. The numerical calculations reported in this paper were also performed at TUBITAK ULAKBIM, HPAGCC-TRUBA.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Ahmet Alp Kındıroglu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Kındıroglu, A.A., Özdemir, O. & Akarun, L. Aligning accumulative representations for sign language recognition. Machine Vision and Applications 34, 12 (2023). https://doi.org/10.1007/s00138-022-01367-x

Download citation

  • Received:

  • Revised:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1007/s00138-022-01367-x

Keywords

Mathematics Subject Classification

Navigation