Aligning accumulative representations for sign language recognition

Kındıroglu, Ahmet Alp; Özdemir, Oğulcan; Akarun, Lale

doi:10.1007/s00138-022-01367-x

Original Paper
Published: 26 December 2022

Volume 34, article number 12 (2023)

Machine Vision and Applications

Ahmet Alp Kındıroglu¹,² (ORCID: 0000-0002-5299-1956),
Oğulcan Özdemir¹ &
Lale Akarun¹

Abstract

Accumulative representations provide a method for representing variable-length videos with constant-length features. In this study, we present aligned temporal accumulative features (ATAF), a skeleton heatmap-based feature for efficient representation and modeling of isolated sign language videos. Inspired by the movement-hold model in sign linguistics, we extract keyframes, align them using temporal transformer networks (TTNs), and extract descriptors using convolutional neural networks (CNNs). In the proposed approach, aligned keyframes increase the recognition power of accumulative features because linguistically significant parts of signs are represented uniquely. Because keyframes are detected from hand movement, their timing can differ from signer to signer. To overcome this challenge, ATAF is implemented with two alignment approaches, alignment of sampled frames and alignment of keyframes, using both finger speed differences and hand joint heatmaps to perform end-to-end alignment during classification. Results demonstrate that the proposed method, in combination with 3D-CNNs, achieves state-of-the-art recognition performance on the public BosphorusSign22k (BSign22k) dataset.
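The two ideas the abstract relies on, picking keyframes from hand movement and accumulating a variable number of frames into a constant-length feature, can be illustrated with a minimal sketch. This is not the authors' ATAF implementation: the function names (hand_speeds, slowest_frames, accumulate), the "k slowest frames" rule used as a stand-in for hold phases, and the linear blending across channels are all assumptions chosen for illustration, and the TTN alignment step is omitted entirely.

```python
from math import hypot


def hand_speeds(track):
    """Speed of one hand joint per frame; `track` is a list of (x, y) positions."""
    steps = [hypot(x1 - x0, y1 - y0)
             for (x0, y0), (x1, y1) in zip(track, track[1:])]
    # The first frame has no predecessor, so it reuses the first measured step.
    return ([steps[0]] + steps) if steps else [0.0] * len(track)


def slowest_frames(speeds, k):
    """Indices of the k slowest frames (assumed hold candidates), in temporal order."""
    ranked = sorted(range(len(speeds)), key=lambda i: (speeds[i], i))
    return sorted(ranked[:k])


def accumulate(heatmaps, channels):
    """Blend T heatmaps (H x W nested lists) into `channels` maps of the same size."""
    height, width = len(heatmaps[0]), len(heatmaps[0][0])
    out = [[[0.0] * width for _ in range(height)] for _ in range(channels)]
    last = max(len(heatmaps) - 1, 1)
    for t, heatmap in enumerate(heatmaps):
        pos = t / last * (channels - 1)  # fractional channel position of frame t
        lo = int(pos)
        hi = min(lo + 1, channels - 1)
        frac = pos - lo
        for r in range(height):
            for c in range(width):
                # Each frame contributes a total weight of 1, split between
                # its two neighbouring channels.
                out[lo][r][c] += (1.0 - frac) * heatmap[r][c]
                out[hi][r][c] += frac * heatmap[r][c]
    return out


# Hypothetical toy input: a hand that moves, pauses, moves, pauses, moves.
track = [(0, 0), (4, 0), (4, 0), (8, 0), (8, 0), (12, 0)]
keyframes = slowest_frames(hand_speeds(track), k=2)
```

Whatever the number of input frames, accumulate returns exactly `channels` maps, which is the constant-length property the abstract refers to. Because the slowest frames depend on how each signer moves, the selected indices vary between signers, which is the variation the alignment step in the paper is meant to absorb.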

Acknowledgements

This work was funded by the Turkish Ministry of Development under the TAM Project #2007K120610 and by TUBITAK Project #117E059. The numerical calculations reported in this paper were also performed at TUBITAK ULAKBIM, HPAGCC-TRUBA.

Author information

Authors and Affiliations

Department of Computer Engineering, Bogazici University, Istanbul, Turkey
Ahmet Alp Kındıroglu, Oğulcan Özdemir & Lale Akarun
Wireless DC Department, Huawei, Istanbul, Turkey
Ahmet Alp Kındıroglu

Corresponding author

Correspondence to Ahmet Alp Kındıroglu.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Kındıroglu, A.A., Özdemir, O. & Akarun, L. Aligning accumulative representations for sign language recognition. Machine Vision and Applications 34, 12 (2023). https://doi.org/10.1007/s00138-022-01367-x

Received: 13 November 2021
Revised: 30 November 2022
Accepted: 08 December 2022
Published: 26 December 2022
DOI: https://doi.org/10.1007/s00138-022-01367-x
