A shapelet-based framework for large-scale word-level sign language database auto-construction

  • Review
  • Neural Computing and Applications

Abstract

Sign language recognition is a challenging and often underestimated problem that involves the asynchronous integration of multimodal articulators. Learning powerful applied statistical models requires large amounts of training data. However, well-labelled sign language databases are scarce because performing signs and labelling them manually is costly. Meanwhile, many sign language-interpreted videos are available on the Internet. This work proposes a framework that automatically learns a large-scale sign language database from such videos. We achieve this by exploring the correspondence between subtitles and motions through shapelets, which are the most discriminative subsequences within the data sequences. Two modified shapelet methods, one based on brute-force search and the other on parameter learning, were used to identify the target signs for 1000 words from 89 sign language-interpreted videos (96 h, 8 naive signers). An augmented large-scale word-level sign database (3–5 times larger) was then constructed using an adaptive sample augmentation strategy that collects all video clips similar to the target sign as valid samples. Experiments on a subset of 100 words revealed a considerable speedup and a 14% improvement in recall rate. An evaluation with three state-of-the-art sign language classifiers demonstrates that the database is discriminative, and the sample augmentation strategy significantly increases the recognition accuracy of all classifiers by 10–33% by increasing the number, variety, and balance of the data.
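To make the shapelet idea concrete, the following is a minimal sketch of a classic brute-force shapelet search on toy one-dimensional series. It is not the modified methods used in this work: the function names, the separation score, and the synthetic data are all assumptions made for illustration, and real input would be multi-dimensional motion features extracted from video. Here a "positive" series stands for a clip whose subtitle contains the target word, and a "negative" series for one whose subtitle does not.

```python
import numpy as np


def subsequence_distance(series, candidate):
    """Smallest Euclidean distance between `candidate` and any window of `series`."""
    windows = np.lib.stride_tricks.sliding_window_view(series, len(candidate))
    return float(np.min(np.linalg.norm(windows - candidate, axis=1)))


def brute_force_shapelet(positives, negatives, length):
    """Try every window of every positive series and keep the most separating one.

    The score is a hypothetical stand-in for a discriminative criterion:
    a good candidate lies close to all positive series and far from all
    negative series.
    """
    best_score, best_shapelet = -np.inf, None
    for series in positives:
        for start in range(len(series) - length + 1):
            candidate = series[start:start + length]
            d_pos = [subsequence_distance(s, candidate) for s in positives]
            d_neg = [subsequence_distance(s, candidate) for s in negatives]
            score = min(d_neg) - max(d_pos)
            if score > best_score:
                best_score, best_shapelet = score, candidate.copy()
    return best_score, best_shapelet


# Synthetic data (assumption): a short bump hidden in noise plays the role of the sign.
rng = np.random.default_rng(0)
pattern = np.array([0.0, 2.0, 4.0, 2.0, 0.0])


def make_series(with_pattern):
    series = rng.normal(0.0, 0.1, size=40)
    if with_pattern:
        start = int(rng.integers(0, 40 - len(pattern)))
        series[start:start + len(pattern)] += pattern
    return series


positives = [make_series(True) for _ in range(4)]
negatives = [make_series(False) for _ in range(4)]
score, shapelet = brute_force_shapelet(positives, negatives, length=len(pattern))
print("separation score:", round(score, 2))
print("recovered shapelet:", np.round(shapelet, 1))
```

The exhaustive loop compares every candidate window against every series, so its cost grows quickly with the number and length of the series. That cost is the usual motivation for faster search strategies or for learning the shapelet as a parameter.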

Code availability

The database, models, and code are available at https://github.com/hitmaxiang/SPBSL.

Acknowledgements

This work was supported by the National Natural Science Foundation of China under Grant No. 61876054.

Author information

Corresponding author

Correspondence to Xiang Ma.

Ethics declarations

Conflict of interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Ma, X., Wang, Q., Zheng, T. et al. A shapelet-based framework for large-scale word-level sign language database auto-construction. Neural Comput & Applic 35, 253–274 (2023). https://doi.org/10.1007/s00521-022-08018-2

  • DOI: https://doi.org/10.1007/s00521-022-08018-2
