Abstract
Sign language is a natural language widely used by Deaf and hard-of-hearing (DHH) individuals. Advanced wearables have been developed to recognize sign language automatically, but they are limited by the scarcity of labeled data, which leads to small vocabularies and unsatisfactory performance despite laborious data collection efforts. Here we propose SignRing, an IMU-based system that moves beyond traditional data augmentation: it leverages online videos to generate virtual IMU (v-IMU) data and pushes the boundary of wearable-based systems to a vocabulary of 934 glosses and sentences of up to 16 glosses. The v-IMU data is generated by reconstructing 3D hand movements from two-view videos and deriving 3-axis acceleration data from the reconstructed trajectories. With this approach, we achieve a word error rate (WER) of 6.3% using a half-and-half mix of v-IMU and IMU training data (2,339 samples each) and a WER of 14.7% using 100% v-IMU training data (6,048 samples), compared with a baseline WER of 8.3% (trained on 2,339 IMU samples). We further compare v-IMU and IMU data directly to demonstrate the reliability and generalizability of the v-IMU data. This interdisciplinary work spans wearable sensor development, computer vision, deep learning, and linguistics, and can provide valuable insights to researchers with similar objectives.
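The abstract describes deriving 3-axis acceleration from reconstructed 3D hand trajectories. The paper's actual v-IMU pipeline is not reproduced here; as an illustration of the core idea, the following is a minimal sketch that approximates accelerometer readings by double-differentiating a position trajectory with second-order central differences. It assumes positions are given in meters at a fixed sampling rate, and it ignores gravity and sensor orientation, which a realistic pipeline would also have to model.

```python
import numpy as np

def virtual_imu_acceleration(positions, fs):
    """Approximate 3-axis acceleration from a reconstructed 3D trajectory.

    positions: (T, 3) array of hand positions in meters
    fs: sampling rate in Hz
    Returns a (T-2, 3) array of acceleration in m/s^2 via the
    second-order central difference a[t] = (p[t+1] - 2 p[t] + p[t-1]) / dt^2.
    """
    dt = 1.0 / fs
    return (positions[2:] - 2 * positions[1:-1] + positions[:-2]) / dt**2

# Sanity check: a trajectory x(t) = t^2 has constant acceleration 2 m/s^2.
t = np.arange(0, 1, 0.01).reshape(-1, 1)  # 100 Hz sampling
traj = np.hstack([t**2, np.zeros_like(t), np.zeros_like(t)])
acc = virtual_imu_acceleration(traj, fs=100)
# acc[:, 0] is ~2.0 throughout; the y and z axes stay at 0.
```

Because the central difference is exact for quadratic trajectories, the sanity check recovers the constant acceleration exactly; for noisy pose estimates from video, low-pass filtering before differentiation would be needed in practice.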
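The WER figures quoted above (6.3%, 14.7%, 8.3%) follow the standard definition: the minimum number of gloss substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length. As a hedged sketch (the gloss strings below are made-up examples, not from the paper), this can be computed with a Levenshtein dynamic program over gloss sequences:

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / len(reference),
    computed as Levenshtein edit distance over whitespace-split glosses."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j]: edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,          # deletion
                           dp[i][j - 1] + 1,          # insertion
                           dp[i - 1][j - 1] + cost)   # substitution
    return dp[-1][-1] / len(ref)

# One substitution out of four reference glosses -> WER 0.25
print(word_error_rate("I GO STORE TODAY", "I GO HOME TODAY"))  # 0.25
```

Note that WER can exceed 1.0 when the hypothesis contains many insertions, which is why it is reported as an error rate rather than an accuracy.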
SignRing: Continuous American Sign Language Recognition Using IMU Rings and Virtual IMU Data