research-article

StethoSpeech: Speech Generation Through a Clinical Stethoscope Attached to the Skin

Published: 09 September 2024

Abstract

We introduce StethoSpeech, a silent speech interface that transforms flesh-conducted vibrations behind the ear into speech. The system is designed to improve social interactions for people with voice disorders and to enable discreet communication in public. Unlike prior efforts, StethoSpeech does not require (a) paired speech data for the recorded vibrations or (b) a specialized device for recording vibrations, as it works with an off-the-shelf clinical stethoscope. The novelty of our framework lies in its overall design, the simulation of ground-truth speech, and a sequence-to-sequence translation network that operates in the latent space. We present comprehensive experiments on the existing CSTR NAM TIMIT Plus corpus and on our proposed StethoText, a large-scale synchronized database of non-audible murmur and text for speech research. Our results show that StethoSpeech produces natural-sounding and intelligible speech, significantly outperforming existing methods on several quantitative and qualitative metrics. Additionally, we demonstrate that it generalizes to speakers not seen during training and remains effective in challenging, noisy environments. Speech samples are available at https://stethospeech.github.io/StethoSpeech/.
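To make the described pipeline concrete, the sketch below (in PyTorch) illustrates the kind of latent-space sequence-to-sequence mapping the abstract refers to: NAM feature frames captured through the stethoscope are translated into per-frame predictions over discrete speech units, which a unit vocoder could then render as audible speech. The class name, feature dimensions, unit vocabulary size, and layer counts are assumptions for illustration only, not the authors' released implementation.

```python
# Illustrative sketch (not the authors' code): a sequence-to-sequence model that
# maps NAM-derived feature frames to discrete speech-unit logits in a latent space.
# All dimensions and hyperparameters below are assumed values for demonstration.
import torch
import torch.nn as nn


class NAMToUnitTranslator(nn.Module):
    """Maps a sequence of NAM feature frames to HuBERT-style discrete unit logits."""

    def __init__(self, nam_dim=80, d_model=256, n_units=100, n_layers=4, n_heads=4):
        super().__init__()
        # Project NAM frames (e.g., log-mel features of the stethoscope signal) into the model dimension.
        self.input_proj = nn.Linear(nam_dim, d_model)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=4 * d_model,
            batch_first=True,
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=n_layers)
        # Per-frame logits over a vocabulary of discrete speech units.
        self.unit_head = nn.Linear(d_model, n_units)

    def forward(self, nam_feats):
        # nam_feats: (batch, time, nam_dim)
        x = self.input_proj(nam_feats)
        x = self.encoder(x)
        return self.unit_head(x)  # (batch, time, n_units)


if __name__ == "__main__":
    model = NAMToUnitTranslator()
    dummy_nam = torch.randn(2, 200, 80)            # two utterances, 200 frames each
    unit_logits = model(dummy_nam)
    predicted_units = unit_logits.argmax(dim=-1)   # discrete unit sequence per frame
    print(predicted_units.shape)                   # torch.Size([2, 200])
```

In a full system, the predicted unit sequence would be passed to a unit-based neural vocoder (e.g., a HiFi-GAN variant) to synthesize the waveform, and training would rely on the simulated ground-truth speech discussed in the paper.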


      Published In

      Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies  Volume 8, Issue 3
      September 2024
      1782 pages
      EISSN: 2474-9567
      DOI: 10.1145/3695755

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      Published: 09 September 2024
      Published in IMWUT Volume 8, Issue 3

      Author Tags

      1. HuBERT
      2. NAM-to-speech conversion
      3. StethoSpeech
      4. artificial learning
      5. self-supervised learning
      6. silent speech
      7. zero-pair setting
