
Endophasia: Utilizing Acoustic-Based Imaging for Issuing Contact-Free Silent Speech Commands

Published: 18 March 2020

Abstract

Using silent speech to issue commands has received growing attention, as users can reuse existing command sets from voice-based interfaces without attracting others' attention. Such interaction preserves privacy and social acceptance. However, current solutions for recognizing silent speech rely mainly on camera data or on sensors attached to the throat. Camera-based solutions consume 5.82 times more power or raise potential privacy issues; attaching sensors to the throat is not practical for commercial off-the-shelf (COTS) devices because additional hardware is required. In this paper, we propose a sensing technique that needs only a microphone and a speaker on COTS devices, which consumes little power and raises fewer privacy concerns. By deconstructing the received acoustic signals, a 2D motion profile can be generated. We propose a classifier based on convolutional neural networks (CNNs) that identifies the corresponding silent command from the 2D motion profiles. The classifier can adapt to new users and is robust to environmental factors. Our evaluation shows that the system achieves 92.5% accuracy in classifying 20 commands.
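To give a concrete sense of the "deconstructing received acoustic signals into a 2D motion profile" idea, the sketch below matched-filters each received frame against a transmitted ultrasonic chirp and stacks the per-frame echo (delay) profiles over time. This is a hedged illustration only, not the paper's actual pipeline; the chirp parameters, frame layout, and single-echo model are assumptions made for the example.

```python
import numpy as np

def motion_profile(received, chirp, frame_len):
    """Cross-correlate each frame of the received signal with the
    transmitted chirp. Stacking the per-frame correlation (echo delay)
    profiles over time yields a 2D motion profile: delay x frame."""
    n_frames = len(received) // frame_len
    rows = []
    for i in range(n_frames):
        frame = received[i * frame_len:(i + 1) * frame_len]
        # matched filtering: a correlation peak marks an echo's delay
        rows.append(np.abs(np.correlate(frame, chirp, mode="valid")))
    return np.array(rows)  # shape: (n_frames, frame_len - len(chirp) + 1)

# Synthetic example (assumed parameters): a 17-23 kHz chirp whose echo
# delay drifts across frames, mimicking a slowly moving reflector.
fs = 48000
t = np.arange(480) / fs                       # 10 ms chirp
f0, bandwidth, dur = 17000.0, 6000.0, 480 / fs
chirp = np.sin(2 * np.pi * (f0 * t + (bandwidth / (2 * dur)) * t**2))

frame_len = 2400
rx = np.zeros(frame_len * 20)
for i in range(20):
    d = 100 + 5 * i                           # echo delay in samples
    rx[i * frame_len + d:i * frame_len + d + len(chirp)] += chirp

mp = motion_profile(rx, chirp, frame_len)
peaks = mp.argmax(axis=1)                     # per-frame delay estimate
```

On this synthetic input, `peaks` recovers the drifting delays 100, 105, ..., 195; a real system would feed profiles like `mp` (built from lip and mouth reflections rather than a single clean echo) into the CNN classifier.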




    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 4, Issue 1
    March 2020, 1006 pages
    EISSN: 2474-9567
    DOI: 10.1145/3388993

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. acoustic-based imaging
    2. mobile devices
    3. silent command

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ministry of Science and Technology of Taiwan
    • National Chiao Tung University
    • Startup Fund for Youngman Research at SJTU
    • Joint Key Project of the NSFC
    • National Key R&D Program of China


    Article Metrics

    • Downloads (Last 12 months)126
    • Downloads (Last 6 weeks)10
    Reflects downloads up to 17 Feb 2025

    Cited By
    • Eternity in a Second: Quick-pass Continuous Authentication Using Out-ear Microphones. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (2024), 675-688. https://doi.org/10.1145/3666025.3699366
    • Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1-29. https://doi.org/10.1145/3659614
    • Sensing to Hear through Memory. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1-31. https://doi.org/10.1145/3659598
    • HandPad: Make Your Hand an On-the-go Writing Pad via Human Capacitance. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (2024), 1-16. https://doi.org/10.1145/3654777.3676328
    • WhisperMask: A Noise Suppressive Mask-type Microphone for Whisper Speech. Proceedings of the Augmented Humans International Conference 2024, 1-14. https://doi.org/10.1145/3652920.3652925
    • Room-Scale Location Trace Tracking via Continuous Acoustic Waves. ACM Transactions on Sensor Networks (2024). https://doi.org/10.1145/3649136
    • EyeEcho: Continuous and Low-power Facial Expression Tracking on Glasses. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-24. https://doi.org/10.1145/3613904.3642613
    • UltraSR: Silent Speech Reconstruction via Acoustic Sensing. IEEE Transactions on Mobile Computing 23, 12 (2024), 12848-12865. https://doi.org/10.1109/TMC.2024.3419170
    • EarSSR: Silent Speech Recognition via Earphones. IEEE Transactions on Mobile Computing 23, 8 (2024), 8493-8507. https://doi.org/10.1109/TMC.2024.3356719
    • Acoustic-based Lip Reading for Mobile Devices: Dataset, Benchmark and a Self Distillation-based Approach. IEEE Transactions on Mobile Computing (2024), 1-18. https://doi.org/10.1109/TMC.2023.3294416
