
Endophasia: Utilizing Acoustic-Based Imaging for Issuing Contact-Free Silent Speech Commands

Published: 18 March 2020

Abstract

Using silent speech to issue commands has received growing attention, as users can reuse existing command sets from voice-based interfaces without attracting others' attention. Such interaction preserves privacy and social acceptance. However, current solutions for recognizing silent speech rely mainly on camera data or on sensors attached to the throat. Camera-based solutions consume 5.82 times more power or raise potential privacy issues; attaching sensors to the throat is not practical for commercial off-the-shelf (COTS) devices because additional hardware is required. In this paper, we propose a sensing technique that needs only a microphone and a speaker on COTS devices, which consumes little power and raises fewer privacy concerns. By deconstructing the received acoustic signals, a 2D motion profile can be generated. We propose a classifier based on convolutional neural networks (CNNs) that identifies the corresponding silent command from the 2D motion profiles. The classifier can adapt to new users and is robust to environmental factors. Our evaluation shows that the system achieves 92.5% accuracy in classifying 20 commands.
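To give a concrete sense of the "deconstructing received acoustic signals into a 2D motion profile" idea, the sketch below matched-filters each received frame against a transmitted ultrasonic chirp and stacks the per-frame echo (delay) profiles over time. This is a hedged illustration only, not the paper's actual pipeline; the chirp parameters, frame layout, and single-echo model are assumptions made for the example.

```python
import numpy as np

def motion_profile(received, chirp, frame_len):
    """Cross-correlate each frame of the received signal with the
    transmitted chirp. Stacking the per-frame correlation (echo delay)
    profiles over time yields a 2D motion profile: delay x frame."""
    n_frames = len(received) // frame_len
    rows = []
    for i in range(n_frames):
        frame = received[i * frame_len:(i + 1) * frame_len]
        # matched filtering: a correlation peak marks an echo's delay
        rows.append(np.abs(np.correlate(frame, chirp, mode="valid")))
    return np.array(rows)  # shape: (n_frames, frame_len - len(chirp) + 1)

# Synthetic example (assumed parameters): a 17-23 kHz chirp whose echo
# delay drifts across frames, mimicking a slowly moving reflector.
fs = 48000
t = np.arange(480) / fs                       # 10 ms chirp
f0, bandwidth, dur = 17000.0, 6000.0, 480 / fs
chirp = np.sin(2 * np.pi * (f0 * t + (bandwidth / (2 * dur)) * t**2))

frame_len = 2400
rx = np.zeros(frame_len * 20)
for i in range(20):
    d = 100 + 5 * i                           # echo delay in samples
    rx[i * frame_len + d:i * frame_len + d + len(chirp)] += chirp

mp = motion_profile(rx, chirp, frame_len)
peaks = mp.argmax(axis=1)                     # per-frame delay estimate
```

On this synthetic input, `peaks` recovers the drifting delays 100, 105, ..., 195; a real system would feed profiles like `mp` (built from lip and mouth reflections rather than a single clean echo) into the CNN classifier.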




    Published In

    Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies, Volume 4, Issue 1
    March 2020, 1006 pages
    EISSN: 2474-9567
    DOI: 10.1145/3388993

    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. acoustic-based imaging
    2. mobile devices
    3. silent command

    Qualifiers

    • Research-article
    • Research
    • Refereed

    Funding Sources

    • Ministry of Science and Technology of Taiwan
    • National Chiao Tung University
    • Startup Fund for Youngman Research at SJTU
    • Joint Key Project of the NSFC
    • National Key R&D Program of China


    Article Metrics

    • Downloads (Last 12 months)126
    • Downloads (Last 6 weeks)10
    Reflects downloads up to 17 Feb 2025

    Cited By
    • Eternity in a Second: Quick-pass Continuous Authentication Using Out-ear Microphones. Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems (2024), 675-688. https://doi.org/10.1145/3666025.3699366
    • Lipwatch: Enabling Silent Speech Recognition on Smartwatches using Acoustic Sensing. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1-29. https://doi.org/10.1145/3659614
    • Sensing to Hear through Memory. Proceedings of the ACM on Interactive, Mobile, Wearable and Ubiquitous Technologies 8, 2 (2024), 1-31. https://doi.org/10.1145/3659598
    • HandPad: Make Your Hand an On-the-go Writing Pad via Human Capacitance. Proceedings of the 37th Annual ACM Symposium on User Interface Software and Technology (2024), 1-16. https://doi.org/10.1145/3654777.3676328
    • WhisperMask: A Noise Suppressive Mask-type Microphone for Whisper Speech. Proceedings of the Augmented Humans International Conference 2024, 1-14. https://doi.org/10.1145/3652920.3652925
    • Room-Scale Location Trace Tracking via Continuous Acoustic Waves. ACM Transactions on Sensor Networks (2024). https://doi.org/10.1145/3649136
    • EyeEcho: Continuous and Low-power Facial Expression Tracking on Glasses. Proceedings of the 2024 CHI Conference on Human Factors in Computing Systems, 1-24. https://doi.org/10.1145/3613904.3642613
    • UltraSR: Silent Speech Reconstruction via Acoustic Sensing. IEEE Transactions on Mobile Computing 23, 12 (2024), 12848-12865. https://doi.org/10.1109/TMC.2024.3419170
    • EarSSR: Silent Speech Recognition via Earphones. IEEE Transactions on Mobile Computing 23, 8 (2024), 8493-8507. https://doi.org/10.1109/TMC.2024.3356719
    • Acoustic-based Lip Reading for Mobile Devices: Dataset, Benchmark and a Self Distillation-based Approach. IEEE Transactions on Mobile Computing (2024), 1-18. https://doi.org/10.1109/TMC.2023.3294416
