Abstract
This paper develops, simulates and experimentally evaluates a novel method based on non-contact low-frequency (LF) ultrasound which can determine, from airborne reflection, whether the lips of a subject are open or closed. The method accurately distinguishes between open and closed lip states using a low-complexity detection algorithm, and is highly robust to interfering audible noise. A novel voice activity detector is implemented and evaluated using the proposed method and is shown to detect voice activity with high accuracy, even in the presence of high levels of background noise. The lip state detector is evaluated at a number of angles of incidence to the mouth and under various conditions of background noise. The underlying mouth state detection technique relies upon an inaudible LF ultrasonic excitation generated in front of the face of a user. When the mouth is closed, the excitation reflects back from the face as a simple echo; when the mouth is open, it resonates inside the mouth and vocal tract, which alters the spectral response of the reflected wave. The difference between echo and resonance behaviours is used as the basis for automated lip opening detection, that is, determining whether the mouth is open or closed at the lips. Beyond voice activity detection, potential applications include voice generation prostheses for speech-impaired patients and hands-free control of electrolarynx and similar rehabilitation devices. The method is also applicable to silent speech interfaces and may be useful for speech authentication.
Notes
The speed of sound is approximated as 1,600 m/s in muscle and 343 m/s in air.
The six vowel geometries are: /i/, /æ/, /u/, /ɛ/, /ɔ/, /o/ as in heed, had, who, head, paw, and hoe respectively.
The conversion was made using the lpcaa2rf() and lpcrf2ar() functions from the excellent Voicebox package [9].
Office and Car recordings were obtained as 96 kHz, 24- and 32-bit sample files from Freesound.org (nos. 108695 and 193780 respectively), recorded on a Tascam DR-100 mk-II using its on-board directional condenser microphones (TEAC Corp., Tokyo, Japan). Other recordings were made by the author using the on-board directional condenser microphones of a Zoom H4n (Zoom Corp., Tokyo, Japan), at a 96 kHz sample rate with 16-bit resolution. The original recordings are available upon request.
Note that, since the system detects lip opening rather than speaking, it is possible that some of these false detections did actually correspond to non-speech lip opening events if the subject opened their lips, for example to breathe through their mouth.
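The conversion mentioned in the note on Voicebox above, from vocal tract section areas to reflection coefficients and then to autoregressive (AR) coefficients, can be illustrated with a minimal Python sketch. This is not the Voicebox code: the function names areas_to_reflection and reflection_to_ar are hypothetical, the example areas are invented for illustration, and the sign and ordering conventions of the reflection coefficients are assumptions that may differ from those used by lpcaa2rf() and lpcrf2ar().

```python
import numpy as np


def areas_to_reflection(areas):
    """Reflection coefficient at each junction between adjacent tube sections.

    Assumed convention (may differ from Voicebox): for neighbouring
    cross-sectional areas A_k and A_{k+1},
        r_k = (A_k - A_{k+1}) / (A_k + A_{k+1}).
    """
    a = np.asarray(areas, dtype=float)
    return (a[:-1] - a[1:]) / (a[:-1] + a[1:])


def reflection_to_ar(rf):
    """Step-up (Levinson) recursion from reflection coefficients to an
    AR polynomial [1, a_1, ..., a_p]."""
    ar = np.array([1.0])
    for k in rf:
        padded = np.append(ar, 0.0)
        # a_i <- a_i + k * a_{m-i}: add the reversed polynomial scaled by k
        ar = padded + k * padded[::-1]
    return ar


# Hypothetical three-section tube (arbitrary area units), not measured data.
example_areas = [1.0, 3.0, 1.0]
example_rf = areas_to_reflection(example_areas)
example_ar = reflection_to_ar(example_rf)
print(example_rf, example_ar)
```

A uniform tube has no area discontinuities, so all reflection coefficients are zero and the AR polynomial reduces to a leading 1 followed by zeros; any departure from that reflects changes in cross-section along the tract.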
References
F. Ahmadi, Voice replacement for the severely speech impaired through sub-ultrasonic excitation of the vocal tract. Ph.D. Thesis, Nanyang Technological University (2013). http://repository.ntu.edu.sg/handle/10356/52661
F. Ahmadi, M. Ahmadi, I.V. McLoughlin, Human mouth state detection using low frequency ultrasound, in INTERSPEECH (2013), pp. 1806–1810
F. Ahmadi, I.V. McLoughlin, The use of low-frequency ultrasonics in speech processing, in Signal Processing, ed. by Sebastian Miron (InTech, 2010). ISBN: 978-953-7619-91-6
F. Ahmadi, I.V. McLoughlin, Measuring resonances of the vocal tract using frequency sweeps at the lips, in 2012 5th International Symposium on Communications Control and Signal Processing (ISCCSP) (2012)
F. Ahmadi, I.V. McLoughlin, S. Chauhan, G. ter Haar, Bio-effects and safety of low-intensity, low-frequency ultrasonic exposure. Progr. Biophys. Mol. Biol. 108, 3 (2012)
F. Ahmadi, I.V. McLoughlin, H.R. Sharifzadeh, Autoregressive modelling for linear prediction of ultrasonic speech, in INTERSPEECH (2010), pp. 1616–1619
S.P. Arjunan, H. Weghorn, D.K. Kumar, W.C. Yau, Vowel recognition of English and German language using facial movement (SEMG) for speech control based HCI, in Proceedings of the HCSNet workshop on Use of vision in human–computer interaction—Volume 56, VisHCI ’06 (Australian Computer Society, Inc., 2006), pp. 13–18
D. Beautemps, P. Badin, R. Laboissière, Deriving vocal-tract area functions from midsagittal profiles and formant frequencies: a new model for vowels and fricative consonants based on experimental data. Speech Commun. 16, 27–47 (1995)
M. Brookes, et al., Voicebox: Speech processing toolbox for matlab. Software, available [Mar. 2011] from www.ee.ic.ac.uk/hp/staff/dmb/voicebox/voicebox.html (1997)
G.L. Calhoun, G.R. McMillan, Hands-free input devices for wearable computers, in Proceedings of the Fourth Symposium on Human Interaction with Complex Systems, HICS ’98, (IEEE Computer Society 1998) p. 118
B.G. Douglass, Apparatus and Method for Detecting Speech Using Acoustic Signals Outside the Audible Frequency Range (United States Patent and Trademark Office, United States, 2006)
J. Epps, J.R. Smith, J. Wolfe, A novel instrument to measure acoustic resonances of the vocal tract during speech. Meas. Sci. Technol. 8, 1112–1121 (1997)
L.J. Eriksson, Higher order mode effects in circular ducts and expansion chambers. J. Acoust. Soc. Am. 68(2), 545–550 (1980)
J.-P. Fouque, J. Garnier, G. Papanicolaou, K. Solna, Wave Propagation and Time Reversal in Randomly Layered Media (Springer, 2010)
J. Freitas, A. Teixeira, M.S. Dias, Towards a silent speech interface for Portuguese: surface electromyography and the nasality challenge, in Proceedings of the International Conference on Bio-inspired Systems and Signal Processing BIOSIGNALS 2012 (Vilamoura, Algarve, Portugal, 2012)
C. Jorgensen, S. Dusan, Speech interfaces based upon surface electromyography. Speech Commun. 52(4), 354–366 (2010)
K. Kalgaonkar, R. Hu, B. Raj, Ultrasonic doppler sensor for voice activity detection. IEEE Signal Process. Lett. 14(10), 754–757 (2007)
R. Kaucic, B. Dalton, A. Blake, Real-time lip tracking for audio-visual speech recognition applications, in Computer Vision ECCV ’96, vol. 1065, Lecture Notes in Computer Science, ed. by B. Buxton, R. Cipolla (Springer, Berlin / Heidelberg, 1996), pp. 376–387
M. Kob, C. Neuschaefer-Rube, A method for measurement of the vocal tract impedance at the mouth. Med. Eng. Phys. 24, 467–471 (2002)
R.J. Lahr, Head-worn, Trimodal Device to Increase Transcription Accuracy in a Voice Recognition System and to Process Unvocalized Speech (United States Patent and Trademark Office, United States, 2002)
I. McLoughlin, Super-audible voice activity detection. IEEE/ACM Trans. Audio Speech Lang. Process. 22(9), 1424–1433 (2014). doi:10.1109/TASLP.2014.2335055
I.V. McLoughlin, Applied Speech and Audio Processing (Cambridge University Press, Cambridge, 2009)
I.V. McLoughlin, F. Ahmadi, Method and apparatus for determining mouth state using low frequency ultrasonics. UK Patent Office (pending) (2012)
I.V. McLoughlin, F. Ahmadi, A new mechanical index for gauging the human bioeffects of low frequency ultrasound, in Proceedings of the IEEE Engineering in Medicine and Biology Conference, (2013), pp. 1964–1967
B. Rivet, L. Girin, C. Jutten, Mixing audiovisual speech processing and blind source separation for the extraction of speech signals from convolutive mixtures. IEEE Trans. Audio Speech Lang. Process. 15(1), 96–108 (2007)
H.R. Sharifzadeh, I.V. McLoughlin, F. Ahmadi, Speech rehabilitation methods for laryngectomised patients, in Electronic Engineering and Computing Technology, vol. 60, Lecture Notes in Electrical Engineering, ed. by S.I. Ao, L. Gelman (Springer, Netherlands, 2010), pp. 597–607
D.J. Sinder, Speech synthesis using an aeroacoustic fricative model (PhD Thesis). The State University of New Jersey (1999)
M.M. Sondhi, B. Gopinath, Determination of vocal-tract shape from impulse response at the lips. J. Acoust. Soc. Am. 49(6), 1867–1873 (1971)
B.H. Story, Physiologically-based speech simulation using an enhanced wave-reflection model of the vocal tract (PhD Thesis). The University of Iowa (1995)
B.H. Story, I.R. Titze, E.A. Hoffman, Vocal tract area functions from magnetic resonance imaging. J. Acoust. Soc. Am. 100, 1 (1996)
Texas Instruments, TIMIT database (Texas Instruments and MIT). A CD-ROM database of phonetically classified recordings of sentences spoken by a number of different male and female speakers (1990)
C.A. Tosaya, J.W. Sliwa, Signal Injection Coupling into the Human Vocal Tract for Robust Audible and Inaudible Voice Recognition (United States Patent and Trademark Office, United States, 1999)
H.K. Vorperian, S. Wang, M.K. Chung, E.M. Schimek, R.B. Durtschi, R.D. Kent, A.J. Ziegert, L.R. Gentry, Anatomic development of the oral and pharyngeal portions of the vocal tract: an imaging study. J. Acoust. Soc. Am. 125, 1666 (2009)
J. Wolfe, M. Garnier, J. Smith, Vocal tract resonances in speech, singing and playing musical instruments. Hum. Front. Sci. Progr. J. 3, 6–23 (2009)
J.A. Zagzebski, Essentials of Ultrasound Physics (Mosby, Elsevier, St. Louis, 1996)
A.J. Zuckerwar, Speed of sound in fluids, in Handbook of Acoustics, ed. by M.J. Crocker (Wiley, New York, 1998)
Acknowledgments
Some of the data for this paper was recorded and processed at the School of Computer Engineering, Nanyang Technological University (NTU), Singapore by student assistants Farzaneh Ahmadi, Mark Huan, and Chu Thanh Minh. Their contribution to this work is gratefully acknowledged, particularly the PhD research of Farzaneh Ahmadi [1]. Thanks are also due to Prof. Eng Siong Chng of NTU, and Jingjie Li of USTC for their assistance with the experimental work.
About this article
Cite this article
McLoughlin, I.V., Song, Y. Mouth State Detection From Low-Frequency Ultrasonic Reflection. Circuits Syst Signal Process 34, 1279–1304 (2015). https://doi.org/10.1007/s00034-014-9904-4
DOI: https://doi.org/10.1007/s00034-014-9904-4