DOI: 10.1145/3488932.3517420
Research Article | Public Access

SUPERVOICE: Text-Independent Speaker Verification Using Ultrasound Energy in Human Speech

Published: 30 May 2022

Abstract

Voice-activated systems are integrated into a variety of desktop, mobile, and Internet-of-Things (IoT) devices. However, voice spoofing attacks, in which malicious attackers synthesize the voice of a victim or simply replay it (impersonation and replay attacks), have raised growing security concerns. Existing speaker verification techniques distinguish individual speakers via spectrographic features extracted from the audible frequency range of voice commands, but they often suffer from high error rates and/or long delays. In this paper, we explore a new direction of human voice research by scrutinizing the unique characteristics of human speech in the ultrasound frequency band. Our research indicates that the high-frequency ultrasound components (e.g., speech fricatives) from 20 to 48 kHz can significantly enhance the security and accuracy of speaker verification. We propose SUPERVOICE, a speaker verification system that uses a two-stream DNN architecture with a feature fusion mechanism to generate distinctive speaker models. To test the system, we create a speech dataset with 12 hours of audio (8,950 voice samples) from 127 participants, along with a second, spoofed-voice dataset to evaluate security. To balance controlled recording conditions with real-world applicability, the audio is collected in two quiet rooms using 8 different recording devices: 7 smartphones and an ultrasound microphone. Our evaluation shows that SUPERVOICE achieves a 0.58% equal error rate in the speaker verification task, reducing the best equal error rate of existing systems by 86.1%. SUPERVOICE takes only 120 ms to test an incoming utterance, outperforming all existing speaker verification systems. Moreover, within 91 ms of processing time, SUPERVOICE achieves a 0% equal error rate in detecting replay attacks launched by 5 different loudspeakers. Finally, we demonstrate that SUPERVOICE can be used on retail smartphones by integrating an off-the-shelf ultrasound microphone.
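The abstract's headline numbers are equal error rates (EER): the operating point at which a verifier's false-acceptance rate (impostors accepted) equals its false-rejection rate (genuine speakers rejected). As a minimal illustrative sketch (not the paper's evaluation code), EER can be estimated from genuine and impostor similarity scores by sweeping a decision threshold:

```python
def equal_error_rate(genuine, impostor):
    """Estimate EER by sweeping a threshold over all observed scores.

    genuine:  similarity scores for matched (same-speaker) trials
    impostor: similarity scores for mismatched (different-speaker) trials
    """
    best_gap, eer = float("inf"), None
    for t in sorted(set(genuine) | set(impostor)):
        far = sum(s >= t for s in impostor) / len(impostor)  # false accepts
        frr = sum(s < t for s in genuine) / len(genuine)     # false rejects
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2
    return eer

# A verifier whose score distributions separate perfectly has 0% EER:
print(equal_error_rate([0.9, 0.8, 0.7], [0.1, 0.2, 0.3]))  # 0.0
```

A lower EER means the genuine and impostor score distributions overlap less, so a 0.58% EER indicates near-perfect separation on the evaluated dataset.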

Supplementary Material

MP4 File (SUPERVOICE.mp4)
Presentation video


Cited By

  • (2024) PiezoBud: A Piezo-Aided Secure Earbud with Practical Speaker Authentication. In Proceedings of the 22nd ACM Conference on Embedded Networked Sensor Systems, 564-577. https://doi.org/10.1145/3666025.3699358
  • (2024) Toward Pitch-Insensitive Speaker Verification via Soundfield. IEEE Internet of Things Journal 11, 1, 1175-1189. https://doi.org/10.1109/JIOT.2023.3290001
  • (2023) PhantomSound: Black-Box, Query-Efficient Audio Adversarial Attack via Split-Second Phoneme Injection. In Proceedings of the 26th International Symposium on Research in Attacks, Intrusions and Defenses, 366-380. https://doi.org/10.1145/3607199.3607240
  • (2023) MASTERKEY: Practical Backdoor Attack Against Speaker Verification Systems. In Proceedings of the 29th Annual International Conference on Mobile Computing and Networking, 1-15. https://doi.org/10.1145/3570361.3613261

    Published In

    ASIA CCS '22: Proceedings of the 2022 ACM on Asia Conference on Computer and Communications Security
    May 2022
    1291 pages
    ISBN:9781450391405
    DOI:10.1145/3488932

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. speaker verification
    2. ultrasound
    3. voice authentication

    Qualifiers

    • Research-article

Conference

ASIA CCS '22

Acceptance Rates

Overall Acceptance Rate: 418 of 2,322 submissions, 18%
