ABSTRACT
A prerequisite for field research on audio data is privacy-preserving recordings that exclusively contain the target speaker who gave consent. For this purpose, we investigated the potential of a simple but robust wearable technology consisting of three parts: first, a standard air-conduction microphone providing the audio quality necessary for speech analysis; second, a throat microphone used as a speech activity filter; and third, a custom ESP32-based recording device enabling on-device real-time processing. The system was evaluated in two challenging free-discussion settings with two and four participants each (total N=16). Manual annotations yield an Equal Error Rate of M=23.4-29.69 %. Following simple instructions, our participants maintained a False Acceptance Rate below 5 % while recording more than half of their own utterances.
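The core idea of using the throat microphone as a speech activity filter can be sketched as a frame-level gate: because a throat microphone picks up the wearer's body-conducted speech strongly but bystander speech only weakly, short-term energy on the throat channel indicates when the wearer is speaking, and only those frames of the air-conduction channel are retained. The following is a minimal illustrative sketch, not the authors' implementation; the frame length and energy threshold are assumed values for demonstration.

```python
import numpy as np

def throat_gated_speech(air, throat, sr, frame_ms=20, energy_thresh=0.01):
    """Keep only air-microphone frames during which the throat channel
    shows own-voice activity. frame_ms and energy_thresh are illustrative."""
    frame = int(sr * frame_ms / 1000)
    n = min(len(air), len(throat)) // frame
    kept, mask = [], []
    for i in range(n):
        seg_throat = throat[i * frame:(i + 1) * frame]
        # Short-term RMS energy of the throat channel: body-conducted
        # speech of the wearer is strong here, ambient speech is not.
        rms = np.sqrt(np.mean(seg_throat.astype(np.float64) ** 2))
        active = rms > energy_thresh
        mask.append(active)
        if active:
            kept.append(air[i * frame:(i + 1) * frame])
    gated = np.concatenate(kept) if kept else np.zeros(0, dtype=air.dtype)
    return gated, np.array(mask)
```

A threshold set too low raises the False Acceptance Rate (bystander speech leaks into the recording), while one set too high discards the wearer's own utterances; the Equal Error Rate reported in the abstract characterizes exactly this trade-off.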
Index Terms
- Privacy Preserving Continuous Speech Recording using Throat Microphones