Compressed, Real-Time Voice Activity Detection with Open Source Implementation for Small Devices

Authors:

Lasse R. Andersen,

Lukas J. Jacobsen,

David Campos

iWOAR '23: Proceedings of the 8th international Workshop on Sensor-Based Activity Recognition and Artificial Intelligence

Article No.: 1, Pages 1 - 10

https://doi.org/10.1145/3615834.3615835

Published: 11 October 2023

Abstract

This paper proposes a real-time voice activity detection (VAD) system built on a compressed convolutional neural network (CNN) model. On general-purpose computers, the system accurately classifies the presence of speech in audio with low latency. On small devices, however, it shows higher latency, presumably because of computationally heavy preprocessing steps. The evaluation indicates that the proposed VAD system improves on existing solutions by reducing model size and increasing accuracy across several evaluation metrics. The system also extends applicability by training the CNN model on a different and more diverse data set. Moreover, the proposed architecture can be compressed to approximately one-eleventh of its original size, facilitating eventual deployment on small devices. In contrast to existing closed VAD solutions, the entire pipeline of the proposed VAD system is developed in Python and released as open source, ensuring the verifiability and accessibility of the work.
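The abstract attributes the higher latency on small devices to the preprocessing steps rather than to the model. One way to check such a claim is to time the two stages separately for each frame. The following is a minimal sketch of that idea only, not the authors' pipeline: the function names, the 160-sample frame length (10 ms at an assumed 16 kHz sample rate), the log-energy feature, and the threshold classifier standing in for the CNN are all hypothetical.

```python
import math
import time


def hann_log_energy(frame):
    """Hypothetical preprocessing: Hann-window the frame, return its log energy."""
    n = len(frame)
    windowed = [
        s * 0.5 * (1.0 - math.cos(2.0 * math.pi * i / (n - 1)))
        for i, s in enumerate(frame)
    ]
    energy = sum(w * w for w in windowed) / n
    return math.log10(energy + 1e-12)  # small offset avoids log10(0) on silence


def threshold_classifier(feature, threshold=-3.0):
    """Stand-in for the CNN (assumption): speech when log energy exceeds threshold."""
    return feature > threshold


def profile_stages(frames, preprocess, classify):
    """Accumulate wall-clock time spent in each stage across all frames."""
    timings = {"preprocess": 0.0, "classify": 0.0}
    decisions = []
    for frame in frames:
        t0 = time.perf_counter()
        feature = preprocess(frame)
        t1 = time.perf_counter()
        decisions.append(classify(feature))
        t2 = time.perf_counter()
        timings["preprocess"] += t1 - t0
        timings["classify"] += t2 - t1
    return timings, decisions


# Synthetic input: one silent frame and one 440 Hz tone frame (assumed 16 kHz).
silence = [0.0] * 160
tone = [0.5 * math.sin(2.0 * math.pi * 440.0 * i / 16000.0) for i in range(160)]

timings, decisions = profile_stages(
    [silence, tone], hann_log_energy, threshold_classifier
)
print(decisions)
print(timings)
```

Comparing the two accumulated totals on the target device would show which stage dominates; with a real feature extractor and model substituted for the placeholders, the same loop applies unchanged.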

Index Terms

1. Computing methodologies
  1. Machine learning
    1. Learning paradigms
      1. Supervised learning
        Supervised learning by classification

Published In

iWOAR '23: Proceedings of the 8th international Workshop on Sensor-Based Activity Recognition and Artificial Intelligence

September 2023

171 pages

ISBN: 9798400708169

DOI: 10.1145/3615834

Copyright © 2023 ACM.

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

iWOAR 2023: 8th international Workshop on Sensor-Based Activity Recognition and Artificial Intelligence

September 21 - 22, 2023

Lübeck, Germany

Acceptance Rates

Overall Acceptance Rate 46 of 73 submissions, 63%
