DOI: 10.1145/3615834.3615835
research-article

Compressed, Real-Time Voice Activity Detection with Open Source Implementation for Small Devices

Published: 11 October 2023

Abstract

This paper proposes a real-time voice activity detection (VAD) system built on a compressed convolutional neural network (CNN) model. On general-purpose computers, the system accurately classifies the presence of speech in audio with low latency. On small devices, however, it shows higher latency, presumably because of computationally heavy preprocessing steps. The evaluation indicates that the proposed VAD system improves on existing solutions, with a smaller model size and higher accuracy across several evaluation metrics. Training the CNN model on a different and more diverse data set also extends the system's applicability. Moreover, the proposed architecture can be compressed to approximately one-eleventh of its original size, facilitating eventual deployment on small devices. In contrast to existing closed VAD solutions, the entire pipeline of the proposed VAD system is developed in Python and released as open source, ensuring that the work is verifiable and accessible.

    Published In

    iWOAR '23: Proceedings of the 8th international Workshop on Sensor-Based Activity Recognition and Artificial Intelligence
    September 2023
    171 pages
    ISBN:9798400708169
    DOI:10.1145/3615834
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. convolutional neural network
    2. model compression
    3. open source VAD
    4. real-time VAD
    5. voice activity detection

    Qualifiers

    • Research-article
    • Research
    • Refereed limited

    Conference

    iWOAR 2023

    Acceptance Rates

    Overall Acceptance Rate 46 of 73 submissions, 63%
