
Hardware Acceleration for Embedded Keyword Spotting: Tutorial and Survey

Published: 18 October 2021

Abstract

In recent years, Keyword Spotting (KWS) has become a crucial human–machine interface for mobile devices, allowing users to interact more naturally with their gadgets by voice. Due to privacy, latency, and energy requirements, executing KWS tasks on the embedded device itself rather than in the cloud has attracted significant attention from the research community. However, the constraints of embedded systems, including limited energy, memory, and computational capacity, pose a real challenge for deploying such interfaces on-device. In this article, we explore and guide the reader through the design of KWS systems. To support this overview, we extensively survey the approaches taken by the recent state-of-the-art (SotA) at the algorithmic, architectural, and circuit levels to enable KWS tasks on edge devices. A quantitative and qualitative comparison of relevant SotA hardware platforms is carried out, highlighting current design trends and pointing out future research directions for this technology.



Published In

ACM Transactions on Embedded Computing Systems, Volume 20, Issue 6
November 2021
256 pages
ISSN:1539-9087
EISSN:1558-3465
DOI:10.1145/3485150
Editor: Tulika Mitra

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 18 October 2021
Accepted: 01 July 2021
Received: 01 January 2021
Published in TECS Volume 20, Issue 6


Author Tags

  1. Hardware acceleration
  2. speech recognition
  3. ASIC

Qualifiers

  • Research-article
  • Refereed

Funding Sources

  • EU ERC
  • Scientific Research Flanders (FWO-Vlaanderen)
  • Flemish Government (AI Research Program)
