EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting

  • Original Research
  • Published:
Journal of Ambient Intelligence and Humanized Computing

Abstract

Keyword Spotting (KWS) is a significant branch of Automatic Speech Recognition (ASR) and is widely used on edge computing devices. The goal of KWS is to provide high accuracy with a low False Alarm Rate (FAR) while reducing memory, computation, and latency costs. However, the limited resources of edge computing devices make KWS applications challenging. Lightweight deep learning models and structures have achieved good results in KWS while maintaining efficient performance. In this paper, we present a new Convolutional Recurrent Neural Network (CRNN) architecture named EdgeCRNN for edge computing devices. EdgeCRNN is based on depthwise separable convolutions and a residual structure, and it uses a feature enhancement method. On the Google Speech Commands Dataset, the experimental results show that EdgeCRNN can process 11.1 audio samples per second on a Raspberry Pi 3B+, which is 2.2 times the throughput of Tpool2. EdgeCRNN also reaches an accuracy of 98.05%, which is competitive with Tpool2.
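The abstract describes EdgeCRNN as building on depthwise separable convolutions combined with a residual structure. The following is a minimal, hypothetical PyTorch sketch of that kind of block, not the authors' actual architecture; the channel count, kernel size, and input shape are assumptions made only for illustration.

```python
# A minimal sketch (assumed layer sizes, not the paper's exact design) of a
# depthwise separable convolution block with a residual connection.
import torch
import torch.nn as nn

class DepthwiseSeparableResBlock(nn.Module):
    def __init__(self, channels: int, kernel_size: int = 3):
        super().__init__()
        padding = kernel_size // 2
        # Depthwise convolution: one filter per input channel (groups=channels)
        self.depthwise = nn.Conv2d(channels, channels, kernel_size,
                                   padding=padding, groups=channels, bias=False)
        # Pointwise (1x1) convolution mixes information across channels
        self.pointwise = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        self.bn = nn.BatchNorm2d(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection: add the block input back to its output
        out = self.relu(self.bn(self.pointwise(self.depthwise(x))))
        return out + x

# Illustrative input: a batch of 8 utterances represented as 40-band
# spectrogram-like features over 101 time frames, with 16 feature channels.
features = torch.randn(8, 16, 40, 101)
block = DepthwiseSeparableResBlock(channels=16)
print(block(features).shape)  # torch.Size([8, 16, 40, 101])
```

Because the depthwise and pointwise convolutions preserve the channel count and spatial size here, the residual addition needs no projection; a stride or channel change would require a matching shortcut.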



References

  • Abdel-Hamid O, Mohamed A-r, Jiang H, Deng L, Penn G, Yu D (2014) Convolutional neural networks for speech recognition. IEEE/ACM Trans Audio Speech Lang Process 22(10):1533–1545


  • Anderson A, Su J, Dahyot R, Gregg D (2020) Performance-oriented neural architecture search. arXiv preprint arXiv:2001.02976

  • Arik SO, Kliegl M, Child R, Hestness J, Gibiansky A, Fougner C, Prenger R, Coates A (2017) Convolutional recurrent neural networks for small-footprint keyword spotting. arXiv preprint arXiv:1703.05390

  • Benelli G, Meoni G, Fanucci L (2018) A low power keyword spotting algorithm for memory constrained embedded systems. In: 2018 IFIP/IEEE international conference on very large scale integration (VLSI-SoC). IEEE, pp 267–272

  • Chen G, Parada C, Heigold G (2014) Small-footprint keyword spotting using deep neural networks. In: 2014 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 4087–4091

  • Cho K, Van Merriënboer B, Gulcehre C, Bahdanau D, Bougares F, Schwenk H, Bengio Y (2014) Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078

  • Coucke A, Chlieh M, Gisselbrecht T, Leroy D, Poumeyrol M, Lavril T (2019) Efficient keyword spotting using dilated convolutions and gating. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 6351–6355

  • Custers B, Sears AM, Dechesne F, Georgieva I, Tani T, van der Hof S (2019) EU personal data protection in policy and practice. Springer, Berlin


  • Dey R, Salemt FM (2017) Gate-variants of gated recurrent unit (GRU) neural networks. In: 2017 IEEE 60th international midwest symposium on circuits and systems (MWSCAS). IEEE, pp 1597–1600

  • Dinelli G, Meoni G, Rapuano E, Benelli G, Fanucci L (2019) An FPGA-based hardware accelerator for CNNs using on-chip memories only: design and benchmarking with Intel Movidius Neural Compute Stick. Int J Reconfigurable Comput 2019:7218758


  • Du H, Li R, Kim D, Hirota K, Dai Y (2018) Low-latency convolutional recurrent neural network for keyword spotting. In: 2018 Joint 10th international conference on soft computing and intelligent systems (SCIS) and 19th International Symposium on Advanced Intelligent Systems (ISIS). IEEE, pp 802–807

  • Gaff BM, Sussman HE, Geetter J (2014) Privacy and big data. Computer 47(6):7–9


  • Glorot X, Bordes A, Bengio Y (2011) Deep sparse rectifier neural networks. In: Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp 315–323

  • He K, Zhang X, Ren S, Sun J (2016) Deep residual learning for image recognition. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 770–778

  • Hochreiter S, Schmidhuber J (1997) Long short-term memory. Neural Comput 9(8):1735–1780


  • Howard A, Sandler M, Chu G, Chen LC, Chen B, Tan M, Wang W, Zhu Y, Pang R, Vasudevan V et al (2019) Searching for MobileNetV3. In: Proceedings of the IEEE international conference on computer vision, pp 1314–1324

  • Howard AG, Zhu M, Chen B, Kalenichenko D, Wang W, Weyand T, Andreetto M, Adam H (2017) MobileNets: efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861

  • Ioffe S, Szegedy C (2015) Batch normalization: accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167

  • Luo R, Sun T, Wang C, Du M, Tang Z, Zhou K, Gong X, Yang X (2019) Multi-layer attention mechanism for speech keyword recognition. arXiv preprint arXiv:1907.04536

  • Ma N, Zhang X, Zheng HT, Sun J (2018) ShuffleNet V2: practical guidelines for efficient CNN architecture design. In: Proceedings of the European conference on computer vision (ECCV), pp 116–131

  • Mazzawi H, Gonzalvo X, Kracun A, Sridhar P, Subrahmanya N, Moreno IL, Park HJ, Violette P (2019) Improving keyword spotting and language identification via neural architecture search at scale. In: Proc Interspeech, vol 2019, pp 1278–1282

  • McFee B, Raffel C, Liang D, Ellis DP, McVicar M, Battenberg E, Nieto O (2015) librosa: audio and music signal analysis in Python. In: Proceedings of the 14th Python in science conference, vol 8

  • Mishchenko Y, Goren Y, Sun M, Beauchene C, Matsoukas S, Rybakov O, Vitaladevuni SNP (2019) Low-bit quantization and quantization-aware training for small-footprint keyword spotting. In: 2019 18th IEEE international conference on machine learning and applications (ICMLA). IEEE, pp 706–711

  • Nakkiran P, Alvarez R, Prabhavalkar R, Parada C (2015) Compressing deep neural networks using a rank-constrained topology. In: Proceedings of the annual conference of the International Speech Communication Association (Interspeech), pp 1473–1477

  • Sainath TN, Parada C (2015) Convolutional neural networks for small-footprint keyword spotting. In: Proceedings of the sixteenth annual conference of the International Speech Communication Association (Interspeech), pp 1478–1482

  • Sandler M, Howard A, Zhu M, Zhmoginov A, Chen LC (2018) MobileNetV2: inverted residuals and linear bottlenecks. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 4510–4520

  • Sifre L, Mallat S (2014) Rigid-motion scattering for image classification. Ph.D. Thesis

  • Silaghi MC (2005) Spotting subsequences matching an HMM using the average observation probability criteria with application to keyword spotting. In: AAAI, pp 1118–1123

  • Silaghi MC, Bourlard H (1999) Iterative posterior-based keyword spotting without filler models. In: Proceedings of the IEEE automatic speech recognition and understanding workshop. Citeseer, pp 213–216

  • Sun M, Raju A, Tucker G, Panchapagesan S, Fu G, Mandal A, Matsoukas S, Strom N, Vitaladevuni S (2016) Max-pooling loss training of long short-term memory networks for small-footprint keyword spotting. In: 2016 IEEE spoken language technology workshop (SLT). IEEE, pp 474–480

  • Sun M, Snyder D, Gao Y, Nagaraja VK, Rodehorst M, Panchapagesan S, Strom N, Matsoukas S, Vitaladevuni S (2017) Compressed time delay neural network for small-footprint keyword spotting. In: INTERSPEECH, pp 3607–3611

  • Tan M, Chen B, Pang R, Vasudevan V, Sandler M, Howard A, Le QV (2018) Resource-efficient neural architect. arXiv preprint arXiv:1806.07912

  • Tang R, Lin J (2018) Deep residual learning for small-footprint keyword spotting. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5484–5488

  • Tang R, Wang W, Tu Z, Lin J (2018) An experimental analysis of the power consumption of convolutional neural networks for keyword spotting. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 5479–5483

  • Tucker G, Wu M, Sun M, Panchapagesan S, Fu G, Vitaladevuni S (2016) Model compression applied to small-footprint keyword spotting. In: INTERSPEECH, pp 1878–1882

  • Véniat T, Schwander O, Denoyer L (2019) Stochastic adaptive neural architecture search for keyword spotting. In: ICASSP 2019–2019 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, pp 2842–2846

  • Warden P (2018) Speech commands: a dataset for limited-vocabulary speech recognition. arXiv preprint arXiv:1804.03209

  • Wilpon J, Miller L, Modi P (1991) Improvements and applications for key word recognition using hidden Markov modeling techniques. In: 1991 international conference on acoustics, speech, and signal processing. IEEE, pp 309–312

  • Zeng M, Xiao N (2019) Effective combination of DenseNet and BiLSTM for keyword spotting. IEEE Access 7:10767–10775


  • Zhang B, Li W, Li Q, Zhuang W, Chu X, Wang Y (2020) AutoKWS: keyword spotting with differentiable architecture search. arXiv preprint arXiv:2009.03658

  • Zhang X, Zhou X, Lin M, Sun J (2018) ShuffleNet: an extremely efficient convolutional neural network for mobile devices. In: Proceedings of the IEEE conference on computer vision and pattern recognition, pp 6848–6856

  • Zhang Y, Suda N, Lai L, Chandra V (2017) Hello edge: keyword spotting on microcontrollers. arXiv preprint arXiv:1711.07128


Author information


Corresponding author

Correspondence to Yamin Wen.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

A preliminary version of this paper was published at ML4CS 2020, Springer LNCS; this is the full-length version. This work is supported by the National Natural Science Foundation of China (No. 62072192), the National Cryptography Development Fund (No. MMJJ20180206), the Project of Science and Technology of Guangzhou (No. 201802010044), the Guangdong Basic and Applied Basic Research Foundation (No. 2019A1515011797), the Opening Project of the Guangdong Province Key Laboratory of Information Security Technology (No. 2020B1212060078), the Project of Guangdong Province Innovative Team (No. 2020WCXTD011), and the Research Team of Big Data Audit from Guangdong University of Finance and Economics.


About this article


Cite this article

Wei, Y., Gong, Z., Yang, S. et al. EdgeCRNN: an edge-computing oriented model of acoustic feature enhancement for keyword spotting. J Ambient Intell Human Comput 13, 1525–1535 (2022). https://doi.org/10.1007/s12652-021-03022-1

