Abstract
Keyword spotting (KWS) is important in numerous trigger, trigger-command, and command-and-control applications on embedded platforms. However, the embedded platforms currently used in the fast-growing Internet of Things (IoT) market and in standalone systems still have considerable processing power, memory, and battery constraints. In IoT and smart-device applications, speakers are usually far from the microphone, which results in severe distortion, considerable noise, and noticeable reverberation. Speech enhancement can be used as a front-end or pre-processing module to improve KWS performance. However, denoisers and dereverberators used as front-end processing modules add to the complexity of the keyword spotting system and to the computing, memory, and battery requirements of the embedded platform. In this paper, a noise-robust keyword spotting engine with a small memory footprint is presented. A deep neural network model is trained on multi-condition utterances to increase the noise robustness of keyword spotting. A comparative study evaluates the deep learning approach against a Gaussian mixture model. Experimental results show that deep learning outperforms the Gaussian mixture model approach in both clean and noisy conditions. Moreover, a deep learning model trained on partially noisy data removes the need for a speech enhancement module or denoiser for front-end processing.
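The multi-condition training idea mentioned in the abstract can be illustrated with a minimal sketch of how a partially noisy training set might be assembled. This is not the authors' pipeline: the function names (`mix_at_snr`, `make_multi_condition_set`), the noisy fraction, and the SNR values are all assumptions chosen for illustration, and the abstract does not state the actual noise types or mixing levels used.

```python
import math
import random


def mix_at_snr(clean, noise, snr_db):
    """Add noise to a clean signal, scaled so the mixture has the requested SNR in dB."""
    if len(noise) < len(clean):
        raise ValueError("noise must be at least as long as the clean signal")
    noise = noise[:len(clean)]
    clean_power = sum(x * x for x in clean) / len(clean)
    noise_power = sum(x * x for x in noise) / len(noise)
    if clean_power == 0.0 or noise_power == 0.0:
        return list(clean)  # nothing to scale against; leave the utterance untouched
    # Choose gain g so that clean_power / (g^2 * noise_power) = 10^(snr_db / 10).
    gain = math.sqrt(clean_power / (noise_power * 10.0 ** (snr_db / 10.0)))
    return [c + gain * n for c, n in zip(clean, noise)]


def make_multi_condition_set(utterances, noises, noisy_fraction=0.5,
                             snr_choices_db=(0, 5, 10, 15, 20), seed=0):
    """Return (samples, label, condition) triples, leaving some utterances clean.

    noisy_fraction and snr_choices_db are hypothetical defaults, not values
    taken from the paper.
    """
    rng = random.Random(seed)
    out = []
    for samples, label in utterances:
        if rng.random() < noisy_fraction:
            noise = rng.choice(noises)
            snr_db = rng.choice(snr_choices_db)
            out.append((mix_at_snr(samples, noise, snr_db), label, f"snr={snr_db}dB"))
        else:
            out.append((list(samples), label, "clean"))
    return out
```

Under these assumptions, the resulting mix of clean and corrupted utterances would be fed to the acoustic model during training, so that no separate denoiser has to run on the device at inference time.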
Copyright information
© 2019 Springer Nature Switzerland AG
About this paper
Cite this paper
Abdelmoula, R., Khamis, A., Karray, F. (2019). A Deep Learning-Based Noise-Resilient Keyword Spotting Engine for Embedded Platforms. In: Karray, F., Campilho, A., Yu, A. (eds) Image Analysis and Recognition. ICIAR 2019. Lecture Notes in Computer Science, vol 11663. Springer, Cham. https://doi.org/10.1007/978-3-030-27272-2_12
DOI: https://doi.org/10.1007/978-3-030-27272-2_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-27271-5
Online ISBN: 978-3-030-27272-2
eBook Packages: Computer Science (R0)