Neural Networks

Volume 150, June 2022, Pages 28-42

Two-stage streaming keyword detection and localization with multi-scale depthwise temporal convolution

https://doi.org/10.1016/j.neunet.2022.03.003

Abstract

A keyword spotting (KWS) system running on smart devices should accurately detect the appearances and predict the locations of predefined keywords in audio streams, with a small footprint and high efficiency. To this end, this paper proposes a new two-stage KWS method which combines a novel multi-scale depthwise temporal convolution (MDTC) feature extractor and a two-stage keyword detection and localization module. The MDTC feature extractor learns multi-scale feature representations efficiently with dilated depthwise temporal convolution, modeling both the temporal context and the speech rate variation. We use a region proposal network (RPN) as the first-stage KWS. At each frame, we design multiple time regions, which all take the current frame as the end position but have different start positions. These time regions (formally, anchors) indicate rough location candidates of keywords. With frame-level features from the MDTC feature extractor as inputs, the RPN learns to propose keyword region proposals based on the designed anchors. To alleviate the keyword/non-keyword class imbalance problem, we specifically introduce a hard example mining algorithm to select effective negative anchors in RPN training. The keyword region proposals from the first-stage RPN contain keyword location information, which is subsequently used to explicitly extract keyword-related sequential features to train the second-stage KWS. The second-stage system learns to classify each region proposal into keyword IDs and to transform it to the ground-truth keyword region. Experiments on the Google Speech Command dataset show that the proposed MDTC feature extractor surpasses several competitive feature extractors with a new state-of-the-art command classification error rate of 1.74%. With the MDTC feature extractor, we further conduct wake-up word (WuW) detection and localization experiments on a commercial WuW dataset. Compared to a strong baseline, our proposed two-stage method achieves a 27–32% relative reduction in false rejection rate at one false alarm per hour, while for keyword localization, the two-stage approach achieves a mean intersection-over-union ratio above 0.95, which is clearly better than the one-stage RPN method.

Introduction

Keyword spotting (KWS), or specifically wake-up word (WuW) detection, has been widely used in popular smart devices, e.g., smart speakers, mobile phones and earphones. A KWS system usually runs locally and persistently on these devices. Due to the resource limitations of edge devices, a KWS system is desired to be small-footprint and computationally efficient, i.e., it should have as few model parameters and use as few calculations as possible.

One main mission of a KWS system is to accurately detect predefined keywords or key-phrases. A KWS system is typically evaluated by false rejections (FRs) and false alarms (FAs). An FR prevents users from getting feedback when they need to wake up a device; an FA annoyingly disturbs users when they do not need the device to respond. Another important mission of a KWS system is to accurately locate the keywords in audio streams. Exact localization of keywords can provide valuable information for downstream tasks. For example, in a voice-controlled personal device, segments of keywords are first detected and then sent to a speaker verification module for user authentication (Jia et al., 2021, Sigtia et al., 2020, Zhang et al., 2016). Accurate keyword segments have also been shown to benefit speech recognition (Maas et al., 2016, Wang, Fan, et al., 2019).

KWS or WuW detection was previously dominated by keyword/filler hidden Markov model (HMM) based methods (Rohlicek et al., 1989, Rose and Paul, 1990, Silaghi, 2005, Silaghi and Bourlard, 1999, Wilpon et al., 1991), where an HMM, often based on the keyword phone sequence, is built to model the keyword, while another HMM, usually via a phone loop, is built to model non-keyword speech segments. With the powerful modeling capabilities of deep neural networks (DNNs) (Hinton et al., 2012), recent works have replaced Gaussian mixture models (GMMs) with DNNs to predict the state emission likelihoods in keyword/filler HMM-based methods (Guo et al., 2018, Kumatani et al., 2017, Panchapagesan et al., 2016, Sun et al., 2017, Wu et al., 2018).

Chen et al. first proposed the deep KWS paradigm (Chen, Parada, & Heigold, 2014), which uses a DNN to predict word posterior probabilities and a sliding-window based posterior-handling method to determine the existence of keywords in audio streams. Deep KWS shows superior performance compared to the conventional keyword/filler HMM approach in small-footprint KWS scenarios (Chen et al., 2014) and has attracted much attention recently (Li et al., 2018, Prabhavalkar et al., 2015, Sainath and Parada, 2015). Under the deep KWS paradigm, some of the latest KWS methods directly take keywords, rather than smaller units like phones, as modeling units, using DNNs to detect their appearance in audio streams. This kind of method is often called end-to-end (E2E) KWS (Alvarez and Park, 2019, Coucke et al., 2019, Lengerich and Hannun, 2016, Shan et al., 2018, Yu et al., 2020), because it directly takes the keyword classification accuracy as the ultimate optimization goal and does not need the sliding-window based posterior-handling or a complicated decoding method to obtain the occurrence probability of keywords in audio streams. A good acoustic model is essential to the success of an E2E KWS system. Here, we divide an acoustic model of E2E KWS into two parts: a backbone network or NN-based feature extractor and a keyword detection module. To learn feature representations of continuous speech well, different NN architectures have been proposed for E2E KWS, including recurrent neural networks (RNNs) (Arık et al., 2017, Chung et al., 2014, Hochreiter and Schmidhuber, 1997, Shan et al., 2018, Sun et al., 2016, Woellmer et al., 2013), convolutional neural networks (CNNs) (Alvarez and Park, 2019, Coucke et al., 2019, He et al., 2016, Howard et al., 2017, Rybakov et al., 2020, van den Oord et al., 2016, Xu and Zhang, 2020, Zhang et al., 2017) and combinations of RNNs and CNNs (Arık et al., 2017, Shan et al., 2018, Zeng and Xiao, 2019). The keyword detection module, built upon the backbone network of an acoustic model, defines different training objectives and methods for E2E KWS. Some methods use a sequence model to learn the keyword/non-keyword sequence (Arık et al., 2017, He et al., 2017, Li et al., 2018, Wang et al., 2020, Woellmer et al., 2013). Other methods assume a trigger position in the keyword, either automatically learned via max-pooling (Hou et al., 2020, Park et al., 2020, Sun et al., 2016) or attention (Shan et al., 2018), or simply specified as the last several frames of the keyword (Alvarez and Park, 2019, Coucke et al., 2019, Higuchi et al., 2020, Zhang et al., 2018).
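For concreteness, the sliding-window posterior handling used in the deep KWS paradigm can be sketched as follows. This is an illustrative Python/NumPy toy rather than the exact procedure of Chen et al. (2014): the window lengths, the threshold and the confidence formula are assumptions for exposition. Frame-level keyword posteriors are smoothed over a short window, and a detection fires when a confidence computed over a longer window exceeds a threshold.

```python
import numpy as np

def sliding_window_kws(posteriors, smooth_win=30, conf_win=100, threshold=0.8):
    """Toy posterior handling for deep KWS (illustrative only).

    posteriors: (T, K) frame-level posteriors of the K keyword classes
                (the filler/non-keyword class is excluded).
    Returns the first frame index at which a detection fires, or None.
    """
    T, K = posteriors.shape
    smoothed = np.zeros_like(posteriors)
    for t in range(T):
        s = max(0, t - smooth_win + 1)
        smoothed[t] = posteriors[s:t + 1].mean(axis=0)   # smooth each keyword class
    for t in range(T):
        s = max(0, t - conf_win + 1)
        # confidence: geometric mean of the per-class maxima in the window
        confidence = smoothed[s:t + 1].max(axis=0).prod() ** (1.0 / K)
        if confidence > threshold:
            return t
    return None
```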

Most of the above acoustic modeling methods solely aim at eliminating FAs and FRs of KWS. They do not explicitly consider predicting precise keyword locations. As we mentioned before, exact localization of keywords is also important. In a preliminary version of this work (Hou, Shi, Ostendorf, Hwang, & Xie, 2019), we specifically proposed a region proposal network (RPN) based KWS method to accurately predict keyword locations. Specifically, according to the prior knowledge of keyword duration, we design a pool of anchor-regions (or simply called anchors) to indicate rough location candidates of keywords. Anchors at each frame take the current frame as the end position and have different start positions. The RPN then aims to classify each anchor into keyword IDs and transform each anchor to the ground-truth keyword region. The RPN-based KWS method achieved significantly better detection performance compared with a strong end-of-keyword labeling method (Coucke et al., 2019). However, we find this method still has substantial room for improvement. Firstly, an RNN with gated recurrent units (GRUs) is used in the RPN-based KWS method (Hou et al., 2019) to model long/short-range dependency feature representations, which is proven to be inefficient compared to the convolution family in the KWS task (Alvarez and Park, 2019, Coucke et al., 2019, Higuchi et al., 2020, Li et al., 2020, Zhang et al., 2017). Secondly, as a preliminary work, in Hou et al. (2019), we use a simple random down-sampling strategy to deal with the class imbalance problem. The problem of imbalanced keyword/non-keyword samples has recently been identified as a major obstacle to KWS performance (Hou et al., 2020, Liu et al., 2019, López-Espejo et al., 2021). Last but not least, in our previous design, we simply combine keyword classification and keyword localization through a multi-task training objective, while the valuable predicted keyword location is not explicitly used to further assist keyword classification.
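To make the anchor design concrete, the sketch below enumerates, for each frame, candidate regions that all end at the current frame but start at different offsets. The anchor lengths used here are illustrative assumptions, not the actual anchor configuration of the paper.

```python
def generate_anchors(num_frames, anchor_lengths=(40, 60, 80, 100, 120)):
    """Enumerate anchor regions as (start_frame, end_frame) pairs.

    All anchors attached to frame t end at t and differ only in the assumed
    keyword duration (in frames); the lengths here are illustrative.
    """
    anchors = []
    for t in range(num_frames):
        for length in anchor_lengths:
            start = max(0, t - length + 1)
            anchors.append((start, t))  # rough keyword location candidate
    return anchors

# Example: 5 anchors per frame over a 300-frame utterance.
print(len(generate_anchors(300)))  # 1500
```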

In this paper, we aim to improve both detection and localization accuracy of the RPN based KWS approach by solving the above problems, while maintaining the compact model footprint and computational efficiency.

Multi-scale depthwise temporal convolution (MDTC) feature extractor. We first propose a multi-scale depthwise temporal convolution (MDTC) feature extractor, where stacked dilated depthwise one-dimensional (1-D) convolutions are adopted to efficiently model long-range dependencies of speech. With a large receptive field and a small number of model parameters, this structure can model the temporal context of speech effectively. To further improve robustness against speech rate variations, we extract multi-scale features from different hidden layers of MDTC with different receptive fields. On the Google Speech Command dataset (Warden, 2018), with only 154K model parameters and 15.1M multiplications per second in inference, the proposed MDTC feature extractor achieves state-of-the-art (SOTA) command classification performance with an error rate of 1.74%, which surpasses a competitive depthwise separable convolutional ResNet method (Xu & Zhang, 2020) by a large margin.
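A minimal PyTorch-style sketch of this idea is given below: depthwise 1-D convolutions with growing dilation enlarge the receptive field cheaply, and features tapped at different depths are fused as a multi-scale representation. The layer sizes, dilation schedule and fusion by summation are illustrative assumptions, not the exact MDTC architecture.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DepthwiseTemporalBlock(nn.Module):
    """Causal dilated depthwise 1-D conv followed by a pointwise conv."""
    def __init__(self, channels, kernel_size=5, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation       # left padding for causality
        self.depthwise = nn.Conv1d(channels, channels, kernel_size,
                                   dilation=dilation, groups=channels)
        self.pointwise = nn.Conv1d(channels, channels, 1)

    def forward(self, x):                             # x: (batch, channels, time)
        y = F.pad(x, (self.pad, 0))                   # pad only the past side
        y = torch.relu(self.pointwise(self.depthwise(y)))
        return x + y                                  # residual connection

class MDTCSketch(nn.Module):
    """Stacks blocks with growing dilation and fuses features taken at
    different depths (i.e., different receptive fields)."""
    def __init__(self, in_dim=80, channels=64, dilations=(1, 2, 4, 8)):
        super().__init__()
        self.proj = nn.Conv1d(in_dim, channels, 1)
        self.blocks = nn.ModuleList(
            [DepthwiseTemporalBlock(channels, dilation=d) for d in dilations])

    def forward(self, feats):                         # feats: (batch, in_dim, time)
        x = self.proj(feats)
        scales = []
        for block in self.blocks:
            x = block(x)
            scales.append(x)                          # multi-scale feature tap
        return torch.stack(scales).sum(dim=0)         # fuse scales by summation

# Example: 80-dim filterbank features, 100 frames -> (1, 64, 100) output.
out = MDTCSketch()(torch.randn(1, 80, 100))
```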

Two-stage keyword detection and localization. We further improve our previous RPN-based method by proposing a two-stage strategy. The basic idea is inspired by two-stage object detection methods in computer vision (Dai et al., 2016, Ren et al., 2015, Shrivastava et al., 2016). In detail, the RPN is taken as the first stage to output keyword proposals, and the second stage is adopted to further refine the proposals to obtain the final decision on keyword detection and localization. To control the ratio of negative vs. positive anchors in the first-stage RPN training, we introduce a hard example mining (HEM) algorithm to select effective negative training anchors. With the predicted region proposals of keywords, we explicitly extract keyword-related sequential features to subsequently train a second-stage KWS system. The second-stage system learns to classify each region proposal into keyword IDs and transform the region proposal to the ground-truth keyword region. To specifically deal with the class imbalance problem in the second-stage training, the proposed two-stage KWS introduces a way to select effective region proposals, with which the second-stage system can better distinguish positive and hard negative region proposals. Unlike detecting objects in static images, our proposed method needs to detect keywords in streaming audio, so we propose a streaming non-maximum suppression (sNMS) algorithm to enable streaming detection with our two-stage method. We conduct WuW detection experiments on a commercial WuW detection dataset. Consistently across different feature extractors, the proposed two-stage KWS method significantly outperforms its corresponding one-stage RPN KWS method in both keyword detection and localization. The proposed two-stage method achieves a 27–32% relative reduction in false rejection rate (FRR) over a strong max-pooling based method (Hou et al., 2020) at one FA per hour.
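The hard example mining step can be pictured as keeping only the negative anchors that the current model finds most confusing. The sketch below ranks negative anchors by their predicted keyword confidence and keeps a fixed negative-to-positive ratio; the ratio and the scoring detail are illustrative assumptions, not the exact HEM algorithm used in the paper.

```python
import numpy as np

def mine_hard_negatives(neg_scores, num_positives, neg_pos_ratio=5):
    """Select indices of the hardest negative anchors (illustrative).

    neg_scores: predicted keyword confidences of the negative anchors; a high
                score on a negative anchor means the model is confused by it,
                i.e. it is a "hard" negative worth training on.
    """
    num_keep = min(len(neg_scores), neg_pos_ratio * max(num_positives, 1))
    order = np.argsort(neg_scores)[::-1]      # most confusing negatives first
    return order[:num_keep]

# Example: keep at most 5 negatives per positive anchor.
idx = mine_hard_negatives(np.array([0.1, 0.9, 0.3, 0.7]), num_positives=1)
print(idx)  # [1 3 2 0]: all four kept here, hardest first
```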

We summarize the main contributions of this paper as follows:

  • We propose a novel small-footprint and computation-efficient MDTC feature extractor for KWS, which learns multi-scale feature representation efficiently with dilated depthwise 1-D convolution.

  • We propose a novel two-stage KWS approach where the keyword proposals output by the first-stage RPN are subsequently refined by the second stage, which classifies the proposals into keyword IDs and transforms them to the ground-truth keyword regions. The proposed method manages to address the class-imbalance problem in both stages and to detect and locate keywords in audio streams accurately.

  • SOTA performances are achieved by the proposed MDTC feature extractor and the two-stage KWS method, confirmed on the Google Speech Command classification benchmark and a commercial WuW detection task, respectively.

The rest of the paper is organized as follows. We review related works in Section 2. In Section 3, we illustrate the overall design of the proposed method. In Section 4 and Section 5, the proposed MDTC feature extractor and two-stage KWS method are introduced in detail. We conduct speech command classification and WuW detection experiments in Section 6 and Section 7, respectively. Finally, we conclude the work in Section 8. Our code will be made publicly available after peer review.

Section snippets

Related works

In this section, we will first review related works on NN-based feature extractors in KWS. Then, literature related to keyword localization and two-stage KWS will also be reviewed.

System overview

An overview of our proposed method is illustrated in Fig. 1. It mainly consists of three modules: a multi-scale depthwise temporal convolution (MDTC) feature extractor, a region proposal network (RPN) as the first stage of the KWS system, and an extra keyword detection and localization module (keyword checker) as the second stage of the KWS system. The MDTC feature extractor and the first-stage model are trained first, and then their parameters are frozen to train the second-stage model.
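As a minimal sketch of this staged training, assuming PyTorch and using simple placeholder modules in place of the actual architectures, the feature extractor and first-stage parameters are frozen before the second stage is optimized:

```python
import torch
import torch.nn as nn

# Placeholder modules standing in for the three components in Fig. 1
# (hypothetical stand-ins, not the actual architectures).
mdtc = nn.Conv1d(80, 64, kernel_size=5)        # stands in for the MDTC extractor
rpn_stage1 = nn.Linear(64, 6)                  # stands in for the first-stage RPN
checker_stage2 = nn.Linear(64, 6)              # stands in for the keyword checker

# Stage 1: the feature extractor and the first-stage RPN are trained jointly.
stage1_params = list(mdtc.parameters()) + list(rpn_stage1.parameters())
opt_stage1 = torch.optim.Adam(stage1_params, lr=1e-3)
# ... run the stage-1 training loop with opt_stage1 ...

# Stage 2: freeze all stage-1 parameters, then optimize only the second stage.
for p in stage1_params:
    p.requires_grad_(False)
opt_stage2 = torch.optim.Adam(checker_stage2.parameters(), lr=1e-3)
# ... run the stage-2 training loop with opt_stage2 ...
```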


MDTC feature extractor

In this section, we will first introduce depthwise temporal convolution (dilated-depth TCN), an important component of our proposed MDTC feature extractor. Then, the MDTC feature extractor with dilated-depth TCN will be introduced in detail. The model parameters and computational complexity will also be analyzed.

Two-stage keyword detection and localization

The proposed two-stage keyword detection and localization method will be introduced in the following three subsections. We will first introduce some prior knowledge used in both the first and second stages. After that, the RPN-based first-stage KWS module will be introduced. Different from our previous work (Hou et al., 2019), here we propose an exponential-growth anchor design in Section 5.2.1 and a new training anchor selection strategy in Section 5.2.4. Lastly, we will introduce the newly designed second-stage keyword detection and localization module.

Speech command classification experiment

To evaluate the feature representation ability of our proposed MDTC feature extractor, a speech command classification experiment is conducted on the Google Speech Command benchmark dataset (Warden, 2018). In this section, we first introduce the Google Speech Command dataset, then the implementation details and experimental results.

Wake-up word detection experiment

To evaluate the proposed two-stage keyword detection and localization method, a WuW detection experiment is conducted on a commercial WuW corpus. In this section, we introduce the WuW corpus, the implementation details with different NN feature extractors and KWS methods, and the experimental results.

Conclusions

To summarize, in this paper, aiming at improving both keyword detection and localization performance for small-footprint and computationally efficient KWS, we propose a two-stage KWS approach with a novel MDTC feature extractor. The proposed MDTC feature extractor can model long-range context with efficient dilated depthwise temporal convolution and improve robustness against speech rate variations by fusing features of different time scales. In the two-stage KWS, an RPN is taken as the first stage to output keyword region proposals, which are subsequently refined by the second stage to obtain the final keyword detection and localization decisions.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work is supported by the MoE-CMCC “Artificial Intelligence” project, China (MCM20190701).

References (74)

  • Alvarez, R., & Park, H.-J. (2019). End-to-end streaming keyword spotting. In Proc. ICASSP (pp....
  • Arık, S. Ö., Kliegl, M., Child, R., Hestness, J., Gibiansky, A., Fougner, C., et al. (2017). Convolutional Recurrent...
  • Bahdanau, D., et al. (2014). Neural machine translation by jointly learning to align and translate.
  • Bai, S., et al. (2018). An empirical evaluation of generic convolutional and recurrent networks for sequence modeling.
  • Bai, Y., Yi, J., Tao, J., Wen, Z., Tian, Z., Zhao, C., et al. (2019). A Time Delay Neural Network with Shared Weight...
  • Chen, G., Parada, C., & Heigold, G. (2014). Small-footprint keyword spotting using deep neural networks. In Proc....
  • Choi, S., Seo, S., Shin, B., Byun, H., Kersner, M., Kim, B., et al. (2019). Temporal Convolution for Real-Time Keyword...
  • Chung, J., et al. (2014). Empirical evaluation of gated recurrent neural networks on sequence modeling.
  • Coucke, A., Chlieh, M., Gisselbrecht, T., Leroy, D., Poumeyrol, M., & Lavril, T. (2019). Efficient keyword spotting...
  • Dai, J., Li, Y., He, K., & Sun, J. (2016). R-FCN: Object detection via region-based fully convolutional networks. In...
  • de Andrade, D. C., et al. (2018). A neural attention model for speech command recognition.
  • Fernández, S., Graves, A., & Schmidhuber, J. (2007). An application of recurrent neural networks to discriminative...
  • Guo, J., Kumatani, K., Sun, M., Wu, M., Raju, A., Ström, N., et al. (2018). Time-delayed bottleneck highway networks...
  • He, Y., Prabhavalkar, R., Rao, K., Li, W., Bakhtin, A., & McGraw, I. (2017). Streaming Small-Footprint Keyword Spotting...
  • He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep Residual Learning for Image Recognition. In Proc. CVPR (pp....
  • Higuchi, T., Ghasemzadeh, M., You, K., & Dhir, C. (2020). Stacked 1D Convolutional Networks for End-to-End Small...
  • Hinton, G., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Processing Magazine.
  • Hochreiter, S., et al. (1997). Long short-term memory. Neural Computation.
  • Hou, J., et al. (2019). Region proposal network based small-footprint keyword spotting. IEEE Signal Processing Letters.
  • Hou, J., Shi, Y., Ostendorf, M., Hwang, M.-Y., & Xie, L. (2020). Mining effective negative training samples for keyword...
  • Howard, A. G., et al. (2017). MobileNets: Efficient convolutional neural networks for mobile vision applications.
  • Ibrahim, E. A., Huisken, J., Fatemi, H., & de Gyvez, J. P. (2019). Keyword spotting using time-domain features in a...
  • López-Espejo, I., Tan, Z.-H., & Jensen, J. (2021). Exploring Filterbank Learning for Keyword Spotting. In Proc. EUSIPCO (pp....
  • Jia, Y., Wang, X., Qin, X., Zhang, Y., Wang, X., Wang, J., et al. (2021). The 2020 Personalized Voice Trigger...
  • Jose, C., Mishchenko, Y., Sénéchal, T., Shah, A., Escott, A., & Vitaladevuni, S. N. P. (2020). Accurate Detection of...
  • Kumatani, K., Panchapagesan, S., Wu, M., Kim, M., Strom, N., Tiwari, G., et al. (2017). Direct modeling of raw audio...
  • Lengerich, C., et al. (2016). An end-to-end architecture for keyword spotting and voice activity detection.
  • Li, X., Wei, X., & Qin, X. (2020). Small-Footprint Keyword Spotting with Multi-Scale Temporal Convolution. In Proc....
  • Li, J., Zhao, R., Chen, Z., Liu, C., Xiao, X., Ye, G., et al. (2018). Developing far-field speaker system via...
  • Liu, B., Nie, S., Zhang, Y., Liang, S., Yang, Z., & Liu, W. (2019). Loss and Double-edge-triggered Detector for Robust...
  • López-Espejo, I., et al. (2021). Deep spoken keyword spotting: An overview.
  • Maas, R., Parthasarathi, S. H. K., King, B., Huang, R., & Hoffmeister, B. (2016). Anchored Speech Detection. In Proc....
  • Maekaku, T., Kida, Y., & Sugiyama, A. (2019). Simultaneous Detection and Localization of a Wake-Up Word Using...
  • Majumdar, S., & Ginsburg, B. (2020). MatchboxNet: 1D Time-Channel Separable Convolutional Neural Network Architecture...
  • Nakkiran, P., Alvarez, R., Prabhavalkar, R., & Parada, C. (2015). Compressing Deep Neural Networks Using a...
  • Neubeck, A., & Van Gool, L. (2006). Efficient Non-Maximum Suppression. In Proc. ICPR (pp....
  • Panchapagesan, S., Sun, M., Khare, A., Matsoukas, S., Mandal, A., Hoffmeister, B., et al. (2016). Multi-task learning...