Abstract
Spoken term detection (STD) provides an efficient means for content-based indexing of speech. However, achieving high detection performance and speed, detecting out-of-vocabulary (OOV) words, and performing STD on low-resource languages remain major research challenges. This paper provides a comprehensive survey of the important approaches to STD and of how they address these challenges. The review classifies these approaches, highlights their advantages and limitations, and discusses their contexts of use. It also analyzes the approaches in terms of detection accuracy, storage requirements, and execution time. The paper summarizes the tools and speech corpora used in the different approaches. Finally, it concludes with future research directions in this area.
Cite this article
Mandal, A., Prasanna Kumar, K.R. & Mitra, P. Recent developments in spoken term detection: a survey. Int J Speech Technol 17, 183–198 (2014). https://doi.org/10.1007/s10772-013-9217-1
DOI: https://doi.org/10.1007/s10772-013-9217-1