Recent developments in spoken term detection: a survey

International Journal of Speech Technology

Abstract

Spoken term detection (STD) provides an efficient means for content-based indexing of speech. However, achieving high detection performance and higher speed, detecting out-of-vocabulary (OOV) words, and performing STD on low-resource languages are some of the major research challenges. This paper provides a comprehensive survey of the important approaches in the area of STD and of how they address these challenges. The review classifies these approaches, highlights their advantages and limitations, and discusses their context of usage. It also analyzes the various approaches in terms of detection accuracy, storage requirements and execution time. The paper summarizes the tools and speech corpora used in the different approaches. Finally, it concludes with future research directions in this area.

Author information

Corresponding author

Correspondence to Anupam Mandal.

About this article

Cite this article

Mandal, A., Prasanna Kumar, K.R. & Mitra, P. Recent developments in spoken term detection: a survey. Int J Speech Technol 17, 183–198 (2014). https://doi.org/10.1007/s10772-013-9217-1

  • DOI: https://doi.org/10.1007/s10772-013-9217-1