Recent developments in spoken term detection: a survey

Mandal, Anupam; Prasanna Kumar, K. R.; Mitra, Pabitra

doi:10.1007/s10772-013-9217-1

Published: 14 December 2013

Volume 17, pages 183–198 (2014)

International Journal of Speech Technology

Anupam Mandal¹,
K. R. Prasanna Kumar² &
Pabitra Mitra¹

Abstract

Spoken term detection (STD) provides an efficient means for content-based indexing of speech. However, achieving high detection performance and speed, detecting out-of-vocabulary (OOV) words, and performing STD on low-resource languages remain major research challenges. This paper provides a comprehensive survey of the important approaches to STD and how they address these challenges. The review classifies the approaches, highlights their advantages and limitations, and discusses the contexts in which they are used. It also analyzes the approaches in terms of detection accuracy, storage requirements, and execution time, and summarizes the tools and speech corpora they use. Finally, it concludes with future research directions in this area.

Author information

Authors and Affiliations

Department of Computer Science and Engineering, Indian Institute of Technology, Kharagpur, India
Anupam Mandal & Pabitra Mitra
Center for AI & Robotics, Bangalore, India
K. R. Prasanna Kumar

Corresponding author

Correspondence to Anupam Mandal.

About this article

Cite this article

Mandal, A., Prasanna Kumar, K.R. & Mitra, P. Recent developments in spoken term detection: a survey. Int J Speech Technol 17, 183–198 (2014). https://doi.org/10.1007/s10772-013-9217-1

Received: 18 July 2013
Accepted: 21 November 2013
Published: 14 December 2013
Issue Date: June 2014
DOI: https://doi.org/10.1007/s10772-013-9217-1
