Spoken keyword search system using improved ASR engine and novel template-based keyword scoring

Rebai, Ilyes; Ayed, Yassine Ben; Mahdi, Walid

doi:10.1007/s11042-018-6276-y

Published: 25 June 2018

Volume 78, pages 1495–1510, (2019)

Multimedia Tools and Applications

Abstract

Keyword search in spoken documents has become increasingly important as the amount of spoken data grows. A typical system combines an Automatic Speech Recognition (ASR) system with information retrieval methods. Although many studies have sought to optimize system performance, KeyWord Search (KWS) systems still suffer from two main drawbacks. First, system performance depends strongly on the ASR transcripts, which are inherently inexact: because of the variability of the speech signal, ASR systems remain far from error-free. Second, KWS systems make detection decisions based on lattice-based posterior probabilities, which are not comparable across keywords. In addition, the posterior probabilities of true detections usually fall into different ranges, which degrades spotting performance. This paper addresses both problems: the quality of ASR transcriptions and keyword detection decisions based on posterior probabilities. More specifically, we propose to improve the accuracy of ASR transcripts with a new ASR architecture that integrates data augmentation and ensemble learning techniques into a single framework. We also propose a novel keyword rescoring method that provides scores from a new perspective. Inspired by the template-based KWS approach, it computes similarity scores between the detected keywords from the distance between their acoustic features and uses them as new scores for the detection decision. Experiments on French and English datasets show that the proposed KWS system can yield more accurate keyword results than conventional systems.
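The rescoring idea in the abstract (scoring each detection by how close its acoustic features are to those of other detections of the same keyword) can be illustrated with a minimal sketch. The abstract does not name the distance measure or the score formula, so everything here is an assumption made for illustration: dynamic time warping (DTW) with a Euclidean frame distance, the mapping `1 / (1 + distance)` from distance to similarity, and the function names `frame_distance`, `dtw_distance`, and `similarity_scores`. It is not the authors' method.

```python
import math


def frame_distance(a, b):
    """Euclidean distance between two feature frames of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_distance(seq_a, seq_b):
    """Length-normalized DTW alignment cost between two frame sequences.

    Assumption: DTW stands in for the unspecified acoustic distance.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    # cost[i][j] = best cumulative cost aligning seq_a[:i] with seq_b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m] / (n + m)


def similarity_scores(detections):
    """Score each detection by its mean similarity to the other detections.

    Assumption: similarity = 1 / (1 + DTW distance); a lone detection gets 0.0.
    """
    scores = []
    for i, det in enumerate(detections):
        others = [d for j, d in enumerate(detections) if j != i]
        if not others:
            scores.append(0.0)
            continue
        sims = [1.0 / (1.0 + dtw_distance(det, other)) for other in others]
        scores.append(sum(sims) / len(sims))
    return scores


# Toy feature sequences (hypothetical 2-dimensional frames, not real MFCCs).
hit_a = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
hit_b = [[0.0, 0.0], [1.0, 1.0], [1.0, 1.0], [2.0, 2.0]]  # same shape, slower
outlier = [[9.0, 9.0], [8.0, 8.0]]                        # likely false alarm
scores = similarity_scores([hit_a, hit_b, outlier])
```

In this toy setup the two time-warped copies of the same pattern receive higher scores than the outlier, which is the behavior a detection threshold on such scores would rely on.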

Notes

$\beta = \frac{C}{V}\,(Pr^{-1} - 1)$, where $C = 0.1$ is the cost of a false detection, $V = 1$ is the value of a correct detection, and $Pr = 10^{-4}$ is the prior probability of a keyword.
LibriVox is a group of volunteers who read and record public-domain texts, offering approximately 25,000 public-domain audiobooks for download. Link: https://librivox.org
NIST OpenKWS13 evaluation plan: https://www.nist.gov/sites/default/files/documents/itl/iad/mig/OpenKWS13-EvalPlan.pdf
F4DE is the NIST scoring toolkit, a set of evaluation tools for detection evaluations, including the keyword spotting task: https://github.com/usnistgov/F4DE
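
As a worked example of the first note, the sketch below evaluates β for the stated constants (C = 0.1, V = 1, Pr = 10^-4), which gives approximately 999.9. The second function assumes that β is used as the false-alarm weight in the NIST term-weighted value, TWV = 1 − (P_miss + β · P_FA); the note itself does not state that formula, and the function and parameter names are hypothetical.

```python
def beta(cost_false_detection=0.1, value_correct_detection=1.0, keyword_prior=1e-4):
    """Weight beta = (C / V) * (1 / Pr - 1), as defined in the first note."""
    return (cost_false_detection / value_correct_detection) * (1.0 / keyword_prior - 1.0)


def term_weighted_value(p_miss, p_false_alarm, beta_value):
    """Assumed NIST-style term-weighted value: 1 - (P_miss + beta * P_FA)."""
    return 1.0 - (p_miss + beta_value * p_false_alarm)


b = beta()  # approximately 999.9 for the constants in the note

# Hypothetical operating point: 30% misses, false-alarm probability of 1e-5.
twv = term_weighted_value(p_miss=0.30, p_false_alarm=1e-5, beta_value=b)
```

Because β is close to 1000, even a very small false-alarm probability costs as much as a sizeable miss rate, which is why detection thresholds matter so much for this metric.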

Author information

Authors and Affiliations

MIRACL: Multimedia InfoRmation System and Advanced Computing Laboratory, University of Sfax, Sfax, Tunisia
Ilyes Rebai & Yassine Ben Ayed
College of Computers and Information Technology, Taif University, Taif, Saudi Arabia
Walid Mahdi

Corresponding author

Correspondence to Ilyes Rebai.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Rebai, I., Ayed, Y.B. & Mahdi, W. Spoken keyword search system using improved ASR engine and novel template-based keyword scoring. Multimed Tools Appl 78, 1495–1510 (2019). https://doi.org/10.1007/s11042-018-6276-y

Received: 17 November 2017
Revised: 24 April 2018
Accepted: 15 June 2018
Published: 25 June 2018
Issue Date: January 2019
DOI: https://doi.org/10.1007/s11042-018-6276-y
