Spoken keyword search system using improved ASR engine and novel template-based keyword scoring

Multimedia Tools and Applications

Abstract

Keyword search in spoken documents has become increasingly important as the amount of spoken data grows. A typical system combines an Automatic Speech Recognition (ASR) system with information retrieval methods. Although many studies have sought to optimize system performance, KeyWord Search (KWS) systems still suffer from two main drawbacks. First, system performance depends strongly on the ASR transcripts, which are inherently inexact: because of the variability of the speech signal, ASR systems remain far from error-free. Second, KWS systems make detection decisions based on lattice-based posterior probabilities, which are not comparable across keywords. In addition, the posterior probabilities of true detections usually fall into different ranges, which degrades spotting performance. This paper addresses both problems: the quality of ASR transcriptions and keyword detection decisions based on posterior probabilities. More specifically, we propose to improve the accuracy of the ASR transcripts with a new ASR architecture that integrates data augmentation and ensemble learning techniques into a single framework. We also propose a novel keyword rescoring method that provides scores from a new perspective. Inspired by the template-based KWS approach, similarity scores between the detected keywords are obtained by computing the distance between their acoustic features, and these are used as new scores for the detection decision. Experiments on French and English datasets show that the proposed KWS system can yield more accurate keyword results than conventional systems.
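The rescoring idea can be illustrated with a minimal sketch. The abstract does not say which distance or aggregation the authors use, so everything here is an assumption: dynamic time warping (DTW) with a Euclidean frame distance stands in for "the distance between the acoustic features", and the hypothetical `rescore` helper averages each detection's distance to the other detections of the same keyword. Lower values suggest a detection resembles its peers more closely.

```python
import math


def frame_distance(x, y):
    """Euclidean distance between two acoustic feature frames (assumed metric)."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))


def dtw_distance(a, b):
    """Length-normalised DTW distance between two sequences of feature frames."""
    n, m = len(a), len(b)
    inf = float("inf")
    # acc[i][j] = cheapest cost of aligning a[:i] with b[:j]
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = frame_distance(a[i - 1], b[j - 1])
            acc[i][j] = cost + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


def rescore(detections):
    """Hypothetical rescoring: mean DTW distance from each detection to the others.

    `detections` is a list of feature sequences, one per hypothesised
    occurrence of the same keyword. Returns None for a lone detection.
    """
    scores = []
    for i, det in enumerate(detections):
        dists = [dtw_distance(det, other)
                 for j, other in enumerate(detections) if j != i]
        scores.append(sum(dists) / len(dists) if dists else None)
    return scores


if __name__ == "__main__":
    # Toy one-dimensional "features": two similar detections and one outlier.
    hit_a = [[0.0], [1.0], [2.0]]
    hit_b = [[0.0], [1.0], [1.0], [2.0]]
    outlier = [[5.0], [6.0], [7.0]]
    print(rescore([hit_a, hit_b, outlier]))
```

In this toy run the outlier receives the largest mean distance, which is the behaviour a distance-based decision score would rely on; a real system would operate on acoustic features such as MFCC frames and would need a calibrated threshold.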

Notes

  1. $\beta = \frac{C}{V}\left(\mathrm{Pr}^{-1} - 1\right)$, where $C = 0.1$ is the cost of a false detection, $V = 1$ is the value of a correct detection, and $\mathrm{Pr} = 10^{-4}$ is the prior probability of a keyword.

  2. LibriVox is a group of volunteers who read and record public domain texts, producing approximately 25,000 public domain audiobooks available for download. Link: https://librivox.org

  3. https://www.nist.gov/sites/default/files/documents/itl/iad/mig/OpenKWS13-EvalPlan.pdf

  4. F4DE is the NIST scoring toolkit, a set of tools for detection evaluations, including the keyword spotting task. Link: https://github.com/usnistgov/F4DE
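As a quick check of the constants in note 1, the sketch below evaluates β and shows how it is typically used. The term-weighted value formula, TWV = 1 − (P_miss + β · P_FA), is taken from the NIST keyword search evaluation conventions rather than from the text above, and the function and parameter names are illustrative assumptions.

```python
def beta(cost_false_detection=0.1, value_correct_detection=1.0, prior=1e-4):
    """Beta from note 1: (C / V) * (1 / Pr - 1)."""
    return (cost_false_detection / value_correct_detection) * (1.0 / prior - 1.0)


def term_weighted_value(p_miss, p_false_alarm, beta_value=None):
    """TWV = 1 - (P_miss + beta * P_FA); assumed NIST-style definition."""
    if beta_value is None:
        beta_value = beta()
    return 1.0 - (p_miss + beta_value * p_false_alarm)


if __name__ == "__main__":
    # With C = 0.1, V = 1 and Pr = 1e-4, beta is approximately 999.9.
    print(beta())
    print(term_weighted_value(p_miss=0.3, p_false_alarm=1e-5))
```

With these constants β is approximately 999.9, so under the assumed TWV definition a small false-alarm probability is penalised roughly a thousand times more heavily than the same miss probability.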

Author information

Corresponding author

Correspondence to Ilyes Rebai.

About this article

Cite this article

Rebai, I., Ayed, Y.B. & Mahdi, W. Spoken keyword search system using improved ASR engine and novel template-based keyword scoring. Multimed Tools Appl 78, 1495–1510 (2019). https://doi.org/10.1007/s11042-018-6276-y

  • DOI: https://doi.org/10.1007/s11042-018-6276-y
