Abstract
Automatic Speech Recognition (ASR) has become increasingly popular, since it significantly simplifies human-computer interaction and provides a more intuitive way of communicating. Building an accurate, general-purpose ASR system is a challenging task that requires large amounts of data and computing power. Especially for less widely spoken languages, such as Greek, the lack of sufficiently large speech datasets leads to ASR systems adapted to a restricted corpus and/or specific topics. When used in narrow domains, such systems can be both accurate and fast, without requiring large datasets or extended training. An interesting application domain of such narrow-scope ASR systems is the development of personalized models for dictation. In the current work we present three personalization-via-adaptation modules that can be integrated into any ASR/dictation system to increase its accuracy. The adaptation can be applied both to the language model (based on past text samples of the user) and to the acoustic model (using a set of the user's narrations). To provide more precise recommendations, clustering algorithms are applied and topic-specific language models are created. Heterogeneous adaptation methods are also combined to provide recommendations to the user. Evaluation on a self-created dataset of 746 text samples drawn from the same user's messaging applications and e-mails demonstrates that the proposed approach achieves better results than the existing general-purpose Greek models.
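The language-model side of the approach, clustering a user's past texts by topic and building a topic-specific language model per cluster, can be sketched roughly as follows. This is an illustrative toy, not the paper's pipeline: the LDA clustering, the topic count, the sample texts, and the unsmoothed bigram counts are all assumptions made for the example.

```python
# Sketch: cluster a user's past texts into topics (LDA), then build a
# simple per-topic bigram model that a decoder could use for rescoring.
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical past text samples of one user (e.g. e-mails, chat messages).
texts = [
    "meeting agenda project deadline report",
    "project report deadline meeting notes",
    "dinner recipe tomatoes pasta garlic",
    "pasta recipe garlic olive oil dinner",
]

# Topic assignment via LDA; the number of topics is a guess here.
vec = CountVectorizer()
X = vec.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_of = lda.fit_transform(X).argmax(axis=1)

# One bigram counter per topic: a minimal "topic-specific language model".
bigrams = defaultdict(Counter)
for text, topic in zip(texts, topic_of):
    tokens = ["<s>"] + text.split() + ["</s>"]
    bigrams[int(topic)].update(zip(tokens, tokens[1:]))

def bigram_prob(topic: int, prev: str, word: str) -> float:
    """Unsmoothed maximum-likelihood estimate of P(word | prev) in one topic."""
    counts = bigrams[topic]
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts[(prev, word)] / total if total else 0.0
```

In a real dictation system the per-topic counts would be smoothed and interpolated with a general-purpose Greek language model rather than used raw.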
Code Availability
The code is available on GitHub.
Funding
Part of this work was supported by Google Summer of Code as an open source project.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Antoniadis, P., Tsardoulias, E. & Symeonidis, A. A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case. Multimed Tools Appl 81, 40635–40652 (2022). https://doi.org/10.1007/s11042-022-12953-6