Abstract
Automatic Speech Recognition (ASR) has become increasingly popular, since it significantly simplifies human-computer interaction and provides a more intuitive way of communicating. Building an accurate, general-purpose ASR system is a challenging task that requires large amounts of data and computing power. Especially for less widely spoken languages, such as Greek, the lack of sufficiently large speech datasets leads to ASR systems adapted to a restricted corpus and/or specific topics. When used in narrow domains, such systems can be both accurate and fast, without requiring large datasets or extended training. An interesting application domain of such narrow-scope ASR systems is the development of personalized models for dictation. In the current work we present three personalization-via-adaptation modules that can be integrated into any ASR/dictation system to increase its accuracy. The adaptation can be applied both to the language model (based on past text samples of the user) and to the acoustic model (using a set of the user's narrations). To provide more precise recommendations, clustering algorithms are applied and topic-specific language models are created. Heterogeneous adaptation methods are also combined to provide recommendations to the user. Evaluation on a self-created dataset of 746 text samples drawn from the same user's messaging applications and e-mails demonstrates that the proposed approach achieves better results than the existing general-purpose Greek models.
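The language-model side of the approach, clustering a user's past texts by topic and building a topic-specific language model per cluster, can be sketched roughly as follows. This is an illustrative toy, not the paper's pipeline: the LDA clustering, the topic count, the sample texts, and the unsmoothed bigram counts are all assumptions made for the example.

```python
# Sketch: cluster a user's past texts into topics (LDA), then build a
# simple per-topic bigram model that a decoder could use for rescoring.
from collections import Counter, defaultdict
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Hypothetical past text samples of one user (e.g. e-mails, chat messages).
texts = [
    "meeting agenda project deadline report",
    "project report deadline meeting notes",
    "dinner recipe tomatoes pasta garlic",
    "pasta recipe garlic olive oil dinner",
]

# Topic assignment via LDA; the number of topics is a guess here.
vec = CountVectorizer()
X = vec.fit_transform(texts)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topic_of = lda.fit_transform(X).argmax(axis=1)

# One bigram counter per topic: a minimal "topic-specific language model".
bigrams = defaultdict(Counter)
for text, topic in zip(texts, topic_of):
    tokens = ["<s>"] + text.split() + ["</s>"]
    bigrams[int(topic)].update(zip(tokens, tokens[1:]))

def bigram_prob(topic: int, prev: str, word: str) -> float:
    """Unsmoothed maximum-likelihood estimate of P(word | prev) in one topic."""
    counts = bigrams[topic]
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts[(prev, word)] / total if total else 0.0
```

In a real dictation system the per-topic counts would be smoothed and interpolated with a general-purpose Greek language model rather than used raw.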
Code Availability
The code is available on GitHub.
Funding
Part of this work was supported by Google Summer of Code as an open source project.
Author information
Authors and Affiliations
Corresponding author
Ethics declarations
Conflict of Interests
The authors declare that they have no conflict of interest.
Additional information
Publisher’s note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Antoniadis, P., Tsardoulias, E. & Symeonidis, A. A mechanism for personalized Automatic Speech Recognition for less frequently spoken languages: the Greek case. Multimed Tools Appl 81, 40635–40652 (2022). https://doi.org/10.1007/s11042-022-12953-6