Abstract
Voice assistants for information retrieval are spreading rapidly, driven by technological advances and seamless integration across devices. Nevertheless, these systems remain limited in accuracy and comprehension, particularly with accents, dialects, and uncommon terminology. Further obstacles are the cost of these technologies and their reliance on internet connectivity. This study evaluates several low-cost speech-to-text transcription tools: Windows 10, Google Docs, Gboard (Android), SpeechTexter, and Speechnotes. The analysis focuses on categories that are prone to transcription errors in text retrieval, such as proper names, homophones, neologisms, and multilingual usage. The variables examined include user age, message duration, and ambient noise level. Transcription quality is assessed to determine the efficacy of voice retrieval. The results reveal significant disparities among the tools, with Gboard (Android) showing the highest accuracy and the lowest error rates.
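The abstract does not state which metric underlies the reported error rates. As an illustration only, the sketch below computes word error rate (WER), a common measure of transcription quality: the word-level edit distance between a reference transcript and the software output, divided by the number of reference words. The function name and the example sentences are hypothetical and are not taken from the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch: the study may have used a different metric
    or a different text normalisation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a proper name transcribed incorrectly.
print(word_error_rate("call doctor smith now", "call doctor smyth now"))  # 0.25
```

A single misrecognised proper name in a four-word message already yields a WER of 0.25, which suggests why short messages containing names or neologisms can weigh heavily in a comparison of this kind.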
Online Resources
AssemblyAI: The #1 Speech-to-Text API for Developers. https://www.assemblyai.com/
Cloud Speech-to-Text API - Marketplace - Google Cloud Platform. https://console.cloud.google.com/marketplace/product/google/speech.googleapis.com
Google Documents: create and edit documents online for free. https://www.google.es/intl/es/docs/about/
Gboard: Google's keyboard - Apps on Google Play. https://play.google.com/store/apps/details?id=com.google.android.inputmethod.latin&hl=es&gl=US
Speech to Text Online Notepad. Free. Speechnotes. https://speechnotes.co/
SpeechTexter. Type with your voice online. Speech Texter. https://www.speechtexter.com
Funding
Research partially funded by the R&D grant from the Autonomous Community of Madrid (PHS-2024/PH-HUM-313).
Copyright information
© 2025 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Morato, J., Pedrero, A., Sanchez-Cuadrado, S. (2025). Comparative Evaluation of Speech-to-Text Software Based on Sociodemographic and Environmental Factors. In: Guarda, T., Portela, F., Augusto, M.F. (eds) Advanced Research in Technologies, Information, Innovation and Sustainability. ARTIIS 2024. Communications in Computer and Information Science, vol 2349. Springer, Cham. https://doi.org/10.1007/978-3-031-83432-5_20
DOI: https://doi.org/10.1007/978-3-031-83432-5_20
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-83431-8
Online ISBN: 978-3-031-83432-5
eBook Packages: Computer Science, Computer Science (R0)