Abstract
This article describes the successful implementation of a conversational speech recognition system applied to telephone sales performed by an autonomous agent. Our implementation uses a post-processing corrector based on phonetic representations of the text, followed by a neural network classifier. The classifier assesses each proposed correction's relevance in order to reduce errors in the transcript sent to a downstream Natural Language Understanding engine. The experiments were carried out on correcting transcripts from real audio of orders placed by customers of a large bottling company. We measured the Word Error Rate of the corrected transcripts against human-annotated ground truth to verify the improvement produced by the system. To evaluate the corrections' impact on the entities detected by the Natural Language Understanding engine, we used the Jaccard distance, Precision, Recall, and \(F_1\). Results show that the implemented system and architecture improve the relative Word Error Rate of the transcripts by 39% and the Jaccard distance by 13% compared to the Automatic Speech Recognition baseline, making them suitable for implementation in real-time telephone sales systems.
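The two headline metrics in the abstract can be sketched as follows. This is an illustrative implementation under common definitions (Levenshtein word edits for WER, set-based Jaccard over detected entities), not the authors' code; the function names are ours.

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # all deletions
    for j in range(n + 1):
        d[0][j] = j  # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[m][n] / m if m else 0.0


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|, e.g. over sets of entities detected by the NLU engine."""
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)
```

For example, `wer("the cat sat".split(), "the cat sit".split())` yields 1/3 (one substitution over three reference words), and `jaccard_distance({"cola", "water"}, {"cola"})` yields 0.5. The paper's 39% and 13% figures are relative improvements of these metrics over the ASR baseline.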
References
Bassil, Y., Alwani, M.: Post-editing error correction algorithm for speech recognition using Bing spelling suggestion. Int. J. Adv. Comput. Sci. Appl. 3 (2012). https://doi.org/10.14569/IJACSA.2012.030217
Berg, M.: Modelling of natural dialogues in the context of speech-based information and control systems. Ph.D. thesis, July 2014
Campos-Sobrino, D., Campos-Soberanis, M., Martínez-Chin, I., Uc-Cetina, V.: Corrección de errores del reconocedor de voz de Google usando métricas de distancia fonética. Res. Comput. Sci. 148(1), 57–70 (2019)
Chan, W., Jaitly, N., Le, Q.V., Vinyals, O.: Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In: ICASSP (2016). http://williamchan.ca/papers/wchan-icassp-2016.pdf
Chorowski, J., Bahdanau, D., Cho, K., Bengio, Y.: End-to-end continuous speech recognition using attention-based recurrent NN: first results. In: NIPS 2014 Workshop on Deep Learning, December 2014 (2014)
Errattahi, R., Hannani, A.E., Ouahmane, H.: Automatic speech recognition errors detection and correction: a review. Procedia Comput. Sci. 128, 32–37 (2018). https://doi.org/10.1016/j.procs.2018.03.005, http://www.sciencedirect.com/science/article/pii/S1877050918302187, 1st International Conference on Natural Language and Speech Processing
Fang, A., Filice, S., Limsopatham, N., Rokhlenko, O.: Using Phoneme Representations to Build Predictive Models Robust to ASR Errors, pp. 699–708. ACM, July 2020. https://doi.org/10.1145/3397271.3401050
Ghannay, S., Caubrière, A., Estève, Y., Laurent, A., Morin, E.: End-to-end named entity extraction from speech. In: IEEE Spoken Language Technology Workshop (2018)
Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: 31st International Conference on Machine Learning (ICML 2014), vol. 5, pp. 1764–1772, January 2014
Graves, A.: Sequence transduction with recurrent neural networks (2012)
Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, January 2006. https://doi.org/10.1145/1143844.1143891
Haghani, P., et al.: From audio to semantics: approaches to end-to-end spoken language understanding. In: Spoken Language Technology Workshop (2018)
Liao, J., et al.: Improving readability for automatic speech recognition transcription (2020)
Limsopatham, N., Rokhlenko, O., Carmel, D.: Research challenges in building a voice-based artificial personal shopper - position paper. In: Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pp. 40–45, January 2018. https://doi.org/10.18653/v1/W18-5706
Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V.S., Bengio, Y.: Speech model pre-training for end-to-end spoken language understanding (2019)
Ogawa, A., Hori, T.: Error detection and accuracy estimation in automatic speech recognition using deep bidirectional recurrent neural networks. Speech Commun. 89 (2017). https://doi.org/10.1016/j.specom.2017.02.009
Qian, Y., et al.: Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 569–576 (2017). https://doi.org/10.1109/ASRU.2017.8268987
Schumann, R., Angkititrakul, P.: Incorporating ASR errors with attention-based, jointly trained RNN for intent detection and slot filling. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6059–6063, April 2018. https://doi.org/10.1109/ICASSP.2018.8461598
Shivakumar, P.G., Li, H., Knight, K., Georgiou, P.G.: Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling. CoRR abs/1802.02607 (2018), http://arxiv.org/abs/1802.02607
Song, S., Zhang, N., Huang, H.: Named entity recognition based on conditional random fields. Clust. Comput. 22, 1–12 (2019). https://doi.org/10.1007/s10586-017-1146-3
Twiefel, J., Baumann, T., Heinrich, S., Wermter, S.: Improving domain-independent cloud-based speech recognition with domain-dependent phonetic post-processing. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, vol. 2, pp. 1529–1535, July 2014
Viana-Cámara, R., Campos-Soberanis, M., Campos-Sobrino, D.: Modelo híbrido fonético-neural para corrección en sistemas de reconocimiento del habla. Res. Comput. Sci. 149(8), 1163–1177 (2020)
Vorontsov, I., Kulakovskiy, I., Makeev, V.: Jaccard index based similarity measure to compare transcription factor binding site models. Algorithms Mol. Biol. 8, 23 (2013). https://doi.org/10.1186/1748-7188-8-23
Vtyurina, A., Fourney, A., Morris, M., Findlater, L., White, R.: Bridging screen readers and voice assistants for enhanced eyes-free web search. In: The World Wide Web Conference, pp. 3590–3594, May 2019. https://doi.org/10.1145/3308558.3314136
Zobel, J., Dart, P.: Phonetic string matching: lessons from information retrieval, July 2002. https://doi.org/10.1145/243199.243258
© 2021 Springer Nature Switzerland AG
Cite this paper
Campos-Soberanis, M., Campos-Sobrino, D., Viana-Cámara, R. (2021). Improving a Conversational Speech Recognition System Using Phonetic and Neural Transcript Correction. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science(), vol 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_4
Print ISBN: 978-3-030-89819-9
Online ISBN: 978-3-030-89820-5