Abstract
This article describes the successful implementation of a conversational speech recognition system applied to telephone sales performed by an autonomous agent. Our implementation uses a post-processing corrector based on phonetic representations of the text, followed by a neural network classifier. The classifier assesses each proposed correction's relevance in order to reduce errors in the transcript sent to a downstream Natural Language Understanding engine. The experiments were carried out on correcting transcripts from real audio of orders placed by customers of a large bottling company. We measured the Word Error Rate of the corrected transcripts against human-annotated ground truth to verify the improvement produced by the system. To evaluate the corrections' impact on the entities detected by the Natural Language Understanding engine, we used the Jaccard distance, Precision, Recall, and \(F_1\). Results show that the implemented system and architecture improve the relative Word Error Rate of the transcripts by 39% and the Jaccard distance by 13% compared to the Automatic Speech Recognition baseline, making them suitable for implementation in real-time telephone sales systems.
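The two headline metrics in the abstract can be sketched as follows. This is an illustrative implementation under common definitions (Levenshtein word edits for WER, set-based Jaccard over detected entities), not the authors' code; the function names are ours.

```python
def wer(reference: list[str], hypothesis: list[str]) -> float:
    """Word Error Rate: word-level edit distance divided by reference length."""
    m, n = len(reference), len(hypothesis)
    # Dynamic-programming table for Levenshtein distance over words.
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # all deletions
    for j in range(n + 1):
        d[0][j] = j  # all insertions
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
    return d[m][n] / m if m else 0.0


def jaccard_distance(a: set, b: set) -> float:
    """1 - |A ∩ B| / |A ∪ B|, e.g. over sets of entities detected by the NLU engine."""
    if not a and not b:
        return 0.0
    return 1 - len(a & b) / len(a | b)
```

For example, `wer("the cat sat".split(), "the cat sit".split())` yields 1/3 (one substitution over three reference words), and `jaccard_distance({"cola", "water"}, {"cola"})` yields 0.5. The paper's 39% and 13% figures are relative improvements of these metrics over the ASR baseline.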
References
Bassil, Y., Alwani, M.: Post-editing error correction algorithm for speech recognition using Bing spelling suggestion. Int. J. Adv. Comput. Sci. Appl. 3 (2012). https://doi.org/10.14569/IJACSA.2012.030217
Berg, M.: Modelling of natural dialogues in the context of speech-based information and control systems. Ph.D. thesis, July 2014
Campos-Sobrino, D., Campos-Soberanis, M., Martínez-Chin, I., Uc-Cetina, V.: Corrección de errores del reconocedor de voz de Google usando métricas de distancia fonética. Res. Comput. Sci. 148(1), 57–70 (2019)
Chan, W., Jaitly, N., Le, Q.V., Vinyals, O.: Listen, attend and spell: A neural network for large vocabulary conversational speech recognition. In: ICASSP (2016). http://williamchan.ca/papers/wchan-icassp-2016.pdf
Chorowski, J., Bahdanau, D., Cho, K., Bengio, Y.: End-to-end continuous speech recognition using attention-based recurrent NN: first results. In: NIPS 2014 Workshop on Deep Learning, December 2014 (2014)
Errattahi, R., Hannani, A.E., Ouahmane, H.: Automatic speech recognition errors detection and correction: a review. Procedia Comput. Sci. 128, 32–37 (2018). https://doi.org/10.1016/j.procs.2018.03.005, http://www.sciencedirect.com/science/article/pii/S1877050918302187, 1st International Conference on Natural Language and Speech Processing
Fang, A., Filice, S., Limsopatham, N., Rokhlenko, O.: Using Phoneme Representations to Build Predictive Models Robust to ASR Errors, pp. 699–708. ACM, July 2020. https://doi.org/10.1145/3397271.3401050
Ghannay, S., Caubrière, A., Estève, Y., Laurent, A., Morin, E.: End-to-end named entity extraction from speech. In: IEEE Spoken Language Technology Workshop (2018)
Graves, A., Jaitly, N.: Towards end-to-end speech recognition with recurrent neural networks. In: 31st International Conference on Machine Learning (ICML 2014), vol. 5, pp. 1764–1772, January 2014
Graves, A.: Sequence transduction with recurrent neural networks (2012)
Graves, A., Fernández, S., Gomez, F., Schmidhuber, J.: Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In: Proceedings of the 23rd International Conference on Machine Learning, pp. 369–376, January 2006. https://doi.org/10.1145/1143844.1143891
Haghani, P., et al.: From audio to semantics: approaches to end-to-end spoken language understanding. In: Spoken Language Technology Workshop (2018)
Liao, J., et al.: Improving readability for automatic speech recognition transcription (2020)
Limsopatham, N., Rokhlenko, O., Carmel, D.: Research challenges in building a voice-based artificial personal shopper - position paper. In: Proceedings of the 2018 EMNLP Workshop SCAI: The 2nd International Workshop on Search-Oriented Conversational AI, pp. 40–45, January 2018. https://doi.org/10.18653/v1/W18-5706
Lugosch, L., Ravanelli, M., Ignoto, P., Tomar, V.S., Bengio, Y.: Speech model pre-training for end-to-end spoken language understanding (2019)
Ogawa, A., Hori, T.: Error detection and accuracy estimation in automatic speech recognition using deep bidirectional recurrent neural networks. Speech Commun. 89 (2017). https://doi.org/10.1016/j.specom.2017.02.009
Qian, Y., et al.: Exploring ASR-free end-to-end modeling to improve spoken language understanding in a cloud-based dialog system. In: 2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp. 569–576 (2017). https://doi.org/10.1109/ASRU.2017.8268987
Schumann, R., Angkititrakul, P.: Incorporating ASR errors with attention-based, jointly trained RNN for intent detection and slot filling. In: 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6059–6063, April 2018. https://doi.org/10.1109/ICASSP.2018.8461598
Shivakumar, P.G., Li, H., Knight, K., Georgiou, P.G.: Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling. CoRR abs/1802.02607 (2018), http://arxiv.org/abs/1802.02607
Song, S., Zhang, N., Huang, H.: Named entity recognition based on conditional random fields. Clust. Comput. 22, 1–12 (2019). https://doi.org/10.1007/s10586-017-1146-3
Twiefel, J., Baumann, T., Heinrich, S., Wermter, S.: Improving domain-independent cloud-based speech recognition with domain-dependent phonetic post-processing. In: Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence, vol. 2, pp. 1529–1535, July 2014
Viana-Cámara, R., Campos-Soberanis, M., Campos-Sobrino, D.: Modelo híbrido fonético-neural para corrección en sistemas de reconocimiento del habla. Res. Comput. Sci. 149(8), 1163–1177 (2020)
Vorontsov, I., Kulakovskiy, I., Makeev, V.: Jaccard index based similarity measure to compare transcription factor binding site models. Algorithms Mol. Biol. 8, 23 (2013). https://doi.org/10.1186/1748-7188-8-23
Vtyurina, A., Fourney, A., Morris, M., Findlater, L., White, R.: Bridging screen readers and voice assistants for enhanced eyes-free web search. In: The World Wide Web Conference, pp. 3590–3594, May 2019. https://doi.org/10.1145/3308558.3314136
Zobel, J., Dart, P.: Phonetic string matching: lessons from information retrieval, July 2002. https://doi.org/10.1145/243199.243258
© 2021 Springer Nature Switzerland AG
Cite this paper
Campos-Soberanis, M., Campos-Sobrino, D., Viana-Cámara, R. (2021). Improving a Conversational Speech Recognition System Using Phonetic and Neural Transcript Correction. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science(), vol 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_4
Print ISBN: 978-3-030-89819-9
Online ISBN: 978-3-030-89820-5