Improving a Conversational Speech Recognition System Using Phonetic and Neural Transcript Correction

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13068)

Abstract

This article describes the successful implementation of a conversational speech recognition system applied to telephone sales performed by an autonomous agent. Our implementation uses a post-processing corrector based on phonetic representations of text, followed by a neural network classifier. The classifier assesses the relevance of each proposed correction in order to reduce the errors in the transcript sent to a downstream Natural Language Understanding engine. The experiments were carried out on transcripts of real audio recordings of orders placed by customers of a large bottling company. We measured the Word Error Rate of the corrected transcripts against human-annotated ground truth to verify the improvement produced by the system. To evaluate the impact of the corrections on the entities detected by the Natural Language Understanding engine, we used Jaccard distance, Precision, Recall, and \(F_1\). Results show that the implemented system and architecture improve the relative Word Error Rate of the transcripts by 39% and the Jaccard distance by 13% compared with the Automatic Speech Recognition baseline, making them suitable for real-time telephone sales systems.
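
The two headline metrics named in the abstract can be illustrated with a minimal Python sketch. It assumes the standard definitions: Word Error Rate as word-level edit distance divided by the number of reference words, and Jaccard distance as one minus intersection over union of two entity sets. The abstract does not spell out the formulas or the tokenization the authors used, so the function names, the whitespace tokenization, and the sample sentences below are illustrative assumptions rather than the paper's implementation.

```python
from typing import Set


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()  # assumption: simple whitespace tokenization
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing in hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


def jaccard_distance(expected: Set[str], detected: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not expected and not detected:
        return 0.0
    return 1.0 - len(expected & detected) / len(expected | detected)


def relative_improvement(baseline: float, corrected: float) -> float:
    """Fraction of the baseline error removed by the correction step."""
    if baseline == 0:
        raise ValueError("baseline error must be non-zero")
    return (baseline - corrected) / baseline


if __name__ == "__main__":
    # Hypothetical example sentences, not taken from the paper's data.
    truth = "two boxes of cola please"
    asr_output = "two foxes of cola"
    corrected = "two boxes of cola"
    wer_asr = word_error_rate(truth, asr_output)
    wer_corrected = word_error_rate(truth, corrected)
    print(wer_asr, wer_corrected, relative_improvement(wer_asr, wer_corrected))
    print(jaccard_distance({"two", "cola"}, {"cola"}))
```

Under these assumptions, a relative Word Error Rate improvement such as the reported 39% would correspond to `relative_improvement` returning 0.39 when given the baseline and corrected error rates.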

Corresponding author

Correspondence to Diego Campos-Sobrino.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Campos-Soberanis, M., Campos-Sobrino, D., Viana-Cámara, R. (2021). Improving a Conversational Speech Recognition System Using Phonetic and Neural Transcript Correction. In: Batyrshin, I., Gelbukh, A., Sidorov, G. (eds) Advances in Soft Computing. MICAI 2021. Lecture Notes in Computer Science, vol 13068. Springer, Cham. https://doi.org/10.1007/978-3-030-89820-5_4

  • DOI: https://doi.org/10.1007/978-3-030-89820-5_4

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-89819-9

  • Online ISBN: 978-3-030-89820-5

  • eBook Packages: Computer Science, Computer Science (R0)
