Abstract
In recent years, automatic speech recognition (ASR) systems have been increasingly used in meetings, for example for minutes generation and speaker diarization. However, ASR systems often misrecognize words because meetings contain domain-specific content. In this paper, we propose a novel method for automatically post-editing ASR results using the presentation slides that meeting participants use and the utterances adjacent to a target utterance. We focus on automatic post-editing rather than domain adaptation because external information is easy to incorporate and because the method can be applied to arbitrary speech recognition engines. In experiments, we found that our method can significantly improve the recognition accuracy of domain-specific words (proper nouns). We also found an improvement in the word error rate (WER).
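For readers unfamiliar with the metric, WER is conventionally the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that standard definition, not the evaluation code used in the paper; the function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: both strings are already tokenized by whitespace. A language
    written without spaces would need a tokenizer applied beforehand.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words yields an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Under this definition, a post-edited transcript improves WER when it needs fewer word-level edits to match the reference than the raw ASR output does. Note that WER can exceed 1.0 when the hypothesis contains many inserted words.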
Copyright information
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Kamiya, K., Kawase, T., Higashinaka, R., Nagao, K. (2021). Using Presentation Slides and Adjacent Utterances for Post-editing of Speech Recognition Results for Meeting Recordings. In: Ekštein, K., Pártl, F., Konopík, M. (eds) Text, Speech, and Dialogue. TSD 2021. Lecture Notes in Computer Science, vol 12848. Springer, Cham. https://doi.org/10.1007/978-3-030-83527-9_28
DOI: https://doi.org/10.1007/978-3-030-83527-9_28
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-83526-2
Online ISBN: 978-3-030-83527-9
eBook Packages: Computer Science, Computer Science (R0)