
Autoblog 2021: The Importance of Language Models for Spontaneous Lecture Speech

  • Conference paper
  • Published in: Text, Speech, and Dialogue (TSD 2022)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 13502)

Abstract

The demand for both the quantity and quality of online educational resources has skyrocketed during the pandemic of the last two years. Entire course series have since been recorded and distributed online. To reach a broader audience, videos can be transcribed, combined with supplementary material (e.g. lecture slides), and published in the style of blog posts. This was done previously for Autoblog 2020, a corpus of lecture recordings converted to blog posts using automatic speech recognition (ASR) for subtitle creation. This work introduces a second series of recorded and manually transcribed lecture videos. The corresponding data includes lecture videos, slides, and blog posts/transcripts with aligned slide images, and is published under a Creative Commons license. A state-of-the-art Wav2Vec ASR model was used for automatic transcription of the content, combined with different n-gram language models (LMs). The results were compared to the human ground-truth annotation. Findings indicate that the ASR model performed well on spontaneous lecture speech. Furthermore, LMs trained on large amounts of data with fewer out-of-vocabulary words were outperformed by much smaller LMs estimated over in-domain language. Annotated lecture recordings are thus helpful for creating task-specific ASR solutions as well as for validating them against a human ground truth.
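
The abstract describes the recognition setup only at a high level. The following is a minimal sketch of such a pipeline, assuming a Hugging Face wav2vec 2.0 checkpoint, pyctcdecode for folding a KenLM n-gram LM into CTC beam-search decoding, and jiwer for the word error rate; it is illustrative rather than the authors' implementation, and all file names and the LM path are hypothetical.

```python
# Hedged sketch: wav2vec 2.0 CTC transcription with an n-gram LM in the decoder,
# scored against a human reference. Checkpoint, file paths, and LM are assumptions.
import soundfile as sf
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor
from pyctcdecode import build_ctcdecoder
from jiwer import wer

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-base-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")

# Labels must follow the model's output indices; wav2vec2 marks word boundaries
# with "|", which pyctcdecode expects as a plain space.
vocab = processor.tokenizer.get_vocab()
labels = [tok for tok, _ in sorted(vocab.items(), key=lambda kv: kv[1])]
labels = [" " if tok == "|" else tok for tok in labels]

# "lecture_3gram.arpa" stands in for a small in-domain KenLM model estimated over
# lecture transcripts (e.g. `lmplz -o 3 < transcripts.txt > lecture_3gram.arpa`);
# its text must match the casing of the acoustic model's vocabulary.
decoder = build_ctcdecoder(labels, kenlm_model_path="lecture_3gram.arpa")

speech, sr = sf.read("lecture_segment.wav")  # 16 kHz mono assumed
inputs = processor(speech, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    logits = model(inputs.input_values).logits[0].cpu().numpy()

hypothesis = decoder.decode(logits).lower()
reference = open("lecture_segment.txt").read().lower().strip()  # manual transcript
print(f"WER: {wer(reference, hypothesis):.3f}")
```

Decoding the same logits once with a large general-purpose LM and once with a small in-domain LM would reproduce the comparison the abstract describes; omitting `kenlm_model_path` yields LM-free beam search as a baseline.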

A. Hernandez and P. Klumpp contributed equally.

Notes

  1. https://github.com/amsehili/auditok/ (audio activity detection and segmentation; a usage sketch follows this list)

  2. https://www.kaggle.com/datasets/abnerh/autoblog

  3. https://github.com/biomedical-translation-corpora/corpora
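
Note 1 points to auditok, an audio activity detection tool. Below is a minimal, hypothetical sketch of how such a tool can presegment a long lecture recording into utterance-sized chunks before transcription; the file name and all thresholds are illustrative, not the settings used for the corpus.

```python
# Hypothetical preprocessing step: silence-based segmentation with auditok.
import auditok

regions = auditok.split(
    "lecture.wav",         # hypothetical input recording
    min_dur=0.5,           # discard events shorter than 0.5 s
    max_dur=30,            # cap segment length for the ASR model
    max_silence=0.3,       # split on pauses longer than 0.3 s
    energy_threshold=50,   # tune to the recording conditions
)
for i, region in enumerate(regions):
    region.save(f"segment_{i:04d}.wav")
    print(f"segment {i}: {region.meta.start:.2f}s - {region.meta.end:.2f}s")
```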


Author information

Correspondence to Abner Hernandez.

Copyright information

© 2022 Springer Nature Switzerland AG

About this paper

Cite this paper

Hernandez, A., Klumpp, P., Das, B., Maier, A., Yang, S.H. (2022). Autoblog 2021: The Importance of Language Models for Spontaneous Lecture Speech. In: Sojka, P., Horák, A., Kopeček, I., Pala, K. (eds) Text, Speech, and Dialogue. TSD 2022. Lecture Notes in Computer Science, vol 13502. Springer, Cham. https://doi.org/10.1007/978-3-031-16270-1_24

  • DOI: https://doi.org/10.1007/978-3-031-16270-1_24

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-16269-5

  • Online ISBN: 978-3-031-16270-1

  • eBook Packages: Computer Science, Computer Science (R0)
