Downdating Lexicon and Language Model for Automatic Transcription of Czech Historical Spoken Documents

Chaloupka, Josef; Nouza, Jan; Červa, Petr; Málek, Jiří

doi:10.1007/978-3-642-40585-3_26

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8082)

Included in the following conference series:

International Conference on Text, Speech and Dialogue

Abstract

This paper addresses the adaptation of an existing Czech large-vocabulary continuous speech recognition (LVCSR) system to the language used in earlier historical periods (before 1990). The goal is to adapt its lexicon and language model (LM) so that the system can be used for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts in electronic form from the 1945-1990 period. The only available source that is large enough is a set of digitized copies of Rudé Právo, the newspaper of the former Communist Party of Czechoslovakia, the de facto ruling body of the state. The newspaper was scanned and converted to text by OCR software. However, the number of OCR errors is very high, so several text pre-processing techniques have to be applied to obtain a corpus suitable for 'downdating' the lexicon and language model (i.e., adapting them to the past). The proposed techniques helped us (a) to reduce the number of out-of-vocabulary strings from 8.5 to 6.4 million, (b) to identify 6.7 thousand history-conditioned word candidates to be added to the lexicon, and (c) to build a more appropriate LM. The adapted LVCSR system was evaluated on broadcast news from 1969-1989, where its word error rate (WER) decreased from 17.05% to 14.33%.
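To make the quantities in the abstract concrete, the following is a minimal sketch of how out-of-vocabulary (OOV) strings, word error rate, and relative reduction are commonly computed. It is not the authors' tooling: the function names `oov_counts`, `word_error_rate`, and `relative_reduction` are hypothetical, and the definitions used are the standard textbook ones (WER as word-level edit distance divided by the number of reference words).

```python
from collections import Counter


def oov_counts(tokens, lexicon):
    """Count tokens that do not appear in the lexicon (OOV strings)."""
    return Counter(t for t in tokens if t not in lexicon)


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion of a reference word
                curr[j - 1] + 1,         # insertion of a hypothesis word
                prev[j - 1] + (r != h),  # substitution (cost 0 on a match)
            ))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before
```

Applied to the figures reported above, `relative_reduction(17.05, 14.33)` gives a relative WER reduction of about 15.95% (2.72 percentage points absolute), and `relative_reduction(8.5, 6.4)` gives a reduction of about 24.7% in OOV strings.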

Author information

Authors and Affiliations

SpeechLab, Faculty of Mechatronics, Informatics and Interdisciplinary Studies, Technical University of Liberec, Studentská 2, 461 17, Liberec, Czech Republic
Josef Chaloupka, Jan Nouza, Petr Červa & Jiří Málek

Editor information

Editors and Affiliations

University of West Bohemia, 306 14, Pilsen, Czech Republic
Ivan Habernal & Václav Matoušek

About this paper

Cite this paper

Chaloupka, J., Nouza, J., Červa, P., Málek, J. (2013). Downdating Lexicon and Language Model for Automatic Transcription of Czech Historical Spoken Documents. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_26

DOI: https://doi.org/10.1007/978-3-642-40585-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science, Computer Science (R0)
