Abstract
This paper addresses the adaptation of an existing Czech large-vocabulary continuous speech recognition (LVCSR) system to the language used in earlier historical periods (before 1990). The goal is to adapt its lexicon and language model (LM) so that the system can be used for the automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of electronic texts from the 1945-1990 period. The only available source that is large enough consists of digitized copies of Rudé Právo, the newspaper of the former Communist Party of Czechoslovakia, the de facto ruling body of the state. The newspaper has been scanned and converted into text by OCR software. However, the number of OCR errors is very high, so we have to apply several text pre-processing techniques to obtain a corpus suitable for 'downdating' the lexicon and language model (i.e. adapting them to the past). The proposed techniques helped us a) to reduce the number of out-of-vocabulary strings from 8.5 million to 6.4 million, b) to identify 6.7 thousand history-conditioned word candidates to be added to the lexicon, and c) to build a more appropriate LM. The adapted LVCSR system was evaluated on broadcast news from 1969-1989, where its word error rate decreased from 17.05% to 14.33%.
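For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how a word error rate (WER) is conventionally computed and how the reported drop from 17.05% to 14.33% translates into a relative reduction. It is not the authors' evaluation code: the function names `word_error_rate` and `relative_reduction` and the toy sentences are assumptions made for illustration, and the sketch uses the standard word-level edit distance with whitespace tokenization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / number of reference words.

    Hypothetical helper: whitespace tokenization, standard Levenshtein distance.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Relative improvement of a rate, e.g. WER before and after adaptation."""
    return (before - after) / before


if __name__ == "__main__":
    # Toy example (invented sentences): one substitution and one deletion
    # against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
    # Figures reported in the abstract, in percent.
    print(f"{relative_reduction(17.05, 14.33):.1%}")
```

Under these assumptions, the absolute improvement of 2.72 percentage points corresponds to a relative WER reduction of roughly 16%.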
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chaloupka, J., Nouza, J., Červa, P., Málek, J. (2013). Downdating Lexicon and Language Model for Automatic Transcription of Czech Historical Spoken Documents. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_26
DOI: https://doi.org/10.1007/978-3-642-40585-3_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40584-6
Online ISBN: 978-3-642-40585-3
eBook Packages: Computer Science, Computer Science (R0)