Downdating Lexicon and Language Model for Automatic Transcription of Czech Historical Spoken Documents

  • Conference paper
Text, Speech, and Dialogue (TSD 2013)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 8082)

Abstract

This paper addresses the adaptation of an existing Czech large-vocabulary continuous speech recognition (LVCSR) system to the language used in earlier historical periods (before 1990). The goal is to adapt its lexicon and language model (LM) so that the system can be used for automatic transcription of old spoken documents in the Czech Radio archive. The main problem is the lack of texts in electronic form from the 1945–1990 period. The only available source that is large enough is the digitized copies of Rudé Právo, the newspaper of the former Communist Party of Czechoslovakia, the de facto ruling body of the state. The newspaper was scanned and converted into text with OCR software. However, the number of OCR errors is very high, so several text pre-processing techniques have to be applied to obtain a corpus suitable for 'downdating' the lexicon and language model (i.e., adapting them to the past). The proposed techniques helped us (a) to reduce the number of out-of-vocabulary strings from 8.5 to 6.4 million, (b) to identify 6.7 thousand history-conditioned word candidates to be added to the lexicon, and (c) to build a more appropriate LM. The adapted LVCSR system was evaluated on broadcast news from 1969–1989, where its word error rate decreased from 17.05% to 14.33%.
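To put the reported figures in perspective, the following is a minimal sketch of how a word error rate and a relative reduction can be computed. It is not taken from the paper: the function names `word_error_rate` and `relative_reduction` are hypothetical, and the WER definition used here (word-level edit distance divided by the number of reference words) is the common textbook one, assumed rather than confirmed as the authors' exact scoring procedure.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: whitespace tokenization and unit costs for substitution,
    insertion, and deletion. Real scoring tools may normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction by which a quantity dropped relative to its starting value."""
    if before == 0:
        raise ValueError("starting value must be non-zero")
    return (before - after) / before


if __name__ == "__main__":
    # Figures quoted in the abstract.
    print(f"WER reduction: {relative_reduction(17.05, 14.33):.1%}")
    print(f"OOV reduction: {relative_reduction(8.5, 6.4):.1%}")
```

Under these assumptions, the drop from 17.05% to 14.33% corresponds to a relative WER reduction of roughly 16%, and the drop from 8.5 to 6.4 million out-of-vocabulary strings to a relative reduction of roughly 25%.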

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chaloupka, J., Nouza, J., Červa, P., Málek, J. (2013). Downdating Lexicon and Language Model for Automatic Transcription of Czech Historical Spoken Documents. In: Habernal, I., Matoušek, V. (eds) Text, Speech, and Dialogue. TSD 2013. Lecture Notes in Computer Science, vol 8082. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40585-3_26

  • DOI: https://doi.org/10.1007/978-3-642-40585-3_26

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40584-6

  • Online ISBN: 978-3-642-40585-3

  • eBook Packages: Computer Science, Computer Science (R0)
