
General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes

  • Original Paper
  • Published in Language Resources and Evaluation 48, 227–248 (2014)

Abstract

The paper describes a general framework for mining large amounts of text data from a defined set of Web pages. The acquired data are meant to constitute a corpus for training robust and reliable language models, so the framework also incorporates algorithms for appropriate text processing and duplicate detection in order to ensure the quality and consistency of the data. As we expect the resulting corpus to be very large, we have also implemented topic detection algorithms that allow us to automatically select subcorpora for domain-specific language models. The description of the framework architecture and the implemented algorithms is complemented with a detailed evaluation section. It analyses the basic properties of the gathered Czech corpus, which contains more than one billion text tokens collected using the described framework, shows the results of the topic detection methods, and finally describes the design and outcomes of automatic speech recognition experiments with domain-specific language models estimated from the collected data.
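The duplicate detection mentioned in the abstract is only named here, not described. One standard approach for Web corpora is shingle-based near-duplicate detection; the following is a minimal sketch of that idea under illustrative assumptions (word 5-gram shingles, a Jaccard-similarity threshold of 0.8, and the function names are all assumptions, not the authors' implementation):

```python
import re
from typing import Set


def shingles(text: str, n: int = 5) -> Set[str]:
    """Return the set of word n-grams (shingles) of a document."""
    tokens = re.findall(r"\w+", text.lower())
    return {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


def is_near_duplicate(doc_a: str, doc_b: str, threshold: float = 0.8) -> bool:
    """Flag two documents as near-duplicates if their shingle sets overlap heavily."""
    return jaccard(shingles(doc_a), shingles(doc_b)) >= threshold
```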


Notes

  1. Note that the number of occurrences of the word “Havel” is divided by a factor of ten in order to bring it to the same scale as the other two examples.

  2. For example, the Unicode standard defines a special glyph for the ligature “fi”. Such ligatures are substituted with the sequence of characters “f” and “i”; a sketch of this normalization step is given after these notes.

  3. http://www.cs.hmc.edu/~geoff/ispell.html.

  4. We considered using longer token sequences, but as the processed documents are typically rather short (545 words on average), the use of higher-order n-grams resulted in severe data sparsity.

  5. Note that assuming the topics are known before the actual broadcast is not unrealistic: the main “themes” of each debate are published on the broadcaster’s website beforehand.

  6. Please note that even though our decoder can handle a lexicon of up to one million words (which makes it one of the world’s best in this respect), it still cannot accommodate all the words occurring in our corpora, not even just those that occurred at least five times (see Fig. 5). A sketch of such a frequency cutoff is given after these notes.
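Regarding note 2, a minimal sketch of one way to expand ligature glyphs in Python is shown below. It relies on Unicode compatibility (NFKC) normalization, which is an assumption about the mechanism rather than the authors' actual implementation:

```python
import unicodedata


def expand_ligatures(text: str) -> str:
    """Replace compatibility characters such as the 'fi' ligature (U+FB01)
    with their plain-character equivalents via NFKC normalization.
    (Assumed mechanism; NFKC also applies other compatibility mappings.)"""
    return unicodedata.normalize("NFKC", text)


# The ligature glyph "fi" becomes the two characters "f" and "i".
assert expand_ligatures("\ufb01le") == "file"
```

Regarding note 6, a minimal sketch of a lexicon built with a frequency cutoff follows. Only the "at least five occurrences" cutoff and the one-million-word decoder capacity are taken from the note; the function name and interface are illustrative assumptions:

```python
from collections import Counter
from typing import Iterable, List


def build_lexicon(tokens: Iterable[str], min_count: int = 5,
                  max_size: int = 1_000_000) -> List[str]:
    """Keep words seen at least `min_count` times, capped at the assumed
    maximum lexicon size of the decoder, most frequent words first."""
    counts = Counter(tokens)
    frequent = [w for w, c in counts.most_common() if c >= min_count]
    return frequent[:max_size]
```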
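Both sketches are self-contained and can be run directly; they are meant only to make the footnotes concrete, not to reproduce the framework described in the paper.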


Acknowledgements

This work has been supported by a grant of the University of West Bohemia, project No. SGS-2010-054, and by the Grant Agency of the Czech Republic, project No. GAČR P103/12/G084. Access to the MetaCentrum computing facilities, provided under the programme “Projects of Large Infrastructure for Research, Development, and Innovations” LM2010005 and funded by the Ministry of Education, Youth, and Sports of the Czech Republic, is appreciated.

Author information

Correspondence to Pavel Ircing.


About this article

Cite this article

Švec, J., Lehečka, J., Ircing, P. et al. General framework for mining, processing and storing large amounts of electronic texts for language modeling purposes. Lang Resources & Evaluation 48, 227–248 (2014). https://doi.org/10.1007/s10579-013-9246-z
