Abstract
Many text mining algorithms and applications require large text corpora together with certain statistics-based annotations. To ensure comparability of results, a standardized corpus building process is required. The pre-processing procedures are particularly noteworthy, as they are crucial for the quality of the resulting data stock. This quality can be estimated both by evaluating the corpus building process and by statistical quality measurements on the corpus. Some of these approaches are described using the example of the Leipzig Corpora Collection.
Acknowledgements
We thank both anonymous reviewers and editors for their valuable hints and comments, which helped in finalizing the chapter.
Copyright information
© 2014 Springer International Publishing Switzerland
About this chapter
Cite this chapter
Quasthoff, U., Goldhahn, D., Eckart, T. (2014). Building Large Resources for Text Mining: The Leipzig Corpora Collection. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_1
DOI: https://doi.org/10.1007/978-3-319-12655-5_1
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12654-8
Online ISBN: 978-3-319-12655-5
eBook Packages: Computer Science, Computer Science (R0)