Building Large Resources for Text Mining: The Leipzig Corpora Collection

Quasthoff, Uwe; Goldhahn, Dirk; Eckart, Thomas

doi:10.1007/978-3-319-12655-5_1

Part of the book series: Theory and Applications of Natural Language Processing (NLP)

Abstract

Many text mining algorithms and applications require large text corpora and certain statistics-based annotations. To ensure comparability of results, a standardized corpus building process is required. The pre-processing procedures are particularly noteworthy, as they are crucial for the quality of the resulting data stock. This quality can be estimated both by evaluating the corpus building process and by applying statistical quality measurements to the corpus. Some of these approaches are described using the example of the Leipzig Corpora Collection.

Acknowledgements

We thank both anonymous reviewers and the editors for their valuable hints and comments, which helped to finalize the chapter.

Author information

Authors and Affiliations

Natural Language Processing Group, University of Leipzig, Leipzig, Germany
Uwe Quasthoff, Dirk Goldhahn & Thomas Eckart

Corresponding author

Correspondence to Uwe Quasthoff.

Editor information

Editors and Affiliations

Computer Science Department, Technische Universität Darmstadt FG Language Technology, Darmstadt, Germany
Chris Biemann
Computer Science Department, Goethe University WG Text Technology, Frankfurt am Main, Hessen, Germany
Alexander Mehler

About this chapter

Cite this chapter

Quasthoff, U., Goldhahn, D., Eckart, T. (2014). Building Large Resources for Text Mining: The Leipzig Corpora Collection. In: Biemann, C., Mehler, A. (eds) Text Mining. Theory and Applications of Natural Language Processing. Springer, Cham. https://doi.org/10.1007/978-3-319-12655-5_1

DOI: https://doi.org/10.1007/978-3-319-12655-5_1
Published: 13 December 2014
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12654-8
Online ISBN: 978-3-319-12655-5
eBook Packages: Computer Science, Computer Science (R0)
