Automatic Analysis of Large Text Corpora - A Contribution to Structuring WEB Communities

Heyer, Gerhard; Quasthoff, Uwe; Wolff, Christian

doi:10.1007/3-540-48080-3_2

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2346)

Included in the following conference series:

International Workshop on Innovative Internet Community Systems

Abstract

This paper describes a corpus linguistic analysis of large text corpora based on collocations with the aim of extracting semantic relations from unstructured text. We regard this approach as a viable method for generating and structuring information about WEB communities. Starting from a short description of our corpora as well as our language analysis tools, we discuss in depth the automatic generation of collocation sets. We further give examples of different types of relations that may be found in collocation sets for arbitrary terms. We conclude with a brief discussion of applying our approach to the analysis of a sample community.
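The idea of a collocation set can be illustrated with a minimal sketch. It assumes sentence-level co-occurrence counting and uses pointwise mutual information (PMI) as the association score. PMI is a standard stand-in chosen here for illustration; the abstract does not state which significance measure the authors use. The function name `collocation_sets`, the `min_count` threshold, and the toy sentences are likewise hypothetical.

```python
import math
from collections import Counter
from itertools import combinations


def collocation_sets(sentences, min_count=2):
    """Map each word to its co-occurring partners, ranked by PMI.

    Assumption: two words co-occur if they appear in the same sentence.
    PMI is an illustrative measure, not necessarily the paper's.
    """
    n = len(sentences)
    word_freq = Counter()   # number of sentences containing each word
    pair_freq = Counter()   # number of sentences containing both words
    for sentence in sentences:
        words = sorted(set(sentence.lower().split()))
        word_freq.update(words)
        pair_freq.update(combinations(words, 2))

    result = {}
    for (a, b), k in pair_freq.items():
        if k < min_count:
            continue  # ignore pairs seen too rarely to be meaningful
        pmi = math.log2(k * n / (word_freq[a] * word_freq[b]))
        result.setdefault(a, []).append((b, pmi))
        result.setdefault(b, []).append((a, pmi))

    # Strongest associations first; ties broken alphabetically.
    for word in result:
        result[word].sort(key=lambda item: (-item[1], item[0]))
    return result


# Hypothetical toy corpus, one sentence per string.
sentences = [
    "leipzig university hosts the corpus",
    "the corpus covers leipzig newspapers",
    "the web community shares links",
    "the web community grows",
]
sets = collocation_sets(sentences)
print(sets["corpus"])
print(sets["web"])
```

In this toy corpus the function word "the" co-occurs with everything and therefore receives a lower score than content-word partners such as "leipzig" for "corpus". Filtering out such uninformative partners is the reason for using an association measure rather than raw counts.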

Author information

Authors and Affiliations

Computer Science Institute, Natural Language Processing Department, Leipzig University, Augustusplatz 10/11, D-04109, Leipzig
Gerhard Heyer, Uwe Quasthoff & Christian Wolff

Editor information

Editors and Affiliations

FB Informatik, Universität Rostock, 18051, Rostock, Germany
Herwig Unger
Institut für Mathematik, TU Ilmenau, Postfach 10 05 65, 98684, Ilmenau, Germany
Thomas Böhme
College of Arts and Sciences, Department of Computer Science, University of North Texas, 76203, Denton, TX, USA
Armin Mikler

Cite this paper

Heyer, G., Quasthoff, U., Wolff, C. (2002). Automatic Analysis of Large Text Corpora - A Contribution to Structuring WEB Communities. In: Unger, H., Böhme, T., Mikler, A. (eds) Innovative Internet Computing Systems. IICS 2002. Lecture Notes in Computer Science, vol 2346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48080-3_2

DOI: https://doi.org/10.1007/3-540-48080-3_2
Published: 18 September 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43790-1
Online ISBN: 978-3-540-48080-8
eBook Packages: Springer Book Archive
