Automatic Analysis of Large Text Corpora - A Contribution to Structuring WEB Communities

  • Conference paper
Innovative Internet Computing Systems (IICS 2002)

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2346)

Abstract

This paper describes a corpus-linguistic analysis of large text corpora that uses collocations to extract semantic relations from unstructured text. We regard this approach as a viable method for generating and structuring information about WEB communities. Starting from a short description of our corpora and our language analysis tools, we discuss in depth the automatic generation of collocation sets. We then give examples of the different types of relations that may be found in collocation sets for arbitrary terms. We conclude with a brief discussion of applying our approach to the analysis of a sample community.

Copyright information

© 2002 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Heyer, G., Quasthoff, U., Wolff, C. (2002). Automatic Analysis of Large Text Corpora - A Contribution to Structuring WEB Communities. In: Unger, H., Böhme, T., Mikler, A. (eds) Innovative Internet Computing Systems. IICS 2002. Lecture Notes in Computer Science, vol 2346. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-48080-3_2

  • DOI: https://doi.org/10.1007/3-540-48080-3_2

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43790-1

  • Online ISBN: 978-3-540-48080-8

  • eBook Packages: Springer Book Archive
