Multilingual Document Clustering, Topic Extraction and Data Transformations

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2258)

Abstract

This paper describes a statistics-based approach to clustering documents and extracting cluster topics. Relevant Expressions (REs) are extracted from corpora and used as the base features for clustering. These features are transformed and then reduced, by an approach based on Principal Components Analysis, to a small set of document classification features. The best number of clusters is found by Model-Based Clustering Analysis. Data transformations that bring the features closer to a normal distribution are applied, and the results are discussed. The most important REs are extracted from each cluster and taken as the cluster topics.
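
The abstract does not say which transformation is used to approximate normality. As a minimal sketch, assuming a Box-Cox power transform with the parameter chosen by a coarse grid search over the profile log-likelihood, the step could look like the following. The function names (`box_cox`, `profile_log_likelihood`, `best_lambda`), the grid, and the sample values are all hypothetical and are not taken from the paper.

```python
import math


def box_cox(values, lam):
    """Apply the Box-Cox power transform to strictly positive values."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        # The lam -> 0 limit of (v**lam - 1) / lam is the natural log.
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def profile_log_likelihood(values, lam):
    """Log-likelihood of lam (up to a constant) under a normal model
    for the transformed data."""
    n = len(values)
    y = box_cox(values, lam)
    mean = sum(y) / n
    var = sum((t - mean) ** 2 for t in y) / n  # maximum-likelihood variance
    if var == 0.0:
        raise ValueError("values must not all be identical")
    # The second term is the log-Jacobian of the transform.
    log_jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var) + log_jacobian


def best_lambda(values, grid=None):
    """Pick the lam on a coarse grid that maximises the profile log-likelihood."""
    if grid is None:
        grid = [i / 10.0 for i in range(-20, 21)]  # -2.0 .. 2.0, step 0.1
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


# Hypothetical, right-skewed feature values (illustrative only).
feature = [1.0, 1.0, 2.0, 2.0, 3.0, 4.0, 6.0, 9.0, 15.0, 40.0]
lam = best_lambda(feature)
transformed = box_cox(feature, lam)
```

A grid search keeps the sketch dependency-free. In practice a numerical optimiser would usually replace it, and features that can be zero or negative would need a shift or a different transform before this one applies.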

Copyright information

© 2001 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Silva, J., Mexia, J., Coelho, C.A., Lopes, G. (2001). Multilingual Document Clustering, Topic Extraction and Data Transformations. In: Brazdil, P., Jorge, A. (eds) Progress in Artificial Intelligence. EPIA 2001. Lecture Notes in Computer Science, vol 2258. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45329-6_11

  • DOI: https://doi.org/10.1007/3-540-45329-6_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-43030-8

  • Online ISBN: 978-3-540-45329-1

  • eBook Packages: Springer Book Archive
