Abstract
This paper describes a statistics-based approach to clustering documents and extracting cluster topics. Relevant Expressions (REs) are extracted from corpora and used as base features for clustering. These features are transformed and, using an approach based on Principal Components Analysis, reduced to a small set of document classification features. The best number of clusters is found by Model-Based Clustering Analysis. Data transformations that bring the features closer to a normal distribution are applied, and the results are discussed. The most important REs are extracted from each cluster and taken as cluster topics.
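As an illustration of the kind of normalizing data transformation the abstract mentions, the following minimal sketch assumes a Box-Cox power transform, with the exponent chosen by maximizing the profile log-likelihood over a small grid. The abstract does not say which transformation family is used, so this choice, the grid, and the function names `box_cox`, `profile_log_likelihood` and `best_lambda` are assumptions made for illustration only.

```python
import math


def box_cox(values, lam):
    """Box-Cox power transform of strictly positive values (assumed family)."""
    if any(v <= 0 for v in values):
        raise ValueError("Box-Cox requires strictly positive values")
    if abs(lam) < 1e-12:
        # Limit of (x**lam - 1) / lam as lam -> 0 is log(x).
        return [math.log(v) for v in values]
    return [(v ** lam - 1.0) / lam for v in values]


def profile_log_likelihood(values, lam):
    """Profile log-likelihood of lam under a normal model for the transformed data."""
    y = box_cox(values, lam)
    n = len(y)
    mean = sum(y) / n
    var = sum((t - mean) ** 2 for t in y) / n  # maximum-likelihood variance
    log_jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var) + log_jacobian


def best_lambda(values, grid=None):
    """Pick the grid value of lam with the highest profile log-likelihood."""
    if grid is None:
        # Hypothetical search grid: -2.0, -1.9, ..., 2.0
        grid = [i / 10.0 for i in range(-20, 21)]
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


# Toy feature values whose logarithms are symmetric around zero.
sample = [math.exp(z) for z in (-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0)]
chosen = best_lambda(sample)
transformed = box_cox(sample, chosen)
```

On this toy sample the logarithm already makes the values symmetric, so the search settles on an exponent of zero, which corresponds to the log transform. Real feature columns would each need their own exponent and would have to be shifted to be strictly positive first.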
Copyright information
© 2001 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Silva, J., Mexia, J., Coelho, C.A., Lopes, G. (2001). Multilingual Document Clustering, Topic Extraction and Data Transformations. In: Brazdil, P., Jorge, A. (eds) Progress in Artificial Intelligence. EPIA 2001. Lecture Notes in Computer Science, vol 2258. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45329-6_11
DOI: https://doi.org/10.1007/3-540-45329-6_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43030-8
Online ISBN: 978-3-540-45329-1
eBook Packages: Springer Book Archive