Multilingual Document Clustering, Topic Extraction and Data Transformations

Silva, Joaquim; Mexia, João; Coelho, Carlos A.; Lopes, Gabriel

doi:10.1007/3-540-45329-6_11

Conference paper
First Online: 01 January 2002

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 2258)

Abstract

This paper describes a statistics-based approach to clustering documents and extracting cluster topics. Relevant Expressions (REs) are extracted from corpora and used as base features for clustering. These features are transformed and then, using an approach based on Principal Components Analysis, reduced to a small set of document classification features. The best number of clusters is found by Model-Based Clustering Analysis. Data transformations that bring the features closer to a normal distribution are applied, and the results are discussed. The most important REs are extracted from each cluster and taken as cluster topics.
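
The abstract does not say which transformation is used to bring the features closer to a normal distribution. As a minimal sketch only, assuming a Box-Cox power transform with its parameter chosen by a grid search over the profile log-likelihood, the idea could look like the following. The function names, the grid, and the sample values are all hypothetical and are not taken from the paper.

```python
import math


def box_cox(x, lam):
    """Box-Cox power transform of a strictly positive value x."""
    if abs(lam) < 1e-12:
        return math.log(x)
    return (x ** lam - 1.0) / lam


def profile_log_likelihood(values, lam):
    """Profile log-likelihood of lam, assuming the transformed data are normal."""
    n = len(values)
    y = [box_cox(v, lam) for v in values]
    mean = sum(y) / n
    var = sum((t - mean) ** 2 for t in y) / n
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var) + jacobian


def best_lambda(values, grid=None):
    """Pick the grid point with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 10.0 for i in range(-20, 21)]  # hypothetical grid: -2.0 .. 2.0
    return max(grid, key=lambda lam: profile_log_likelihood(values, lam))


def skewness(values):
    """Sample skewness (third standardized moment); 0 for symmetric data."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical right-skewed feature values (illustrative, not data from the paper).
values = [math.exp(z / 2.0) for z in range(-4, 5)]

lam = best_lambda(values)
transformed = [box_cox(v, lam) for v in values]

print("chosen lambda:", lam)
print("skewness before:", skewness(values))
print("skewness after:", skewness(transformed))
```

A grid search is used here for simplicity; a numerical optimizer over the same log-likelihood would serve equally well. The transform requires strictly positive inputs, so features containing zeros or negative values would need to be shifted first.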

Author information

Authors and Affiliations

DI / FCT/ Universidade Nova de Lisboa, Quinta da Torre, 2725, Monte da Caparica, Portugal
Joaquim Silva & Gabriel Lopes
DM/ FCT/ Universidade Nova de Lisboa, Quinta da Torre, 2725, Monte da Caparica, Portugal
João Mexia
DM/ ISA / Universidade Técnica de Lisboa, Tapada da Ajuda, Portugal
Carlos A. Coelho

Editor information

Editors and Affiliations

Faculty of Economics LIACC, Laboratório de Inteligência Artificial e Ciência de Computadores, University of Porto, Rua do Campo Alegre, 823, 4150-180, Porto, Portugal
Pavel Brazdil & Alípio Jorge

About this paper

Cite this paper

Silva, J., Mexia, J., Coelho, C.A., Lopes, G. (2001). Multilingual Document Clustering, Topic Extraction and Data Transformations. In: Brazdil, P., Jorge, A. (eds) Progress in Artificial Intelligence. EPIA 2001. Lecture Notes in Computer Science, vol 2258. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45329-6_11

DOI: https://doi.org/10.1007/3-540-45329-6_11
Published: 23 April 2002
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-43030-8
Online ISBN: 978-3-540-45329-1
eBook Packages: Springer Book Archive
