Abstract
A major problem for text clustering practitioners is the difficulty of determining a priori which text representation and clustering technique are optimal for a given clustering problem. As a step towards building robust document partitioning systems, we present a strategy based on a hierarchical consensus clustering architecture that operates on a wide diversity of document representations and partitions. The experiments show that the proposed method yields a consensus clustering comparable to the best individual clustering available, even in the presence of a large number of poor individual labelings, and that it outperforms classic non-hierarchical consensus approaches in both performance and computational cost.
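The abstract does not say which consensus function the architecture uses or how partitions are grouped at each level, so the following is only a minimal sketch of the general idea of combining partitions stage by stage. It assumes a simple majority-vote co-association rule as a stand-in consensus function; the names `consensus`, `hierarchical_consensus`, and `group_size` are hypothetical and do not come from the paper.

```python
from itertools import combinations


def consensus(partitions):
    """Majority-vote co-association consensus (illustrative stand-in only).

    Two items are linked when they share a cluster in more than half of the
    input partitions; consensus clusters are the connected components of
    those links.
    """
    n = len(partitions[0])
    parent = list(range(n))  # union-find forest over item indices

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for i, j in combinations(range(n), 2):
        votes = sum(1 for p in partitions if p[i] == p[j])
        if 2 * votes > len(partitions):
            parent[find(i)] = find(j)

    # Relabel component roots as 0, 1, 2, ... in order of first appearance.
    labels = {}
    return [labels.setdefault(find(i), len(labels)) for i in range(n)]


def hierarchical_consensus(partitions, group_size=3):
    """Combine partitions in small groups, stage by stage, until one remains."""
    if group_size < 2:
        raise ValueError("group_size must be at least 2")
    current = [list(p) for p in partitions]
    while len(current) > 1:
        current = [
            consensus(current[k:k + group_size])
            for k in range(0, len(current), group_size)
        ]
    return current[0]


# Six labelings of six documents; two of them are deliberately poor.
partitions = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],  # same grouping, different label names
    [0, 0, 1, 1, 1, 1],  # poor: one document misplaced
    [2, 2, 2, 5, 5, 5],
    [0, 1, 0, 1, 0, 1],  # poor: alternating labels
    [0, 0, 0, 1, 1, 1],
]
result = hierarchical_consensus(partitions, group_size=3)
print(result)
```

Each stage in this sketch only compares the few partitions in its own group, which is one plausible reason a hierarchical scheme could cost less than combining every partition in a single step. The paper's actual architecture and consensus functions may differ.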
Copyright information
© 2007 Springer Berlin Heidelberg
About this paper
Cite this paper
Sevillano, X., Cobo, G., Alías, F., Socoró, J.C. (2007). A Hierarchical Consensus Architecture for Robust Document Clustering. In: Amati, G., Carpineto, C., Romano, G. (eds) Advances in Information Retrieval. ECIR 2007. Lecture Notes in Computer Science, vol 4425. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-71496-5_82
DOI: https://doi.org/10.1007/978-3-540-71496-5_82
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-71494-1
Online ISBN: 978-3-540-71496-5
eBook Packages: Computer Science (R0)