Abstract
In this paper, we describe a framework for clustering documents according to their mixtures of topics. The proposed framework combines the expressiveness of generative models for document representation with a properly chosen information-theoretic distance measure to group the documents via an agglomerative hierarchical clustering scheme. The clustering solution obtained at each level of the dendrogram organizes the documents into sets of topics, yet it is produced without the effort required by a soft/fuzzy clustering method. Experimental results obtained on large, real-world collections of documents demonstrate the effectiveness of our approach in detecting non-overlapping clusters that contain documents sharing similar mixtures of topics.
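The pipeline summarized in the abstract (topic mixtures as document representations, a distance between distributions, agglomerative merging into non-overlapping clusters) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the topic mixtures are hypothetical hand-written values rather than the output of a fitted generative model, the Hellinger distance stands in for the unnamed "properly chosen" measure, average linkage is one of several possible merge criteria, and the function names are invented.

```python
import math
from itertools import combinations


def hellinger(p, q):
    """Hellinger distance between two discrete distributions (range [0, 1])."""
    s = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(s / 2.0)


def average_linkage(c1, c2, docs):
    """Mean pairwise distance between the documents of two clusters."""
    total = sum(hellinger(docs[i], docs[j]) for i in c1 for j in c2)
    return total / (len(c1) * len(c2))


def agglomerate(docs, k):
    """Merge the closest pair of clusters until only k clusters remain."""
    clusters = [[i] for i in range(len(docs))]  # start from singletons
    while len(clusters) > k:
        # Pick the pair of cluster indices (a < b) with the smallest linkage.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: average_linkage(clusters[ab[0]], clusters[ab[1]], docs),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # safe because b > a
    return [sorted(c) for c in clusters]


# Hypothetical topic mixtures over 3 topics (each row sums to 1).
topic_mixtures = [
    [0.80, 0.15, 0.05],
    [0.75, 0.20, 0.05],
    [0.10, 0.10, 0.80],
    [0.05, 0.15, 0.80],
    [0.10, 0.80, 0.10],
]

clusters = agglomerate(topic_mixtures, k=3)
print(clusters)  # [[0, 1], [2, 3], [4]]
```

Each document ends up in exactly one cluster, which is the "hard" aspect of the approach; stopping the merge loop at a different `k` corresponds to cutting the dendrogram at a different level. This naive loop recomputes all linkages at every step and is only meant for small inputs.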
© 2009 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Ponti, G., Tagarelli, A. (2009). Topic-Based Hard Clustering of Documents Using Generative Models. In: Rauch, J., Raś, Z.W., Berka, P., Elomaa, T. (eds) Foundations of Intelligent Systems. ISMIS 2009. Lecture Notes in Computer Science, vol 5722. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-04125-9_26
DOI: https://doi.org/10.1007/978-3-642-04125-9_26
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-04124-2
Online ISBN: 978-3-642-04125-9
eBook Packages: Computer Science, Computer Science (R0)