Abstract
Large document repositories need to be organized, summarized, and labeled to be used effectively. Previous clustering studies focused on organization and paid little attention to producing cluster labels. Without informative labels, users must browse many documents to get a sense of what the clusters contain. Human labeling of clusters is not viable when clustering is performed on demand or for very few users. It is therefore desirable to generate informative cluster descriptions (CDs) automatically, both to give users a high-level sense of the clusters and to help repository managers produce the final cluster labels.
This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality, or fidelity, of CDs and how to construct high-quality CDs. We propose using a CD-based classification to simulate how CDs are interpreted, and using the F-score of that classification to measure CD quality. Since searching directly for good CDs using the F-score is too expensive, we consider a surrogate quality measure, the CDD measure, which combines three factors: coverage, disjointness, and diversity. We give a search strategy for constructing CDs, namely a layer-based replacement method called PagodaCD. Experimental results show that the algorithm is efficient and can produce high-quality CDs. CDs produced by PagodaCD also exhibit monotone quality behavior.
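To make the idea of scoring CDs by a CD-based classification concrete, the following is a minimal sketch in Python. It is not the paper's procedure: the abstract does not specify the classification rule, so the sketch assumes that a document is assigned to the cluster whose CD shares the most terms with it, and that quality is the macro-averaged F-score over clusters. The function names `classify` and `macro_f_score` and the toy data are hypothetical.

```python
def classify(doc_terms, cds):
    """Assign a document to the cluster whose CD shares the most terms with it.

    Assumed rule (not taken from the paper): largest term overlap wins,
    ties go to the alphabetically first cluster label, and a document
    that matches no CD term is left unassigned (None).
    """
    best, best_overlap = None, 0
    for label in sorted(cds):
        overlap = len(cds[label] & doc_terms)
        if overlap > best_overlap:
            best, best_overlap = label, overlap
    return best


def macro_f_score(docs, true_labels, cds):
    """Macro-averaged F-score of the CD-based classification above."""
    predicted = [classify(d, cds) for d in docs]
    pairs = list(zip(predicted, true_labels))
    scores = []
    for label in sorted(cds):
        tp = sum(1 for p, t in pairs if p == label and t == label)
        fp = sum(1 for p, t in pairs if p == label and t != label)
        fn = sum(1 for p, t in pairs if p != label and t == label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical): two clusters, each described by a two-term CD.
cds = {"sports": {"game", "team"}, "finance": {"stock", "bank"}}
docs = [{"team", "win", "game"}, {"stock", "market"}, {"bank", "game"}, {"weather"}]
true_labels = ["sports", "finance", "finance", "sports"]

score = macro_f_score(docs, true_labels, cds)
```

In this toy setting the last document contains no CD term and stays unassigned, which lowers recall for its cluster. A score of this kind needs one classification pass over all documents per candidate CD, which is consistent with the abstract's remark that searching with the F-score directly is too expensive.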
This work was partially supported by a grant from AFRL. Lijun Chen was also partially supported by a DAGSI scholarship.
Copyright information
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chen, L., Dong, G. (2006). Succinct and Informative Cluster Descriptions for Document Repositories. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_10
DOI: https://doi.org/10.1007/11775300_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35225-9
Online ISBN: 978-3-540-35226-6
eBook Packages: Computer Science, Computer Science (R0)