Succinct and Informative Cluster Descriptions for Document Repositories

  • Conference paper
Advances in Web-Age Information Management (WAIM 2006)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4016)

Abstract

Large document repositories need to be organized, summarized, and labeled in order to be used effectively. Previous clustering studies focused on organizing documents and paid little attention to producing cluster labels. Without informative labels, users must browse many documents to get a sense of what the clusters contain. Human labeling of clusters is not viable when clustering is performed on demand or for very few users. It is therefore desirable to generate informative cluster descriptions (CDs) automatically, both to give users a high-level sense of the clusters and to help repository managers produce the final cluster labels.

This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality or fidelity of CDs and how to construct high-quality CDs. We propose using a CD-based classification to simulate how CDs are interpreted, and using the F-score of that classification to measure CD quality. Since directly searching for good CDs using the F-score is too expensive, we consider a surrogate quality measure, the CDD measure, which combines three factors: coverage, disjointness, and diversity. We give a search strategy for constructing CDs, namely a layer-based replacement method called PagodaCD. Experimental results show that the algorithm is efficient and can produce high-quality CDs. CDs produced by PagodaCD also exhibit a monotone quality behavior.
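To make the idea of scoring CDs by a CD-based classification concrete, the following is a minimal sketch and not the paper's actual procedure. It assumes that a document is assigned to the cluster whose CD shares the most terms with it, and that quality is reported as the macro-averaged F-score over clusters. The assignment rule, the averaging, and the names `classify` and `macro_f_score` are all illustrative assumptions; the abstract does not specify them.

```python
def classify(doc_terms, cds):
    """Assign a document (a set of terms) to the cluster whose CD shares
    the most terms with it. Assumed rule, not taken from the paper.
    Returns None when no CD term occurs in the document; ties go to the
    cluster listed first in `cds`."""
    best, best_overlap = None, 0
    for cluster, cd in cds.items():
        overlap = len(cd & doc_terms)
        if overlap > best_overlap:
            best, best_overlap = cluster, overlap
    return best


def macro_f_score(docs, labels, cds):
    """Macro-averaged F-score of the CD-based classification against the
    true cluster labels (the averaging scheme is an assumption)."""
    predicted = [classify(d, cds) for d in docs]
    scores = []
    for cluster in cds:
        pairs = list(zip(predicted, labels))
        tp = sum(1 for p, t in pairs if p == cluster and t == cluster)
        fp = sum(1 for p, t in pairs if p == cluster and t != cluster)
        fn = sum(1 for p, t in pairs if p != cluster and t == cluster)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: two clusters, each described by a two-term CD.
cds = {"sports": {"game", "team"}, "finance": {"stock", "market"}}
docs = [
    {"team", "won", "game"},
    {"stock", "fell"},
    {"market", "game", "team"},
    {"weather", "rain"},
]
labels = ["sports", "finance", "finance", "sports"]
print(macro_f_score(docs, labels, cds))
```

Under these assumptions, a CD scores well when its terms occur in most documents of its own cluster and rarely in documents of other clusters, which is loosely the intuition behind the coverage and disjointness factors named above.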

This work was partially supported by a grant from AFRL. Lijun Chen was also partially supported by a DAGSI scholarship.

Copyright information

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Chen, L., Dong, G. (2006). Succinct and Informative Cluster Descriptions for Document Repositories. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_10

  • DOI: https://doi.org/10.1007/11775300_10

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-35225-9

  • Online ISBN: 978-3-540-35226-6

  • eBook Packages: Computer Science, Computer Science (R0)
