Succinct and Informative Cluster Descriptions for Document Repositories

Chen, Lijun; Dong, Guozhu

doi:10.1007/11775300_10

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 4016)

Included in the following conference series:

International Conference on Web-Age Information Management

Abstract

Large document repositories need to be organized, summarized and labeled in order to be used effectively. Previous clustering studies focused on organizing, and paid little attention to producing cluster labels. Without informative labels, users need to browse many documents to get a sense of what the clusters contain. Human labeling of clusters is not viable when clustering is performed on demand or for very few users. It is desirable to automatically generate informative cluster descriptions (CDs), in order to give users a high-level sense of the clusters, and to help repository managers produce the final cluster labels.

This paper studies CDs in the form of small term sets for document clusters, and investigates how to measure the quality or fidelity of CDs and how to construct high-quality CDs. We propose using a CD-based classification to simulate how CDs are interpreted, and using the F-score of that classification to measure CD quality. Since searching directly for good CDs using the F-score is too expensive, we consider a surrogate quality measure, the CDD measure, which combines three factors: coverage, disjointness, and diversity. We give a search strategy for constructing CDs, namely a layer-based replacement method called PagodaCD. Experimental results show that the algorithm is efficient and can produce high-quality CDs. CDs produced by PagodaCD also exhibit a monotone quality behavior.
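To make the idea of scoring a CD by the F-score of a CD-based classification concrete, here is a minimal Python sketch. It is an illustration only, not the paper's method: the abstract does not define the classifier, the way F-scores are aggregated, or the CDD formula, so the overlap-based `classify` rule, the macro-averaged `macro_f_score`, the simple `coverage` ratio, and the toy data are all assumptions introduced here.

```python
from typing import Dict, List, Optional, Set


def classify(doc_terms: Set[str], cds: Dict[str, Set[str]]) -> Optional[str]:
    """Assumed rule: pick the cluster whose CD shares the most terms with the document.

    Returns None when the document matches no CD term at all.
    """
    best, best_overlap = None, 0
    for cluster in sorted(cds):  # sorted for deterministic tie-breaking
        overlap = len(doc_terms & cds[cluster])
        if overlap > best_overlap:
            best, best_overlap = cluster, overlap
    return best


def macro_f_score(docs: List[Set[str]], labels: List[str],
                  cds: Dict[str, Set[str]]) -> float:
    """Average the per-cluster F-scores of the CD-based classification (assumed aggregation)."""
    predicted = [classify(d, cds) for d in docs]
    scores = []
    for cluster in cds:
        tp = sum(1 for p, t in zip(predicted, labels) if p == cluster and t == cluster)
        fp = sum(1 for p, t in zip(predicted, labels) if p == cluster and t != cluster)
        fn = sum(1 for p, t in zip(predicted, labels) if p != cluster and t == cluster)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f)
    return sum(scores) / len(scores)


def coverage(docs: List[Set[str]], labels: List[str],
             cluster: str, cd: Set[str]) -> float:
    """Assumed coverage: fraction of a cluster's documents containing at least one CD term."""
    members = [d for d, t in zip(docs, labels) if t == cluster]
    return sum(1 for d in members if d & cd) / len(members) if members else 0.0


# Toy data (hypothetical): four documents as term sets, two clusters.
docs = [{"stock", "market", "trade"}, {"market", "price"},
        {"goal", "match"}, {"team", "coach"}]
labels = ["finance", "finance", "sports", "sports"]
cds = {"finance": {"market", "stock"}, "sports": {"goal", "match"}}

print(macro_f_score(docs, labels, cds))
print(coverage(docs, labels, "sports", cds["sports"]))
```

In this toy setting the "sports" CD misses one of its two documents, which lowers both its coverage and its recall. That is the kind of link between a cheap surrogate factor and the classification-based score that the abstract alludes to.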

This work was partially supported by a grant from AFRL. Lijun Chen was also partially supported by a DAGSI scholarship.

Author information

Authors and Affiliations

Wright State University, Dayton, OH, 45435, USA
Lijun Chen & Guozhu Dong

Editor information

Editors and Affiliations

Chinese University of Hong Kong, Hong Kong, China
Jeffrey Xu Yu
Institute of Industrial Science, The University of Tokyo, 4-6-1 Komaba, Meguro-ku, 153-8505, Tokyo, Japan
Masaru Kitsuregawa
Department of Computing, Hong Kong Polytechnic University, Hong Kong
Hong Va Leong

About this paper

Cite this paper

Chen, L., Dong, G. (2006). Succinct and Informative Cluster Descriptions for Document Repositories. In: Yu, J.X., Kitsuregawa, M., Leong, H.V. (eds) Advances in Web-Age Information Management. WAIM 2006. Lecture Notes in Computer Science, vol 4016. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11775300_10

DOI: https://doi.org/10.1007/11775300_10
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-35225-9
Online ISBN: 978-3-540-35226-6
eBook Packages: Computer Science, Computer Science (R0)
