Abstract
The general goal of clustering is to group data elements such that the intra-group similarities are high and the inter-group similarities are low. We present a clustering algorithm called CBC (Clustering By Committee) that produces higher-quality clusters in document clustering tasks than several well-known clustering algorithms. It first discovers a set of tight clusters (high intra-group similarity), called committees, that are well scattered in the similarity space (low inter-group similarity). The union of the committees is only a subset of all elements. The algorithm then proceeds by assigning each element to its most similar committee. Evaluating cluster quality has always been a difficult task. We present a new evaluation methodology based on the editing distance between output clusters and manually constructed classes (the answer key). This evaluation measure is more intuitive and easier to interpret than previous evaluation measures.
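The assignment step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how similarity to a committee is measured, so the sketch assumes each committee is represented by the centroid of its members and that similarity is cosine similarity. The committees are taken as given (their discovery is not shown), and the names `cosine`, `centroid`, and `assign_to_committees` are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed similarity measure)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def centroid(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def assign_to_committees(elements, committees):
    """Return, for each element, the index of its most similar committee.

    elements   -- list of feature vectors, one per document
    committees -- list of lists of element indices (assumed already discovered)
    """
    # Assumption: a committee is summarised by the centroid of its members.
    centroids = [centroid([elements[i] for i in members]) for members in committees]
    assignment = []
    for vec in elements:
        sims = [cosine(vec, c) for c in centroids]
        assignment.append(max(range(len(sims)), key=sims.__getitem__))
    return assignment


# Toy data: six documents in a three-term space, two committees.
docs = [
    [1.0, 0.9, 0.0],
    [0.9, 1.0, 0.1],
    [0.0, 0.1, 1.0],
    [0.1, 0.0, 0.9],
    [0.8, 0.7, 0.2],  # not a committee member
    [0.2, 0.1, 0.7],  # not a committee member
]
committees = [[0, 1], [2, 3]]
labels = assign_to_committees(docs, committees)
```

In this toy setup the committees cover only four of the six documents, which mirrors the abstract's point that the union of the committees is a subset of all elements; the remaining documents are still placed by their similarity to the committee centroids.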
Copyright information
© 2002 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Pantel, P., Lin, D. (2002). Efficiently Clustering Documents with Committees. In: Ishizuka, M., Sattar, A. (eds) PRICAI 2002: Trends in Artificial Intelligence. PRICAI 2002. Lecture Notes in Computer Science, vol 2417. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-45683-X_46
DOI: https://doi.org/10.1007/3-540-45683-X_46
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-44038-3
Online ISBN: 978-3-540-45683-4
eBook Packages: Springer Book Archive