DOI: 10.1145/1031171.1031267
Article

Document clustering based on cluster validation

Published: 13 November 2004

Abstract

This paper presents a document clustering algorithm based on cluster validation that identifies both the important feature words and the true model order (the number of clusters). The important feature subset is selected by optimizing a cluster validity criterion subject to a constraint. To identify the model order, this feature selection procedure is run for each candidate number of clusters, and the feature subset and cluster number that together maximize the cluster validity criterion are chosen as the answer. We applied the algorithm to several datasets from the 20Newsgroup corpus. Experimental results show that it finds an important feature subset, estimates the model order, and yields higher micro-averaged precision than four other document clustering algorithms, all of which require the number of clusters to be provided.
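The search loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' method: the abstract names neither the validity criterion nor the constraint, so the mean silhouette width stands in for the criterion, a fixed subset size stands in for the constraint, and a small k-means with deterministic farthest-first seeding stands in for the clustering step. All function names and the toy term-count matrix `DOCS` are hypothetical.

```python
import itertools
import math


def farthest_first_centers(points, k):
    """Deterministic seeding: take the first point, then keep adding the
    point that is farthest from all centers chosen so far."""
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    return centers


def kmeans(points, k, iters=20):
    """Plain k-means on tuples of floats; returns one cluster label per point."""
    centers = farthest_first_centers(points, k)
    labels = []
    for _ in range(iters):
        labels = [min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points]
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old center if a cluster empties out
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def mean_silhouette(points, labels):
    """Stand-in validity criterion (an assumption): mean silhouette width.
    Points in singleton clusters contribute 0."""
    clusters = sorted(set(labels))
    if len(clusters) < 2:
        return -1.0
    total = 0.0
    for i, p in enumerate(points):
        def mean_dist(c):
            d = [math.dist(p, q) for j, q in enumerate(points) if labels[j] == c and j != i]
            return sum(d) / len(d) if d else None
        a = mean_dist(labels[i])
        if a is None:
            continue
        b = min(mean_dist(c) for c in clusters if c != labels[i])
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)


def select_features_and_k(docs, subset_size, k_values):
    """For each candidate cluster number, score every feature subset of the
    given size, and return the (score, k, subset) with the highest score."""
    best = (-math.inf, None, None)
    for k in k_values:
        for subset in itertools.combinations(range(len(docs[0])), subset_size):
            points = [tuple(float(doc[j]) for j in subset) for doc in docs]
            score = mean_silhouette(points, kmeans(points, k))
            if score > best[0]:
                best = (score, k, subset)
    return best


# Hypothetical toy term-count matrix: rows are documents, columns are words.
# Columns 0 and 1 separate three topics; column 2 is uninformative noise.
DOCS = [
    (0, 0, 0), (0, 1, 6), (1, 0, 12),      # topic A
    (10, 0, 3), (10, 1, 9), (11, 0, 15),   # topic B
    (0, 10, 3), (1, 10, 9), (0, 11, 15),   # topic C
]

score, k, subset = select_features_and_k(DOCS, subset_size=2, k_values=range(2, 5))
print(f"k={k}, features={subset}, validity={score:.3f}")
```

On this toy matrix the search settles on three clusters and the two informative columns. Exhaustive enumeration of subsets is only feasible for a handful of features; a real vocabulary would need a heuristic search, and a criterion such as the silhouette is not automatically comparable across feature subsets of different sizes, which is one reason the subset size is held fixed here.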


Published In

CIKM '04: Proceedings of the thirteenth ACM international conference on Information and knowledge management
November 2004
678 pages
ISBN:1581138741
DOI:10.1145/1031171
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. cluster number estimation
  2. cluster validation
  3. document clustering
  4. feature selection

Qualifiers

  • Article

Conference

CIKM04: Conference on Information and Knowledge Management
November 8 - 13, 2004
Washington, D.C., USA

Acceptance Rates

Overall Acceptance Rate 1,861 of 8,427 submissions, 22%

Cited By

  • (2023) Co-Designing for Transparency: Lessons from Building a Document Organization Tool in the Criminal Justice Domain. Proceedings of the 2023 ACM Conference on Fairness, Accountability, and Transparency, pp. 1463-1478. DOI: 10.1145/3593013.3594093. Online publication date: 12-Jun-2023
  • (2018) CCDLC Detection Framework-Combining Clustering with Deep Learning Classification for Semantic Clones. 2018 17th IEEE International Conference on Machine Learning and Applications (ICMLA), pp. 701-706. DOI: 10.1109/ICMLA.2018.00111. Online publication date: Dec-2018
  • (2018) Exploring overlapping clusters using dynamic re-scaling and sampling. Knowledge and Information Systems 10:3, pp. 295-313. DOI: 10.1007/s10115-006-0005-y. Online publication date: 29-Dec-2018
  • (2017) Multitask fuzzy Bregman co-clustering approach for clustering data with multisource features. Neurocomputing 247:C, pp. 102-114. DOI: 10.1016/j.neucom.2017.03.062. Online publication date: 19-Jul-2017
  • (2014) Exploiting named entities for bilingual news clustering. Journal of the Association for Information Science and Technology 66:2, pp. 363-376. DOI: 10.1002/asi.23175. Online publication date: 6-May-2014
  • (2012) Hybrid Cluster Validation Techniques. Advances in Computer Science, Engineering & Applications, pp. 267-273. DOI: 10.1007/978-3-642-30111-7_25. Online publication date: 2012
  • (2006) Chinese multi-document summarization using adaptive clustering and global search strategy. Proceedings of the 9th Pacific Rim international conference on Artificial intelligence, pp. 1135-1139. DOI: 10.5555/1757898.1758063. Online publication date: 7-Aug-2006
  • (2006) Document re-ranking using cluster validation and label propagation. Proceedings of the 15th ACM international conference on Information and knowledge management, pp. 690-697. DOI: 10.1145/1183614.1183713. Online publication date: 6-Nov-2006
  • (2006) Chinese Multi-document Summarization Using Adaptive Clustering and Global Search Strategy. PRICAI 2006: Trends in Artificial Intelligence, pp. 1135-1139. DOI: 10.1007/978-3-540-36668-3_148. Online publication date: 2006
  • (2006) Multi-document summarization using a clustering-based hybrid strategy. Proceedings of the Third Asia conference on Information Retrieval Technology, pp. 608-614. DOI: 10.1007/11880592_53. Online publication date: 16-Oct-2006
