ABSTRACT
Traditional document clustering techniques group similar documents without any user interaction. Although such methods minimize user effort, the clusters they generate often do not match users' conception of the document collection. In this paper we describe a new framework, and experiments with it, that explore how clustering can be improved by adding user supervision at the level of selecting the features used to distinguish between documents. Our features are based on the words that appear in documents (see §4.1 for details). We conjecture that user input at the feature level can produce clusters that better match user expectations. To test this conjecture, we propose a novel iterative framework in which users interactively select the features used to cluster documents. Unlike existing semi-supervised clustering, which asks users to label constraints between documents, this framework interactively asks users to label features. The proposed method ranks all features based on the most recent clusters using cluster-based feature selection and presents a list of highly ranked features to users for labeling. The feature set for the next clustering iteration includes both the features accepted by users and other highly ranked features. Experimental results on several real datasets demonstrate that the feature set obtained with the new interactive framework produces clusters that better match the user's expectations. Moreover, we quantify and evaluate the effect of reweighting previously accepted features and of user effort.
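The iterative loop described in the abstract (cluster, rank features against the current clusters, ask the user to label the top-ranked ones, rebuild the feature set, repeat) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not name the clustering algorithm or the ranking score, so plain k-means and a simple "document frequency inside the cluster minus outside" score stand in for them. The function names, the `oracle` callback that simulates a user, and the parameters `show_n`, `keep_m`, and `boost` are all hypothetical.

```python
from collections import Counter

def vectorize(docs, features, weights):
    """Weighted term-count vectors restricted to the current feature set."""
    vecs = []
    for doc in docs:
        counts = Counter(doc)
        vecs.append([counts[f] * weights.get(f, 1.0) for f in features])
    return vecs

def kmeans(vecs, k, iters=20):
    """Plain k-means (an assumed stand-in), started from evenly spaced vectors."""
    centroids = [list(vecs[i * len(vecs) // k]) for i in range(k)]
    labels = [0] * len(vecs)
    for _ in range(iters):
        labels = [
            min(range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])))
            for v in vecs
        ]
        for c in range(k):
            members = [v for v, lab in zip(vecs, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def rank_features(docs, labels, k):
    """Assumed cluster-based score: best (in-cluster minus out-of-cluster)
    document frequency over all clusters; higher means more discriminative."""
    vocab = sorted({w for d in docs for w in d})
    doc_sets = [set(d) for d in docs]
    scores = {}
    for w in vocab:
        best = 0.0
        for c in range(k):
            inside = [s for s, lab in zip(doc_sets, labels) if lab == c]
            outside = [s for s, lab in zip(doc_sets, labels) if lab != c]
            if inside and outside:
                p_in = sum(w in s for s in inside) / len(inside)
                p_out = sum(w in s for s in outside) / len(outside)
                best = max(best, p_in - p_out)
        scores[w] = best
    return sorted(vocab, key=lambda w: (-scores[w], w))

def interactive_clustering(docs, k, oracle, rounds=3, show_n=4, keep_m=6, boost=2.0):
    """Each round: cluster, rank, show unlabeled top features, update feature set."""
    features = sorted({w for d in docs for w in d})
    accepted, weights = set(), {}
    for _ in range(rounds):
        labels = kmeans(vectorize(docs, features, weights), k)
        ranked = rank_features(docs, labels, k)
        shown = [w for w in ranked if w not in accepted][:show_n]
        accepted |= {w for w in shown if oracle(w)}
        # Next feature set: user-accepted features plus other highly ranked ones.
        features = sorted(accepted | set(ranked[:keep_m]))
        # Reweight previously accepted features (boost factor is an assumption).
        weights = {w: boost for w in accepted}
    return kmeans(vectorize(docs, features, weights), k), accepted

# Toy usage: six tokenized documents and a simulated user.
docs = [
    "goal match team score".split(),
    "team goal coach match".split(),
    "match score team fans".split(),
    "stock market price trade".split(),
    "market trade stock bank".split(),
    "price bank market stock".split(),
]
oracle = lambda w: w in {"goal", "match", "team", "stock", "market", "trade"}
labels, accepted = interactive_clustering(docs, k=2, oracle=oracle)
```

In this toy run the simulated user only ever sees `show_n` features per round, which is one way to model the user-effort budget the abstract mentions; raising `rounds` or `show_n` trades more labeling effort for a larger accepted set.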
Index Terms
- Interactive feature selection for document clustering