DOI: 10.1145/1982185.1982436
research-article

Interactive feature selection for document clustering

Published: 21 March 2011

ABSTRACT

Traditional document clustering techniques group similar documents without any user interaction. Although such methods minimize user effort, the clusters they generate often do not match users' conception of the document collection. In this paper we describe a new framework, and experiments with it, exploring how clustering might be improved by including user supervision at the level of selecting the features that are used to distinguish between documents. Our features are based on the words that appear in documents (see §4.1 for details). We conjecture that clusters better matching user expectations can be generated with user input at the feature level. To verify this conjecture, we propose a novel iterative framework in which users interactively select the features used to cluster documents. Unlike existing semi-supervised clustering, which asks users to label constraints between documents, this framework asks users to label features. The proposed method ranks all features based on the most recent clusters using cluster-based feature selection and presents a list of highly ranked features to users for labeling. The feature set for the next clustering iteration includes both the features accepted by users and other highly ranked features. Experimental results on several real datasets demonstrate that the feature set obtained using the new interactive framework can produce clusters that better match the user's expectations. Moreover, we quantify and evaluate the effect of reweighting previously accepted features and of user effort.
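The iterative loop described in the abstract (cluster, rank features from the latest clusters, ask the user to label the top-ranked ones, rebuild the feature set, recluster) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not specify the clustering algorithm, the ranking score, or the reweighting scheme, so the k-means routine, the document-frequency contrast score, the `oracle` callback standing in for the user, and the `shown`, `kept`, and `boost` parameters are all hypothetical.

```python
import math
from collections import Counter

def vectorize(doc, features, weights):
    """Term-frequency vector over the current features, scaled by per-feature weight."""
    counts = Counter(doc)
    return [counts[f] * weights.get(f, 1.0) for f in features]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def kmeans(vectors, k, iters=20):
    """Plain k-means with cosine similarity (assumed clusterer), seeded with the first k vectors."""
    centroids = [list(v) for v in vectors[:k]]
    labels = [0] * len(vectors)
    for _ in range(iters):
        labels = [max(range(k), key=lambda c: cosine(v, centroids[c])) for v in vectors]
        for c in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels

def rank_features(docs, labels, k):
    """Assumed cluster-based score: highest within-cluster document frequency
    minus overall document frequency. Ties are broken alphabetically."""
    vocab = sorted({w for d in docs for w in d})
    sizes = Counter(labels)
    scores = {}
    for w in vocab:
        overall = sum(w in d for d in docs) / len(docs)
        per_cluster = [
            sum(w in d for d, lab in zip(docs, labels) if lab == c) / sizes[c]
            for c in range(k) if sizes[c]
        ]
        scores[w] = max(per_cluster) - overall
    return sorted(vocab, key=lambda w: (-scores[w], w))

def interactive_clustering(docs, k, oracle, rounds=2, shown=4, kept=4, boost=2.0):
    """oracle(word) -> bool simulates the user accepting or rejecting a feature."""
    def cluster(features, accepted):
        weights = {f: boost for f in accepted}  # reweight previously accepted features
        return kmeans([vectorize(d, features, weights) for d in docs], k)

    features = sorted({w for d in docs for w in d})
    accepted = set()
    labels = cluster(features, accepted)  # initial, unsupervised clustering
    for _ in range(rounds):
        ranked = rank_features(docs, labels, k)
        to_label = [w for w in ranked if w not in accepted][:shown]
        accepted |= {w for w in to_label if oracle(w)}
        # Next feature set: accepted features plus other highly ranked ones.
        features = sorted(accepted | set(ranked[:kept]))
        labels = cluster(features, accepted)
    return labels, accepted

# Toy corpus with a simulated user who accepts four topical words.
docs = [
    ["goal", "match", "team"],
    ["recipe", "oven", "bake"],
    ["team", "goal", "score"],
    ["bake", "recipe", "flour"],
]
wanted = {"goal", "team", "recipe", "bake"}
labels, accepted = interactive_clustering(docs, k=2, oracle=lambda w: w in wanted)
print(labels, sorted(accepted))  # [0, 1, 0, 1] ['bake', 'goal', 'recipe', 'team']
```

On this toy corpus the sports and cooking documents end up in separate clusters, and the simulated user accepts all four topical words in the first round. Swapping in a different ranking score or clusterer only requires replacing `rank_features` or `kmeans`, since the loop itself depends only on their inputs and outputs.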


          Published in

          SAC '11: Proceedings of the 2011 ACM Symposium on Applied Computing
          March 2011
          1868 pages
          ISBN: 9781450301138
          DOI: 10.1145/1982185

          Copyright © 2011 ACM

          Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

          Publisher

          Association for Computing Machinery

          New York, NY, United States


          Acceptance Rates

          Overall Acceptance Rate: 1,650 of 6,669 submissions, 25%
