DOI: 10.1145/1277741.1277760
Article

Regularized clustering for documents

Published: 23 July 2007

ABSTRACT

In recent years, document clustering has received increasing attention as an important and fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. In this paper, we propose a novel method for clustering documents using regularization. Unlike traditional globally regularized clustering methods, our method first constructs a local regularized linear label predictor for each document vector, and then combines all those local regularizers with a global smoothness regularizer. We therefore call our algorithm Clustering with Local and Global Regularization (CLGR). We show that the cluster memberships of the documents can be obtained by eigenvalue decomposition of a sparse symmetric matrix, which can be solved efficiently by iterative methods. Finally, experimental evaluations on several datasets show the superiority of CLGR over traditional document clustering methods.
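The abstract does not give the CLGR matrix itself, so the following is only a minimal sketch of the general idea it names: recovering cluster memberships from an eigenvector of a sparse symmetric matrix using an iterative solver. As an assumption, a plain graph Laplacian over a toy document-similarity graph stands in for the CLGR matrix, and shifted power iteration stands in for whichever iterative method the authors use. All names (`adj`, `laplacian_matvec`, `fiedler_vector`, `two_way_clusters`) are hypothetical.

```python
import math
import random


def laplacian_matvec(adj, x):
    """Return (D - W) x, where W is a sparse symmetric weight matrix
    stored as a list of {neighbor_index: weight} dicts."""
    y = []
    for i, nbrs in enumerate(adj):
        degree = sum(nbrs.values())
        y.append(degree * x[i] - sum(w * x[j] for j, w in nbrs.items()))
    return y


def fiedler_vector(adj, iters=500, seed=0):
    """Approximate the eigenvector of the second-smallest Laplacian
    eigenvalue by power iteration on (shift * I - L), with the constant
    vector projected out at every step."""
    n = len(adj)
    # Gershgorin bound: every eigenvalue of L is at most 2 * max degree,
    # so shift * I - L is positive semidefinite.
    shift = 2.0 * max(sum(nbrs.values()) for nbrs in adj)
    rng = random.Random(seed)
    x = [rng.uniform(-1.0, 1.0) for _ in range(n)]
    for _ in range(iters):
        mean = sum(x) / n
        x = [v - mean for v in x]  # remove the trivial constant eigenvector
        lx = laplacian_matvec(adj, x)
        x = [shift * v - lv for v, lv in zip(x, lx)]
        norm = math.sqrt(sum(v * v for v in x))
        x = [v / norm for v in x]
    return x


def two_way_clusters(adj):
    """Split the nodes into two clusters by the sign of the Fiedler vector."""
    return [0 if v < 0 else 1 for v in fiedler_vector(adj)]


# Toy similarity graph (assumed data): documents 0-2 are mutually similar,
# documents 3-5 are mutually similar, and a weak edge links 2 and 3.
adj = [
    {1: 1.0, 2: 1.0},
    {0: 1.0, 2: 1.0},
    {0: 1.0, 1: 1.0, 3: 0.1},
    {2: 0.1, 4: 1.0, 5: 1.0},
    {3: 1.0, 5: 1.0},
    {3: 1.0, 4: 1.0},
]
labels = two_way_clusters(adj)
print(labels)
```

Each iteration touches only the nonzero entries of the matrix, which is why iterative solvers suit sparse problems of this kind. Which group receives label 0 depends on the arbitrary sign of the eigenvector, so only the grouping is meaningful.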


Published in

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN: 9781595935977
DOI: 10.1145/1277741

Copyright © 2007 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

Overall acceptance rate: 792 of 3,983 submissions (20%)
