Article

Regularized clustering for documents

Authors:
Fei Wang (Tsinghua University),
Changshui Zhang (Tsinghua University),
Tao Li (Florida International University)

SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, July 2007, Pages 95–102. https://doi.org/10.1145/1277741.1277760

Published: 23 July 2007

ABSTRACT

In recent years, document clustering has received increasing attention as a fundamental technique for unsupervised document organization, automatic topic extraction, and fast information retrieval or filtering. In this paper, we propose a novel method for clustering documents using regularization. Unlike traditional globally regularized clustering methods, our method first constructs a local regularized linear label predictor for each document vector and then combines all those local regularizers with a global smoothness regularizer. We therefore call our algorithm Clustering with Local and Global Regularization (CLGR). We show that the cluster memberships of the documents can be obtained by eigenvalue decomposition of a sparse symmetric matrix, which can be solved efficiently by iterative methods. Finally, experimental evaluations on several datasets show the superiority of CLGR over traditional document clustering methods.
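
The pipeline outlined in the abstract can be illustrated with a small NumPy sketch. This is one plausible reading of the abstract, not the authors' exact formulation: the k-nearest-neighbor graph, the ridge predictor with a bias term, the unnormalized graph Laplacian used as the global smoothness term, the final k-means step, and all names (`clgr_matrix`, `clgr_sketch`, `n_neighbors`, `local_reg`, `global_weight`) are assumptions made for illustration. Dense matrices are used for brevity; for a large collection the same matrix would presumably be stored sparsely and passed to an iterative eigensolver.

```python
import numpy as np

def _kmeans(Z, k, n_iter=20):
    """Tiny Lloyd's k-means with deterministic farthest-first seeding."""
    centers = [Z[0]]
    for _ in range(1, k):
        d = np.min([((Z - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(Z[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):
        dists = ((Z[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = Z[labels == j].mean(axis=0)
    return labels

def clgr_matrix(X, n_neighbors=3, local_reg=1.0, global_weight=1.0):
    """Hypothetical helper: build M = (I - P)^T (I - P) + global_weight * L."""
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    # Squared Euclidean distances; the diagonal is masked so that a
    # document is never its own neighbor.
    sq = (X ** 2).sum(axis=1)
    D2 = np.maximum(sq[:, None] + sq[None, :] - 2.0 * X @ X.T, 0.0)
    np.fill_diagonal(D2, np.inf)
    nbrs = np.argsort(D2, axis=1)[:, :n_neighbors]

    # Local term (assumed form): for each document, a ridge predictor with
    # bias is fitted on its neighbors. In dual form its prediction is a
    # weighted sum of the neighbors' labels, and the weights sum to one.
    P = np.zeros((n, n))
    for i in range(n):
        Xi = X[nbrs[i]]
        mu = Xi.mean(axis=0)
        Xc = Xi - mu
        kc = Xc @ (X[i] - mu)
        gram = Xc @ Xc.T + local_reg * np.eye(n_neighbors)
        P[i, nbrs[i]] = 1.0 / n_neighbors + np.linalg.solve(gram, kc)

    # Global term (assumed form): Laplacian of the symmetrized kNN graph.
    W = np.zeros((n, n))
    W[np.repeat(np.arange(n), n_neighbors), nbrs.ravel()] = 1.0
    W = np.maximum(W, W.T)
    L = np.diag(W.sum(axis=1)) - W

    I = np.eye(n)
    return (I - P).T @ (I - P) + global_weight * L

def clgr_sketch(X, n_clusters, **kwargs):
    """Cluster rows of X from the smallest eigenvectors of the matrix above."""
    M = clgr_matrix(X, **kwargs)
    _, vecs = np.linalg.eigh(M)      # eigenvalues come back in ascending order
    return _kmeans(vecs[:, :n_clusters], n_clusters)

# Toy term-count vectors: two groups of documents with disjoint vocabularies.
docs = np.array([
    [2, 1, 0, 0], [3, 1, 0, 0], [2, 2, 0, 0], [1, 2, 0, 0], [3, 2, 0, 0],
    [0, 0, 2, 1], [0, 0, 1, 3], [0, 0, 2, 2], [0, 0, 1, 2], [0, 0, 3, 2],
])
labels = clgr_sketch(docs, n_clusters=2, n_neighbors=3)
print(labels)
```

In this sketch the bias term makes each local predictor's weights sum to one, so a label vector that is constant on a connected group of documents incurs zero local and zero global cost. That is why the smallest eigenvectors separate the two toy groups.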

Index Terms

Regularized clustering for documents
1. Computing methodologies
  1. Machine learning
2. Information systems
  1. Information retrieval
    1. Retrieval tasks and goals
      1. Clustering and classification
  2. Information systems applications
    1. Data mining
      1. Clustering

Published in
SIGIR '07: Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval
July 2007
946 pages
ISBN:9781595935977
DOI:10.1145/1277741
General Chairs: Wessel Kraaij (TNO, The Netherlands), Arjen P. de Vries (CWI, The Netherlands)
Program Chairs: Charles L. A. Clarke (University of Waterloo, Canada), Norbert Fuhr (University of Duisburg-Essen, Germany), Noriko Kando (National Institute of Informatics, Japan)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
document clustering
regularization
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%