DOI: 10.1145/1148170.1148241
Article

Document clustering with prior knowledge

Published: 06 August 2006

Abstract

Document clustering is an important tool for text analysis and is used in many different applications. We propose to incorporate prior knowledge of cluster membership into document cluster analysis and develop a novel semi-supervised document clustering model. The method models a set of documents as a weighted graph in which each document is represented as a vertex, and each edge connecting a pair of vertices is weighted with the similarity of the two corresponding documents. The prior knowledge indicates pairs of documents that are known to belong to the same cluster and is transformed into a set of constraints. The document clustering task is then accomplished by finding the best cuts of the graph under these constraints. We apply the model to the Normalized Cut method to demonstrate the idea. Our experimental evaluations show that the proposed document clustering model achieves remarkable performance improvements with very limited training samples, and hence is a very effective semi-supervised classification tool.
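
The pipeline described in the abstract (similarity graph, same-cluster pairs turned into constraints, normalized cut under those constraints) can be illustrated with a minimal Python sketch. The abstract does not state how the constraints enter the objective, so this sketch substitutes one simple option and should not be read as the authors' formulation: each set of must-linked documents is contracted into a single super-vertex, and a standard two-way normalized cut is then solved on the contracted graph. The function names, the cosine similarity weighting, the zero threshold on the second eigenvector, and the toy term-count matrix are all assumptions made for illustration.

```python
import numpy as np

def cosine_similarity_graph(X):
    """Adjacency matrix with W[i, j] = cosine similarity of documents i and j."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)
    W = Xn @ Xn.T
    np.fill_diagonal(W, 0.0)  # no self-similarity edges
    return W

def group_ids(n, must_link):
    """Union-find: documents joined by must-link pairs receive the same group id."""
    parent = list(range(n))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    for a, b in must_link:
        parent[find(a)] = find(b)
    roots = [find(i) for i in range(n)]
    ids = {r: k for k, r in enumerate(dict.fromkeys(roots))}
    return np.array([ids[r] for r in roots])

def constrained_ncut(W, must_link):
    """Two-way normalized cut restricted to partitions that keep must-link pairs together.

    Assumption: constraints are enforced by vertex contraction, which is a
    stand-in technique and not necessarily the one used in the paper.
    """
    n = W.shape[0]
    group = group_ids(n, must_link)
    m = int(group.max()) + 1
    if m < 2:
        return np.zeros(n, dtype=int)  # everything was linked into one cluster
    P = np.zeros((n, m))
    P[np.arange(n), group] = 1.0
    Wc = P.T @ W @ P                   # contracted graph (diagonal = within-group weight)
    d = Wc.sum(axis=1)                 # volume of each super-vertex
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    L_sym = np.eye(m) - d_inv_sqrt[:, None] * Wc * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L_sym)    # eigenvalues in ascending order
    y = vecs[:, 1] * d_inv_sqrt        # second generalized eigenvector of (D - W, D)
    return (y > 0).astype(int)[group]  # expand super-vertex labels back to documents

# Toy term-count matrix (hypothetical data): documents 0-2 and 3-5 share vocabulary.
X = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 1, 0, 0, 0],
    [1, 1, 2, 0, 0, 1],
    [0, 0, 1, 2, 1, 0],
    [0, 0, 0, 1, 2, 1],
    [0, 0, 0, 1, 1, 2],
], dtype=float)
W = cosine_similarity_graph(X)
labels = constrained_ncut(W, must_link=[(0, 2)])
print(labels)
```

Contraction satisfies every must-link pair exactly, because linked documents share one super-vertex and therefore one label. A penalty-based formulation would instead trade constraint violations against cut quality.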

Published In

SIGIR '06: Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval
August 2006
768 pages
ISBN:1595933697
DOI:10.1145/1148170
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. clustering with prior knowledge
  2. semi-supervised learning
  3. spectral clustering

Qualifiers

  • Article

Conference

SIGIR06: The 29th Annual International SIGIR Conference
August 6 - 11, 2006
Seattle, Washington, USA

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Cited By

  • (2025) Coherency-Constrained Spectral Clustering for Power Network Reduction. IEEE Open Access Journal of Power and Energy 12 (88-99). DOI: 10.1109/OAJPE.2025.3538619. Online publication date: 2025
  • (2024) A Sequential Min K-Cut Approach for Sum Rate Maximization of Clustered Cell-Free Networking. ICC 2024 - IEEE International Conference on Communications (4900-4905). DOI: 10.1109/ICC51166.2024.10622541. Online publication date: 9-Jun-2024
  • (2024) Contrastive Learning with Transformer Initialization and Clustering Prior for Text Representation. Applied Soft Computing 166 (112162). DOI: 10.1016/j.asoc.2024.112162. Online publication date: Nov-2024
  • (2023) Generalized Probabilistic Clustering Projection Models for Discrete Data. 2023 International Symposium on Networks, Computers and Communications (ISNCC) (1-7). DOI: 10.1109/ISNCC58260.2023.10323873. Online publication date: 23-Oct-2023
  • (2019) Clustering and Its Extensions in the Social Media Domain. Adaptive Resonance Theory in Social Media Data Clustering (15-44). DOI: 10.1007/978-3-030-02985-2_2. Online publication date: 1-May-2019
  • (2018) Document Clustering. Information Retrieval and Management (47-64). DOI: 10.4018/978-1-5225-5191-1.ch003. Online publication date: 2018
  • (2018) Partition Level Constrained Clustering. IEEE Transactions on Pattern Analysis and Machine Intelligence 40:10 (2469-2483). DOI: 10.1109/TPAMI.2017.2763945. Online publication date: 1-Oct-2018
  • (2018) A Prior Knowledge Based Approach to Improving Accuracy of Web Services Clustering. 2018 IEEE International Conference on Services Computing (SCC) (1-8). DOI: 10.1109/SCC.2018.00008. Online publication date: Jul-2018
  • (2018) Clustering data with partial background information. International Journal of Machine Learning and Cybernetics. DOI: 10.1007/s13042-018-0790-0. Online publication date: 5-Feb-2018
  • (2018) Clustering of biomedical documents using ontology-based TF-IGM enriched semantic smoothing model for telemedicine applications. Cluster Computing. DOI: 10.1007/s10586-018-2023-4. Online publication date: 20-Mar-2018
