On Knowledge-Enhanced Document Clustering

Manjeet Rege, Josan Koruthu, Reynold Bailey
Copyright: © 2012 | Volume: 2 | Issue: 3 | Pages: 11
ISSN: 2155-6377 | EISSN: 2155-6385 | EISBN13: 9781466612631 | DOI: 10.4018/ijirr.2012070105

MLA

Rege, Manjeet, et al. "On Knowledge-Enhanced Document Clustering." IJIRR vol.2, no.3 2012: pp.72-82. http://doi.org/10.4018/ijirr.2012070105

APA

Rege, M., Koruthu, J., & Bailey, R. (2012). On Knowledge-Enhanced Document Clustering. International Journal of Information Retrieval Research (IJIRR), 2(3), 72-82. http://doi.org/10.4018/ijirr.2012070105

Chicago

Rege, Manjeet, Josan Koruthu, and Reynold Bailey. "On Knowledge-Enhanced Document Clustering," International Journal of Information Retrieval Research (IJIRR) 2, no.3: 72-82. http://doi.org/10.4018/ijirr.2012070105

Abstract

Document clustering plays an important role in text analytics by finding natural groupings of documents based on their similarity, as determined by the words appearing in them. Many of the clustering algorithms accessible through text analytics tools are completely unsupervised. That is, they cannot incorporate any domain knowledge that might be available about the documents to improve clustering accuracy and relevance. The authors present a graph partitioning based semi-supervised document clustering algorithm. The user provides knowledge about a few of the documents in the form of “must-link” and “cannot-link” constraints between pairs of documents. A “must-link” constraint between two documents expresses that the user believes the two documents must be clustered together irrespective of their dissimilarity. Similarly, a “cannot-link” constraint signifies that the two documents should never be clustered together, no matter how similar they might be. These constraints are then incorporated into a computationally efficient, graph partitioning based document clustering algorithm. The proposed framework is validated through experiments performed on publicly available text datasets.
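
To make the idea of pairwise constraints concrete, the following is a minimal sketch and not the authors' algorithm, whose exact formulation the abstract does not give. It assumes one simple way of injecting constraints into a document similarity graph: a must-link pair gets the maximum edge weight and a cannot-link pair gets weight zero, after which the graph is split in two using the second eigenvector of the normalized Laplacian. The function names `constrained_affinity` and `spectral_bipartition`, the toy term-count matrix, and the choice of cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def constrained_affinity(X, must_link=(), cannot_link=()):
    """Cosine-similarity graph with pairwise constraints written into edge weights.

    Assumption for illustration: constraints are applied as soft weight overrides.
    """
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    Xn = X / np.where(norms == 0, 1.0, norms)   # guard against empty documents
    W = Xn @ Xn.T                               # pairwise cosine similarity
    np.fill_diagonal(W, 0.0)                    # no self-loops
    for i, j in must_link:
        W[i, j] = W[j, i] = 1.0                 # strongest possible edge
    for i, j in cannot_link:
        W[i, j] = W[j, i] = 0.0                 # remove the edge entirely
    return W

def spectral_bipartition(W):
    """Split the graph in two by the sign of the normalized-Laplacian Fiedler vector."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.where(d > 0, d, 1.0))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                 # eigenvalues in ascending order
    fiedler = vecs[:, 1] * d_inv_sqrt           # second-smallest eigenvector, rescaled
    return (fiedler > 0).astype(int)

# Toy term counts (rows: documents; columns: ball, team, vote, law, today).
# Documents 0-2 are about sports, 3-5 about politics, 6 is ambiguous.
X = np.array([
    [3, 2, 0, 0, 1],
    [2, 3, 0, 0, 1],
    [3, 3, 0, 0, 1],
    [0, 0, 3, 2, 1],
    [0, 0, 2, 3, 1],
    [0, 0, 3, 3, 1],
    [1, 1, 1, 1, 1],
], dtype=float)

# The user states that document 6 belongs with document 0 and not with document 3.
W = constrained_affinity(X, must_link=[(6, 0)], cannot_link=[(6, 3)])
labels = spectral_bipartition(W)
print(labels)
```

Because the sign of an eigenvector is arbitrary, the two cluster labels may be swapped between runs or platforms; only the grouping is meaningful. In this toy example, document 6 is equally similar to both topics by word counts alone, so the two constraints decide which group it joins. Weight overrides of this kind are soft: they bias the partition but do not strictly guarantee that every constraint is satisfied on larger or noisier data.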
