Clustering with Proximity Graphs: Exact and Efficient Algorithms

Michail Kazimianec, Nikolaus Augsten
Copyright: © 2013 | Volume: 3 | Issue: 4 | Pages: 21
ISSN: 2155-6393 | EISSN: 2155-6407 | EISBN13: 9781466635920 | DOI: 10.4018/ijkbo.2013100105
Cite Article

MLA

Kazimianec, Michail, and Nikolaus Augsten. "Clustering with Proximity Graphs: Exact and Efficient Algorithms." IJKBO vol.3, no.4 2013: pp.84-104. http://doi.org/10.4018/ijkbo.2013100105

APA

Kazimianec, M. & Augsten, N. (2013). Clustering with Proximity Graphs: Exact and Efficient Algorithms. International Journal of Knowledge-Based Organizations (IJKBO), 3(4), 84-104. http://doi.org/10.4018/ijkbo.2013100105

Chicago

Kazimianec, Michail, and Nikolaus Augsten. "Clustering with Proximity Graphs: Exact and Efficient Algorithms," International Journal of Knowledge-Based Organizations (IJKBO) 3, no.4: 84-104. http://doi.org/10.4018/ijkbo.2013100105

Abstract

Graph Proximity Cleansing (GPC) is a string clustering algorithm that automatically detects cluster borders and has been used successfully for string cleansing. For each potential cluster, a so-called proximity graph is computed, and the cluster border is detected from that graph. However, computing the proximity graph is expensive, and the state-of-the-art GPC algorithms only approximate it using a sampling technique. Further, the quality of GPC clusters has never been compared to standard clustering techniques such as k-means, density-based, or hierarchical clustering. In this article the authors propose two efficient algorithms, PG-DS and PG-SM, for the exact computation of proximity graphs. The authors show experimentally that their solutions are faster than the sampling-based algorithms, even when those algorithms use very small sample sizes. The authors also provide a thorough experimental evaluation of GPC and conclude that it is very efficient and shows good clustering quality in comparison to the standard techniques. These results open a new perspective on string clustering in settings where no knowledge about the input data is available.
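
To make the idea of a proximity graph more concrete, the following is a minimal sketch and not the authors' PG-DS or PG-SM algorithm. It rests on several assumptions that the abstract does not state: string similarity is measured as q-gram overlap, the neighborhood of a center string is grown transitively for each overlap threshold, and the cluster border is taken at the longest run of thresholds over which the neighborhood size stays constant. The function names `qgrams`, `overlap`, `proximity_graph`, and `detect_border` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def qgrams(s, q=2):
    """Multiset of q-grams of s, padded with '#' (assumed padding scheme)."""
    padded = "#" * (q - 1) + s + "#" * (q - 1)
    return Counter(padded[i:i + q] for i in range(len(padded) - q + 1))


def overlap(a, b, q=2):
    """Number of q-grams shared by a and b (multiset intersection size)."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def proximity_graph(center, strings, q=2):
    """Map each overlap threshold tau to the size of the neighborhood of center.

    Assumption: the neighborhood is grown transitively, i.e. a string joins
    if it shares at least tau q-grams with any string already in it.
    This brute-force version is quadratic per threshold.
    """
    max_tau = sum(qgrams(center, q).values())
    graph = {}
    for tau in range(max_tau, 0, -1):
        cluster = {center}
        frontier = [center]
        while frontier:
            s = frontier.pop()
            for t in strings:
                if t not in cluster and overlap(s, t, q) >= tau:
                    cluster.add(t)
                    frontier.append(t)
        graph[tau] = len(cluster)
    return graph


def detect_border(graph):
    """Return (tau, size) for the longest run of equal neighborhood sizes.

    Assumed rule: the reported tau is the largest threshold on that plateau;
    ties are resolved in favor of the plateau at higher thresholds.
    """
    taus = sorted(graph, reverse=True)
    best_tau, best_len = taus[0], 0
    i = 0
    while i < len(taus):
        j = i
        while j + 1 < len(taus) and graph[taus[j + 1]] == graph[taus[i]]:
            j += 1
        if j - i + 1 > best_len:
            best_tau, best_len = taus[i], j - i + 1
        i = j + 1
    return best_tau, graph[best_tau]


if __name__ == "__main__":
    names = ["john", "jon", "johan", "mary", "marie"]
    graph = proximity_graph("john", names)
    print(graph)
    print(detect_border(graph))
```

In this sketch no number of clusters and no similarity threshold is supplied by the user; the border falls out of the shape of the graph, which is the property the abstract highlights. The quadratic cost per threshold also hints at why exact computation is considered expensive.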