PHD: an efficient data clustering scheme using partition space technique for knowledge discovery in large databases

Applied Intelligence

Abstract

Rapid technological advances mean that the amount of data stored in databases is growing very fast, and data mining can uncover useful implicit information in these large databases. A priority concern in data mining is how to detect that information with low time cost, high correctness and a high noise-filtering rate in a way that suits large databases, which is why numerous clustering schemes have been proposed in recent decades. This investigation presents a new data clustering approach called PHD, an enhanced version of KIDBSCAN. PHD is a hybrid density-based algorithm that partitions the data set with K-means and then clusters the resulting partitions with IDBSCAN. Finally, the closest pairs of clusters are merged until the natural number of clusters in the data set is reached. Experimental results show that the proposed algorithm can perform the entire clustering while efficiently reducing run-time cost. They also indicate that it performs better than several well-known existing schemes such as the K-means, DBSCAN, IDBSCAN and KIDBSCAN algorithms. Consequently, the proposed PHD algorithm is efficient and effective for data clustering in large databases.
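The three stages named in the abstract (partition, cluster each partition by density, merge the closest clusters) can be illustrated with a minimal Python sketch. It is not the authors' implementation. Several pieces are assumptions made only for illustration: textbook DBSCAN stands in for IDBSCAN, the "closest pair" is taken to be the pair with the smallest point-to-point gap, the target number of clusters is passed in as `n_clusters` rather than discovered, and the function names and parameters (`k`, `eps`, `min_pts`) are hypothetical.

```python
import math
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns a partition index for each point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    assign = [0] * len(points)
    for _ in range(iters):
        assign = [min(range(k), key=lambda c: math.dist(p, centers[c]))
                  for p in points]
        for c in range(k):
            members = [p for p, a in zip(points, assign) if a == c]
            if members:  # keep the old center if a partition went empty
                centers[c] = tuple(sum(x) / len(members) for x in zip(*members))
    return assign

def dbscan(points, eps, min_pts):
    """Textbook DBSCAN (assumed stand-in for IDBSCAN); label -1 means noise."""
    labels = [None] * len(points)

    def neighbors(i):
        return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nbrs = neighbors(i)
        if len(nbrs) < min_pts:
            labels[i] = -1  # provisional noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in nbrs if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:  # only core points expand the cluster
                queue.extend(nj)
    return labels

def partition_cluster_merge(points, k, eps, min_pts, n_clusters):
    """Partition with K-means, cluster each partition, then merge closest pairs."""
    part = kmeans(points, k)
    clusters = []  # each cluster is a list of indices into `points`
    for c in range(k):
        idx = [i for i, a in enumerate(part) if a == c]
        local = dbscan([points[i] for i in idx], eps, min_pts)
        for lab in set(local) - {-1}:
            clusters.append([idx[m] for m, l in enumerate(local) if l == lab])

    def gap(a, b):
        # Assumed merge criterion: smallest distance between any two members.
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = ((x, y) for x in range(len(clusters))
                 for y in range(x + 1, len(clusters)))
        a, b = min(pairs, key=lambda xy: gap(clusters[xy[0]], clusters[xy[1]]))
        merged = clusters.pop(b)  # b > a, so index a stays valid
        clusters[a].extend(merged)
    return clusters
```

Calling `partition_cluster_merge(points, k=4, eps=1.5, min_pts=3, n_clusters=2)` on a list of 2-D tuples returns lists of point indices, one list per final cluster. Points that the density step marks as noise inside their partition are left out of every cluster in this sketch.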

Author information

Corresponding author

Correspondence to Cheng-Fa Tsai.

About this article

Cite this article

Tsai, CF., Yeh, HF., Chang, JF. et al. PHD: an efficient data clustering scheme using partition space technique for knowledge discovery in large databases. Appl Intell 33, 39–53 (2010). https://doi.org/10.1007/s10489-010-0239-y

  • DOI: https://doi.org/10.1007/s10489-010-0239-y
