Multi-step density-based clustering

  • Regular Paper
Knowledge and Information Systems

Abstract

Data mining in large databases of complex objects from scientific, engineering or multimedia applications is becoming increasingly important. In many areas, complex distance measures are the first choice, but simpler distance functions that can be computed much more efficiently are also available. In this paper, we demonstrate how the paradigm of multi-step query processing, which relies on exact as well as lower-bounding approximated distance functions, can be integrated into the two density-based clustering algorithms DBSCAN and OPTICS, resulting in a considerable efficiency boost. Our approach tries to confine itself to ɛ-range queries on the simple distance functions and carries out complex distance computations only at those stages of the clustering algorithm where they are required to compute the correct clustering result. Furthermore, we show how our approach can be used for approximated clustering, allowing the user to find an individual trade-off between quality and efficiency. To assess the quality of the resulting clusterings, we introduce suitable quality measures that can be used generally for evaluating approximated partitioning and hierarchical clusterings. In a broad experimental evaluation based on real-world test data sets, we demonstrate that our approach accelerates the generation of exact density-based clusterings by more than one order of magnitude. Furthermore, we show that our approximated clustering approach yields high-quality clusterings, where the desired quality is scalable with respect to the overall number of exact distance computations.
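
The filter-and-refine idea behind multi-step query processing can be illustrated with a minimal sketch of a generic ɛ-range query. This is not the integration into DBSCAN and OPTICS that the paper develops, only the underlying principle: if a cheap distance never exceeds the exact distance, any object whose cheap distance is already larger than ɛ can be discarded without computing the exact distance. All names here (`multi_step_range_query`, `filter_dist`, `exact_dist`) and the choice of distances (Euclidean distance as the "expensive" measure, the difference in the first coordinate as its lower bound) are assumptions made for illustration.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Tuple[float, float]


def exact_dist(a: Point, b: Point) -> float:
    """Stand-in for an expensive exact distance (here: Euclidean)."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def filter_dist(a: Point, b: Point) -> float:
    """Cheap lower bound: |dx| never exceeds the Euclidean distance."""
    return abs(a[0] - b[0])


def multi_step_range_query(
    query: Point,
    objects: Sequence[Point],
    eps: float,
    lower_bound: Callable[[Point, Point], float],
    exact: Callable[[Point, Point], float],
) -> Tuple[List[Point], int]:
    """Return the objects within eps of query and the number of exact
    distance computations that were needed."""
    hits: List[Point] = []
    n_exact = 0
    for obj in objects:
        # Filter step: lower_bound <= exact, so lower_bound > eps
        # already proves that the object is outside the eps-range.
        if lower_bound(query, obj) > eps:
            continue
        # Refinement step: only surviving candidates pay for the
        # exact distance computation.
        n_exact += 1
        if exact(query, obj) <= eps:
            hits.append(obj)
    return hits, n_exact


# Hypothetical toy data, not taken from the paper.
points: List[Point] = [(1.0, 0.0), (0.5, 3.0), (5.0, 5.0), (0.2, 0.1)]
hits, n_exact = multi_step_range_query(
    (0.0, 0.0), points, 1.0, filter_dist, exact_dist
)
```

On this toy input the filter discards one of the four objects, so only three exact distances are computed, and the result is identical to a scan that uses the exact distance alone. The lower-bounding property is what guarantees that no true hit is lost in the filter step.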

Author information

Stefan Brecheisen is a teaching and research assistant in Prof. Hans-Peter Kriegel's group. He works in the field of similarity search in spatial objects.

Hans-Peter Kriegel is a full professor at the University of Munich and has been head of the database group since 1991. He studied computer science at the University of Karlsruhe, Germany, and finished his doctoral thesis there in 1976. He has more than 200 publications in international journals and reviewed conference proceedings. His research interests are database systems for complex objects (molecular biology, medical science, multimedia, CAD, etc.), in particular query processing, similarity search and high-dimensional index structures, as well as knowledge discovery in databases and data mining.

Martin Pfeifle is a teaching and research assistant in Prof. Hans-Peter Kriegel's group. He finished his doctoral thesis on “Spatial Database Support for Virtual Engineering” in the spring of 2004.

About this article

Cite this article

Brecheisen, S., Kriegel, HP. & Pfeifle, M. Multi-step density-based clustering. Knowl Inf Syst 9, 284–308 (2006). https://doi.org/10.1007/s10115-005-0217-6

  • DOI: https://doi.org/10.1007/s10115-005-0217-6