DBDC: Density Based Distributed Clustering

Januzaj, Eshref; Kriegel, Hans-Peter; Pfeifle, Martin

doi:10.1007/978-3-540-24741-8_7

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 2992)

Included in the following conference series:

International Conference on Extending Database Technology

Abstract

Clustering has become an increasingly important task in modern application domains such as marketing and purchasing assistance, multimedia, and molecular biology, among many others. In most of these areas, the data are originally collected at different sites. To extract information from these data, they are merged at a central site and then clustered. In this paper, we propose a different approach. We cluster the data locally and extract suitable representatives from these clusters. These representatives are sent to a global server site, where we restore the complete clustering based on the local representatives. This approach is very efficient because the local clusterings can be carried out quickly and independently of each other. Furthermore, the transmission cost is low, as the number of transmitted representatives is much smaller than the cardinality of the complete data set. Based on this small number of representatives, the global clustering can be done very efficiently. For both the local and the global clustering, we use a density-based clustering algorithm. The combination of the local and the global clustering forms our new DBDC (Density Based Distributed Clustering) algorithm. Furthermore, we discuss the complex problem of finding a suitable quality measure for evaluating distributed clusterings. We introduce two quality criteria, which are compared to each other and which allow us to evaluate the quality of our DBDC algorithm. In our experimental evaluation, we show that we do not have to sacrifice clustering quality in order to gain an efficiency advantage when using our distributed clustering approach.
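
The local/global pipeline described in the abstract can be illustrated with a minimal Python sketch. The abstract does not say how representatives are chosen, how the global parameters are set, or how local points are relabeled, so those parts are assumptions made here for illustration only: representatives are a greedy subset of core points that lie more than eps apart, the global step runs DBSCAN on the representatives with radius 2 * eps and min_pts = 1, and each clustered local point takes the global label of its nearest representative. All function names (`dbscan`, `local_representatives`, `dbdc`) are hypothetical and not taken from the paper.

```python
import math
from typing import List, Optional, Tuple

Point = Tuple[float, float]

def region(points: List[Point], i: int, eps: float) -> List[int]:
    """Indices of all points within eps of points[i] (including i itself)."""
    return [j for j, q in enumerate(points) if math.dist(points[i], q) <= eps]

def dbscan(points: List[Point], eps: float, min_pts: int) -> List[int]:
    """Plain DBSCAN; returns one label per point, -1 marks noise."""
    labels: List[Optional[int]] = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        neighbors = region(points, i, eps)
        if len(neighbors) < min_pts:
            labels[i] = -1                      # noise, may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster             # former noise becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = region(points, j, eps)
            if len(reachable) >= min_pts:       # j is a core point: keep expanding
                queue.extend(reachable)
    return labels

def local_representatives(points, eps, min_pts):
    """Cluster one site and pick representatives (ASSUMPTION: greedy subset
    of core points that are pairwise more than eps apart)."""
    labels = dbscan(points, eps, min_pts)
    reps: List[Point] = []
    for i, p in enumerate(points):
        if labels[i] == -1 or len(region(points, i, eps)) < min_pts:
            continue                            # skip noise and border points
        if all(math.dist(p, r) > eps for r in reps):
            reps.append(p)
    return labels, reps

def dbdc(sites, eps, min_pts, eps_global=None):
    """Distributed clustering sketch: local DBSCAN, send representatives,
    global DBSCAN on the representatives, then relabel every site."""
    if eps_global is None:
        eps_global = 2 * eps                    # ASSUMPTION, not a value from the abstract
    local = [local_representatives(s, eps, min_pts) for s in sites]
    all_reps = [r for _, reps in local for r in reps]   # the only data "transmitted"
    rep_labels = dbscan(all_reps, eps_global, 1)        # ASSUMPTION: min_pts = 1
    result = []
    for points, (labels, _) in zip(sites, local):
        site_labels = []
        for p, lab in zip(points, labels):
            if lab == -1:
                site_labels.append(-1)          # local noise stays noise in this sketch
            else:
                nearest = min(range(len(all_reps)),
                              key=lambda k: math.dist(p, all_reps[k]))
                site_labels.append(rep_labels[nearest])
        result.append(site_labels)
    return result

# Usage: one elongated cluster split across two sites, plus a far cluster and a noise point.
site_a = [(x * 0.5, 0.0) for x in range(8)] + [(50.0, 50.0)]
site_b = [(4.0 + x * 0.5, 0.0) for x in range(8)] + [(100.0 + x * 0.5, 0.0) for x in range(4)]
print(dbdc([site_a, site_b], eps=1.0, min_pts=3))
```

In this toy setup the two halves of the elongated cluster live on different sites, and only a handful of representatives per site reach the global step. A simplification such as nearest-representative relabeling can misassign points near cluster boundaries, so the sketch should be read as an illustration of the data flow rather than as the authors' procedure.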

Author information

Authors and Affiliations

Institute for Computer Science, University of Munich
Eshref Januzaj, Hans-Peter Kriegel & Martin Pfeifle

Editor information

Editors and Affiliations

Purdue University
Elisa Bertino
Laboratory of Distributed Multimedia Information Systems and Applications, Technical University of Crete (MUSIC/TUC) Chania, 73100, Crete, Greece
Stavros Christodoulakis
Institute of Computer Science, FO.R.T.H., Vassilika Vouton, P.O. Box 1385, GR 71110, Heraklion, Greece
Dimitris Plexousakis
Department of Computer Science, University of Crete, P.O.Box 2208, GR 71409, Heraklion, Greece
Vassilis Christophides
National and Kapodistrian University of Athens, Greece
Manolis Koubarakis
IPD, Universität Karlsruhe, Am Fasanengarten 5, 76131, Karlsruhe
Klemens Böhm
Department of Computer Science and Communication, University of Insubria, 22100, Varese, Italy
Elena Ferrari

About this paper

Cite this paper

Januzaj, E., Kriegel, H.-P., Pfeifle, M. (2004). DBDC: Density Based Distributed Clustering. In: Bertino, E., et al. Advances in Database Technology - EDBT 2004. EDBT 2004. Lecture Notes in Computer Science, vol 2992. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-24741-8_7

DOI: https://doi.org/10.1007/978-3-540-24741-8_7
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-21200-3
Online ISBN: 978-3-540-24741-8
eBook Packages: Springer Book Archive
