Skip to main content

CLUS: Parallel Subspace Clustering Algorithm on Spark

  • Conference paper
  • First Online:
New Trends in Databases and Information Systems (ADBIS 2015)

Part of the book series: Communications in Computer and Information Science ((CCIS,volume 539))

Included in the following conference series:

  • East European Conference on Advances in Databases and Information Systems

Abstract

Subspace clustering techniques were proposed to discover hidden clusters that only exist in certain subsets of the full feature spaces. However, the time complexity of such algorithms is at most exponential with respect to the dimensionality of the dataset. In addition, datasets are generally too large to fit in a single machine under the current big data scenarios. The extremely high computational complexity, which results in poor scalability with respect to both size and dimensionality of these datasets, give us strong motivations to propose a parallelized subspace clustering algorithm able to handle large high dimensional data. To the best of our knowledge, there are no other parallel subspace clustering algorithms that run on top of new generation big data distributed platforms such as MapReduce and Spark. In this paper we introduce CLUS: a novel parallel solution of subspace clustering based on SUBCLU algorithm. CLUS uses a new dynamic data partitioning method specifically designed to continuously optimize the varying size and content of required data for each iteration in order to fully take advantage of Spark’s in-memory primitives. This method minimizes communication cost between nodes, maximizes their CPU usage, and balances the load among them. Consequently the execution time is significantly reduced. Finally, we conduct several experiments with a series of real and synthetic datasets to demonstrate the scalability, accuracy and the nearly linear speedup with respect to number of nodes of the implementation.

The research leading to these results has been developed within the ONTIC project,which has received funding from the European Union’s Seventh Framework Programme (FP7/2007-2011) under grant agreement no. 619633.

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Forgy, E.W.: Cluster analysis of multivariate data: efficiency versus interpretability of classifications. Biometrics 21, 768–769 (1965)

    Google Scholar 

  2. Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise, pp. 226–231. AAAI Press (1996)

    Google Scholar 

  3. Kaufman, L., Rousseeuw, P.J.: Finding Groups in Data. John Wiley & Sons (1990)

    Google Scholar 

  4. Müller, E., Günnemann, S., Assent, I., Seidl, T.: Evaluating clustering in subspace projections of high dimensional data. In: Proc. VLDB, vol. 2(1) (2009)

    Google Scholar 

  5. Pearson, K.: On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2(11), 559–572 (1901)

    Article  Google Scholar 

  6. Peng, H., Long, F., Ding, C.: Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. Pattern Analysis and Machine Intelligence 27(8), 1226–1238 (2005)

    Article  Google Scholar 

  7. Zimek, A., Assent, I., Vreeken, J.: Frequent pattern mining algorithms for data clustering. In: Frequent Pattering Mining, chapter 16, pp. 403–423. Springer International Publishing (2014)

    Google Scholar 

  8. Kailing, K., Kriegel, H.P., Kröger, P.: Density-connected subspace clustering for high-dimensional data. In: Proc. SIAM, pp. 246–257 (2004)

    Google Scholar 

  9. Dean, J., Ghemawat, S.: MapReduce: simplified data In Proc. on large clusters. Communications of the ACM 51(1), 107–113 (2008)

    Article  Google Scholar 

  10. Shvachko, K., et al.: The hadoop distributed file system. In: 2010 IEEE 26th Symposium on Mass Storage Systems and Technologies (MSST). IEEE (2010)

    Google Scholar 

  11. Zaharia, M., et al.: Resilient distributed datasets: a fault-tolerant abstraction for in-memory cluster computing. In: Proc. USENIX (2012)

    Google Scholar 

  12. Parsons, L., Haque, E., Liu, H.: Subspace clustering for high dimensional data: a review. ACM SIGKDD Explorations Newsletter 6(1), 90–105 (2004)

    Article  Google Scholar 

  13. Sim, K., Gopalkrishnan, V., Zimek, A., Cong, G.: A survey on enhanced subspace clustering. Data Mining and Knowledge Discovery 26(2) (2013)

    Google Scholar 

  14. Agrawal, R., Gehrke, J., Gunopulos, D., Raghavan, P.: Automatic subspace clustering of high dimensional data for data mining applications. In: Proc. ACM SIGMOD, pp. 94–105 (1998)

    Google Scholar 

  15. Cheng, C., Fu, A., Zhang, Y.: Entropy-based subspace clustering for mining numerical data. In: Proc. SIGKDD, pp. 84–93 (1999)

    Google Scholar 

  16. Aggarwal, C.C., Wolf, J.L., Yu, P.S., Procopiuc, C., Park, J.S.: Fast algorithms for projected clustering. In: Proc. ACM SIGMOD, pp. 61–72 (1999)

    Google Scholar 

  17. Aggarwal, C.C., Yu, P.S.: Finding generalized projected clusters in high dimensional spaces. In: Proc. ACM SIGMOD, pp. 70–81 (2000)

    Google Scholar 

  18. Sequeira, K., Zaki, M.: SCHISM: a new approach for interesting subspace mining. In: Proc. ICDM, pp. 186–193 (2004)

    Google Scholar 

  19. Liu, G., Sim, K., Li, J., Wong, L.: Efficient mining of distance-based subspace clusters. Statistical Analysis and Data Mining 2(5–6), 427–444 (2010)

    MathSciNet  Google Scholar 

  20. Assent, I., Krieger, R., Müller, E., Seidl, T.: INSCY: indexing subspace clusters with in-process-removal of redundancy. In: Proc. ICDM, pp. 719–724 (2008)

    Google Scholar 

  21. Moise, G., Sander, J.: Finding non-redundant, statistically significant regions in high dimensional data: a novel approach to projected and subspace clustering. In: Proc. SIGKDD, pp. 533–541 (2008)

    Google Scholar 

  22. Gunnemann, S., Farber, I., Boden, B., Seidl, T.: Subspace clustering meets dense subgraph mining: a synthesis of two paradigms. In: Proc. ICDM (2010)

    Google Scholar 

  23. Goil, S., Nagesh, H., Choudhary, A.: MAFIA: efficient and scalable subspace clustering for very large data sets. In: Proc. SIGKDD (1999)

    Google Scholar 

  24. Spark. https://spark.apache.org/

  25. Domenoconi, C., Papadopoulos, D., Gunopulos, D., Ma, S.: Subspace clustering of high dimensional data. In: Proc. SIAM (2004)

    Google Scholar 

  26. Nazerzadeh, H., Ghodsi, M., Sadjadian, S.: Parallel subspace clustering. In: Proc. the 10th Annual Conference of Computer Society of Iran (2005)

    Google Scholar 

  27. Achtert, E., Kriegel, H.-P., Zimek, A.: ELKI: a software system for evaluation of subspace clustering algorithms. In: Ludäscher, B., Mamoulis, N. (eds.) SSDBM 2008. LNCS, vol. 5069, pp. 580–585. Springer, Heidelberg (2008)

    Chapter  Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Bo Zhu .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2015 Springer International Publishing Switzerland

About this paper

Cite this paper

Zhu, B., Mara, A., Mozo, A. (2015). CLUS: Parallel Subspace Clustering Algorithm on Spark. In: Morzy, T., Valduriez, P., Bellatreche, L. (eds) New Trends in Databases and Information Systems. ADBIS 2015. Communications in Computer and Information Science, vol 539. Springer, Cham. https://doi.org/10.1007/978-3-319-23201-0_20

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-23201-0_20

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-23200-3

  • Online ISBN: 978-3-319-23201-0

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics