DASC: data aware algorithm for scalable clustering

  • Regular Paper
  • Knowledge and Information Systems

Abstract

The emergence of the MapReduce (MR) framework for scaling data mining and machine learning algorithms addresses Volume, whereas the handling of Variety and Velocity must be skilfully crafted into the algorithms themselves. So far, scalable clustering algorithms have focused solely on Volume, taking advantage of the MR framework. In this paper we present a MapReduce algorithm, data aware scalable clustering (DASC), which handles the 3 Vs of big data by virtue of being (i) single scan and distributed to handle Volume, (ii) incremental to cope with Velocity and (iii) versatile in handling numeric and categorical data to accommodate Variety. The DASC algorithm incrementally processes an unboundedly growing data set stored on a distributed file system and delivers a quality clustering scheme while ensuring recency of patterns. The algorithm preserves an up-to-date synopsis of the data seen so far; each new data increment is processed and merged with this synopsis. Since the synopsis itself may grow very large, the algorithm stores it as a file, which makes the DASC algorithm truly scalable. Exclusive clusters are obtained on demand by applying a connected component analysis (CCA) algorithm over the synopsis. CCA presents a subtle roadblock to effective parallelism during clustering, which is overcome by performing the task in two stages. In the first stage, hyperclusters are identified based on prevailing data characteristics. The second stage uses this knowledge to determine the degree of parallelism, thereby making DASC data aware. Hyperclusters are distributed over the available compute nodes so that embedded clusters are discovered in parallel. The staged approach to clustering yields the dual advantage of improved parallelism and the desired complexity in the \(\mathcal{MRC}^0\) class. The DASC algorithm is empirically compared with the incremental K-means and Scalable K-means++ algorithms. Experimentation on real-world and synthetic data with approximately 1.2 billion data points demonstrates the effectiveness of the DASC algorithm. Empirical observations of DASC execution agree with the theoretical analysis with respect to stability in resource utilization and execution time.
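
The abstract does not specify the structure of the synopsis, so the following is only an illustrative single-machine sketch of the two ideas it names: merging each data increment into a running synopsis, and extracting exclusive clusters on demand by connected component analysis. It assumes a grid-style synopsis that maps cells to point counts and handles numeric data only. The cell width, the `min_count` density threshold and all function names are hypothetical and are not taken from the paper. The MapReduce distribution and the hypercluster stage are not modelled.

```python
from collections import Counter, deque
from itertools import product


def cell_of(point, width):
    """Map a numeric point to the integer grid cell that contains it."""
    return tuple(int(x // width) for x in point)


def merge_increment(synopsis, points, width):
    """Fold one data increment into the running synopsis (cell -> count)."""
    synopsis.update(cell_of(p, width) for p in points)
    return synopsis


def connected_components(synopsis, min_count=1):
    """Group dense cells that touch each other, diagonals included."""
    dense = {c for c, n in synopsis.items() if n >= min_count}
    seen, clusters = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:  # breadth-first search over neighbouring dense cells
            cell = queue.popleft()
            component.append(cell)
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(a + b for a, b in zip(cell, offset))
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Two increments arrive one after the other; clusters are requested afterwards.
synopsis = Counter()
merge_increment(synopsis, [(0.1, 0.2), (0.9, 1.1)], width=1.0)
merge_increment(synopsis, [(5.5, 5.5)], width=1.0)
print(connected_components(synopsis))
```

Because only the synopsis is consulted when clusters are requested, the raw increments need to be scanned just once, which is the property the abstract attributes to a single-scan, incremental design.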

References

  1. Cordeiro FRL, Traina Junior C, Machado Traina AJ, López J, Kang U, Faloutsos C (2011) Clustering very large multi-dimensional datasets with MapReduce. In: Proceedings of the 17th ACM SIGKDD, New York. ACM, pp 690–698

  2. Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the seventeenth international conference on knowledge discovery and data mining. ACM, pp 681–689

  3. Zhou P, Lei J, Ye W (2011) Large-scale datasets clustering based on MapReduce and Hadoop. J Comput Inf Syst 7(16):5956–5963

  4. Aggarwal CC (ed) (2007) Data streams: models and algorithms. Springer, New York

  5. Barbara D (2002) Requirements for clustering data streams. SIGKDD Explor 3(2):23

  6. Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):13

  7. Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of international conference on very large data bases, pp 81–92

  8. Amini A, Teh YW, Saboohi H (2014) On density-based data streams clustering algorithms: a survey. J Comput Sci Technol 29(1):116–141

  9. Bhatnagar V, Kaur S, Chakravarthy S (2014) Clustering data streams using grid-based synopsis. Knowl Inf Syst 41:127–152

  10. Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the thirteenth International conference on knowledge discovery and data mining. ACM

  11. Forestiero A, Pizzuti C, Spezzano G (2013) A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Min Knowl Discov 26(1):1–26

  12. Lin J, Lin H (2011) A density-based clustering over evolving heterogeneous data stream. Int J Digit Content Technol Its Appl 5:325–330

  13. Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining, pp 326–337

  14. Orlowska ME, Sun X, Li X (2006) Can exclusive clustering on streaming data be achieved? SIGKDD Explor 8(2):102–108

  15. Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Record 33:32–37

  16. Akioka S (2013) Task graphs for stream mining algorithms. In: Proceedings of first international workshop on big dynamic distributed data. ACM, pp 55–60

  17. Hadian A, Shahrivari S (2014) High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs. J Supercomput 69(2):845–863

  18. Lv Z, Hu Y, Zhong H, Wu J, Li B, Zhao H (2010) Parallel K-means clustering of remote sensing images based on MapReduce. In: Proceedings of the 2010 international conference on web information systems and mining, WISM’10. Springer, Berlin, pp 162–170

  19. Wang S, Dutta H (2011) PARABLE: a parallel random-partition based hierarchical clustering algorithm for the MapReduce framework. http://hdl.handle.net/10022/AC:P:11821

  20. Zhanquan S (2013) A parallel clustering method study based on MapReduce. In: International workshop on cloud computing and information security. Atlantis Press

  21. Zhao W, Ma H, He Q (2009) Parallel K-means clustering based on MapReduce. In: Proceedings of the 1st international conference on cloud computing. Springer, pp 674–679

  22. Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113

  23. The Apache Software Foundation (1999). http://hadoop.apache.org/, http://hadoop.apache.org/hdfs/

  24. Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: Proceedings of the 19th ACM symposium on operating systems principles. ACM, pp 29–43

  25. Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc VLDB Endow 5(7):622–633

  26. Li Q, Wang P, Wang W, Hu H, Li Z, Li J (2014) An efficient K-means clustering algorithm on MapReduce. In: Proceedings of the 19th international conference on database systems for advanced applications, pp 357–371

  27. He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J (2011) MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of the 17th international conference on parallel and distributed systems. IEEE, pp 473–480

  28. Kim Y, Shim K, Kim M-S, Lee JS (2014) DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce. Inf Syst 42(0):15–35. ISSN 0306-4379

  29. Ganglia (2000) High Performance Monitoring Tool. University of California, Berkeley, http://ganglia.sourceforge.net/

  30. UCI KDD Archive (1999) KDD CUP 99 Intrusion Data. http://kdd.ics.uci.edu//databases/kddcup99

  31. Cardoso Margarida GMS (2014) Wholesale customers data. http://archive.ics.uci.edu/ml/datasets/Wholesale+customers

  32. Asuncion A, Newman DJ (2007) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Covertype

  33. Bhatt R, Dhall A (2012) Skin segmentation data. http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation

  34. Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. ACM J Exp Algorithmics 17(1):327–338

  35. Tan P-N, Steinbach M, Kumar V (2014) Introduction to data mining, 2nd edn. Pearson Education, Limited, New York City

  36. Karloff H, Suri S, Vassilvitskii S (2010) A model of computation for MapReduce. In: Proceedings of the twenty-first annual ACM-SIAM symposium on discrete algorithms, SODA ’10. Society for Industrial and Applied Mathematics, Philadelphia, pp 938–948. http://dl.acm.org/citation.cfm?id=1873601.1873677

Acknowledgments

We thank the authors of [17] for making the BigCross data set available to us. We also thank the Cluster Innovation Centre, Delhi University, for permitting us to use its computing facility.

Author information

Correspondence to Sharanjit Kaur.

Additional information

This work was done when Vasudha Bhatnagar was visiting South Asian University, New Delhi, India.

Cite this article

Bhatnagar, V., Kaur, S., Saxena, R. et al. DASC: data aware algorithm for scalable clustering. Knowl Inf Syst 50, 851–881 (2017). https://doi.org/10.1007/s10115-016-0958-4

  • DOI: https://doi.org/10.1007/s10115-016-0958-4
