DASC: data aware algorithm for scalable clustering

Bhatnagar, Vasudha; Kaur, Sharanjit; Saxena, Rakhi; Khanna, Dhriti

doi:10.1007/s10115-016-0958-4

Regular Paper
Published: 01 June 2016

Volume 50, pages 851–881, (2017)

Knowledge and Information Systems



Abstract

The emergence of the MapReduce (MR) framework for scaling data mining and machine learning algorithms provides for Volume, while the handling of Variety and Velocity must be skilfully crafted into the algorithms themselves. So far, scalable clustering algorithms have focused solely on Volume, taking advantage of the MR framework. In this paper we present a MapReduce algorithm, data aware scalable clustering (DASC), which is capable of handling the 3 Vs of big data by virtue of being (i) single scan and distributed to handle Volume, (ii) incremental to cope with Velocity and (iii) versatile in handling numeric and categorical data to accommodate Variety. The DASC algorithm incrementally processes an infinitely growing data set stored on a distributed file system and delivers a quality clustering scheme while ensuring recency of patterns. The algorithm preserves an up-to-date synopsis of the data seen so far; each new data increment is processed and merged with the synopsis. Since the synopsis itself may grow very large, the algorithm stores it as a file, which makes DASC truly scalable. Exclusive clusters are obtained on demand by applying a connected component analysis (CCA) algorithm over the synopsis. CCA presents a subtle roadblock to effective parallelism during clustering. This problem is overcome by accomplishing the task in two stages. In the first stage, hyperclusters are identified based on prevailing data characteristics. The second stage utilizes this knowledge to determine the degree of parallelism, thereby making DASC data aware. Hyperclusters are distributed over the available compute nodes for discovering embedded clusters in parallel. The staged approach to clustering yields the dual advantage of improved parallelism and the desired complexity in the \(\mathcal{MRC}^0\) class. The DASC algorithm is empirically compared with the incremental K-means and Scalable K-means++ algorithms. Experimentation on real-world and synthetic data with approximately 1.2 billion data points demonstrates the effectiveness of the DASC algorithm. Empirical observations of DASC execution are in consonance with the theoretical analysis with respect to stability in resource utilization and execution time.
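
The two ideas the abstract leans on, merging each data increment into a compact synopsis and running connected component analysis over that synopsis on demand, can be illustrated with a small single-machine sketch. This is not the DASC algorithm itself: the abstract does not specify the synopsis structure, so the grid of fixed-width cells, the cell-count synopsis, and the names `cell_of`, `merge_increment`, `connected_components`, `width` and `min_count` are all assumptions made for illustration. The MapReduce distribution, the hypercluster stage and categorical attributes are left out.

```python
from collections import Counter, deque
from itertools import product


def cell_of(point, width):
    """Map a numeric point to the integer coordinates of its grid cell."""
    return tuple(int(x // width) for x in point)


def merge_increment(synopsis, points, width):
    """Fold a new batch of points into the synopsis (cell -> point count)."""
    synopsis.update(cell_of(p, width) for p in points)
    return synopsis


def connected_components(synopsis, min_count=1):
    """Group dense cells that touch each other into clusters via BFS."""
    dense = {c for c, n in synopsis.items() if n >= min_count}
    seen, clusters = set(), []
    for start in sorted(dense):
        if start in seen:
            continue
        seen.add(start)
        queue, component = deque([start]), []
        while queue:
            cell = queue.popleft()
            component.append(cell)
            # Visit every neighbouring cell (offsets of -1, 0, +1 per dimension).
            for offset in product((-1, 0, 1), repeat=len(cell)):
                neighbour = tuple(c + o for c, o in zip(cell, offset))
                if neighbour in dense and neighbour not in seen:
                    seen.add(neighbour)
                    queue.append(neighbour)
        clusters.append(sorted(component))
    return clusters


# Two increments arrive one after the other; only the synopsis is kept.
synopsis = Counter()
merge_increment(synopsis, [(0.1, 0.2), (0.9, 1.1), (1.5, 1.5)], width=1.0)
merge_increment(synopsis, [(5.2, 5.3), (6.1, 5.9)], width=1.0)
clusters = connected_components(synopsis)
```

In this toy run the raw points are discarded after each call to `merge_increment`, and the five occupied cells fall into two components. Because components are independent of one another, each could in principle be handed to a separate worker, which is the kind of parallelism the staged approach is described as exploiting.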


References

Cordeiro FRL, Traina Junior C, Machado Traina AJ, López J, Kang U, Faloutsos C (2011) Clustering very large multi-dimensional datasets with MapReduce. In: Proceedings of the 17th ACM SIGKDD, New York. ACM, pp 690–698
Ene A, Im S, Moseley B (2011) Fast clustering using MapReduce. In: Proceedings of the seventeenth international conference on knowledge discovery and data mining. ACM, pp 681–689
Zhou P, Lei J, Ye W (2011) Large-scale datasets clustering based on MapReduce and Hadoop. J Comput Inf Syst 7(16):5956–5963
Aggarwal CC (ed) (2007) Data streams: models and algorithms. Springer, New York
Barbara D (2002) Requirements for clustering data streams. SIGKDD Explor 3(2):23
Faria ER, Barros RC, Hruschka ER, de Carvalho ACPLF, Gama J (2013) Data stream clustering: a survey. ACM Comput Surv 46(1):13
Aggarwal CC, Han J, Wang J, Yu PS (2003) A framework for clustering evolving data streams. In: Proceedings of international conference on very large data bases, pp 81–92
Amini A, Teh YW, Saboohi H (2014) On density-based data streams clustering algorithms: a survey. J Comput Sci Technol 29(1):116–141
Bhatnagar V, Kaur S, Chakravarthy S (2014) Clustering data streams using grid-based synopsis. Knowl Inf Syst 41:127–152
Chen Y, Tu L (2007) Density-based clustering for real-time stream data. In: Proceedings of the thirteenth International conference on knowledge discovery and data mining. ACM
Forestiero A, Pizzuti C, Spezzano G (2013) A single pass algorithm for clustering evolving data streams based on swarm intelligence. Data Min Knowl Discov 26(1):1–26
Lin J, Lin H (2011) A density-based clustering over evolving heterogeneous data stream. Int J Digit Content Technol Its Appl 5:325–330
Cao F, Ester M, Qian W, Zhou A (2006) Density-based clustering over an evolving data stream with noise. In: Proceedings of the sixth SIAM international conference on data mining, pp 326–337
Orlowska ME, Sun X, Li X (2006) Can exclusive clustering on streaming data be achieved? SIGKDD Explor 8(2):102–108
Park NH, Lee WS (2004) Statistical grid-based clustering over data streams. ACM SIGMOD Record 33:32–37
Akioka S (2013) Task graphs for stream mining algorithms. In: Proceedings of first international workshop on big dynamic distributed data. ACM, pp 55–60
Hadian A, Shahrivari S (2014) High performance parallel k-means clustering for disk-resident datasets on multi-core CPUs. J Supercomput 69(2):845–863
Lv Z, Hu Y, Zhong H, Wu J, Li B, Zhao H (2010) Parallel K-means clustering of remote sensing images based on MapReduce. In: Proceedings of the 2010 international conference on web information systems and mining, WISM’10. Springer, Berlin, pp 162–170
Wang S, Dutta H (2011) PARABLE: a parallel random-partition based hierarchical clustering algorithm for the MapReduce framework. http://hdl.handle.net/10022/AC:P:11821
Zhanquan S (2013) A parallel clustering method study based on MapReduce. In: International workshop on cloud computing and information security. Atlantis Press
Zhao W, Ma H, He Q (2009) Parallel K-means clustering based on MapReduce. In: Proceedings of the 1st international conference on cloud computing. Springer, pp 674–679
Dean J, Ghemawat S (2008) MapReduce: simplified data processing on large clusters. Commun ACM 51(1):107–113
The Apache Software Foundation (1999). http://hadoop.apache.org/, http://hadoop.apache.org/hdfs/
Ghemawat S, Gobioff H, Leung ST (2003) The Google file system. In: Proceedings of the 19th ACM symposium on operating systems principles. ACM, pp 29–43
Bahmani B, Moseley B, Vattani A, Kumar R, Vassilvitskii S (2012) Scalable K-means++. Proc VLDB Endow 5(7):622–633
Li Q, Wang P, Wang W, Hu H, Li Z, Li J (2014) An efficient K-means clustering algorithm on MapReduce. In: Proceedings of the 19th international conference on database systems for advanced applications, pp 357–371
He Y, Tan H, Luo W, Mao H, Ma D, Feng S, Fan J (2011) MR-DBSCAN: an efficient parallel density-based clustering algorithm using MapReduce. In: Proceedings of the 17th international conference on parallel and distributed systems. IEEE, pp 473–480
Kim Y, Shim K, Kim M-S, Lee JS (2014) DBCURE-MR: an efficient density-based clustering algorithm for large data using MapReduce. Inf Syst 42(0):15–35. ISSN 0306-4379
Ganglia (2000) High Performance Monitoring Tool. University of California, Berkeley, http://ganglia.sourceforge.net/
UCI KDD Archive (1999) KDD CUP 99 Intrusion Data. http://kdd.ics.uci.edu//databases/kddcup99
Cardoso Margarida GMS (2014) Wholesale customers data. http://archive.ics.uci.edu/ml/datasets/Wholesale+customers
Asuncion A, Newman DJ (2007) UCI machine learning repository. https://archive.ics.uci.edu/ml/datasets/Covertype
Bhatt R, Dhall A (2012) Skin segmentation data. http://archive.ics.uci.edu/ml/datasets/Skin+Segmentation
Ackermann MR, Märtens M, Raupach C, Swierkot K, Lammersen C, Sohler C (2012) StreamKM++: a clustering algorithm for data streams. ACM J Exp Algorithmics 17(1):327–338
Tan P-N, Steinbach M, Kumar V (2014) Introduction to data mining, 2nd edn. Pearson Education, Limited, New York City
Karloff H, Suri S, Vassilvitskii S (2010) A model of computation for MapReduce. In: Proceedings of the twenty-first annual ACM-SIAM symposium on discrete algorithms, SODA ’10, pp 938–948, Philadelphia. Society for Industrial and Applied Mathematics. http://dl.acm.org/citation.cfm?id=1873601.1873677


Acknowledgments

We thank the authors of [17] for making the BigCross data set available to us. We also thank the Cluster Innovation Centre, Delhi University, for permitting us to use its computing facility.

Author information

Authors and Affiliations

Department of Computer Science, University of Delhi, New Delhi, 110019, India
Vasudha Bhatnagar
Acharya Narendra Dev College, University of Delhi, Govindpuri, Kalkaji, New Delhi, 110019, India
Sharanjit Kaur
Deshbandhu College, University of Delhi, Kalkaji, New Delhi, 110019, India
Rakhi Saxena
Indraprastha Institute of Information Technology, Okhla Industrial Estate, Near Govind Puri Metro Station, Phase III, New Delhi, 110020, Delhi, India
Dhriti Khanna


Corresponding author

Correspondence to Sharanjit Kaur.

Additional information

This work was done when Vasudha Bhatnagar was visiting South Asian University, New Delhi, India.


About this article

Cite this article

Bhatnagar, V., Kaur, S., Saxena, R. et al. DASC: data aware algorithm for scalable clustering. Knowl Inf Syst 50, 851–881 (2017). https://doi.org/10.1007/s10115-016-0958-4


Received: 05 February 2015
Revised: 14 February 2016
Accepted: 13 May 2016
Published: 01 June 2016
Issue Date: March 2017
DOI: https://doi.org/10.1007/s10115-016-0958-4
