Parallel SLINK for big data

Goyal, Poonam; Kumari, Sonal; Sharma, Sumit; Balasubramaniam, Sundar; Goyal, Navneet

doi:10.1007/s41060-019-00188-y

Poonam Goyal (ORCID: orcid.org/0000-0003-1556-9905)^1,2, Sonal Kumari^1,2, Sumit Sharma^1,2, Sundar Balasubramaniam^1,2 & Navneet Goyal^1,2


Abstract

The major strength of hierarchical clustering algorithms is that they allow visual interpretation of clusters through dendrograms. Users can cut the dendrogram at different levels to get the desired number of clusters. A major problem with hierarchical algorithms is their quadratic runtime complexity, which limits the amount of data that can be clustered in a reasonable amount of time. Also, because of their agglomerative merging process, each iteration depends on the data of all previous iterations, which makes them difficult to parallelize. Thus, there is a need for an efficient parallel implementation of the SLINK algorithm that can scale to big data. We present a parallel SLINK algorithm, sGridSLINK, for shared memory architectures. sGridSLINK produces exactly the same dendrogram as the classical SLINK algorithm. We also present hGridSLINK, a parallel algorithm that fully exploits a multi-core cluster system. To the best of our knowledge, there is no hybrid parallel algorithm for SLINK available in the literature. The proposed algorithms exploit spatial locality of data to reduce the number of distance calculations. Adaptive gridding is used to counter skewness in data and to ensure load balancing. Extensive experiments are carried out to establish the efficiency and scalability of the proposed parallel algorithms. On a real dataset of 6 million data points, sGridSLINK is approximately 840 times faster than the state-of-the-art algorithm when using 55 threads on a 48-core machine. It also achieves a speedup of 47.93 over the best known sequential SLINK, GridSLINK, on a real dataset using 48 threads on a 48-core machine. hGridSLINK achieves a maximum speedup of 68.26 on a 32-node cluster ($32\times 4$ processing elements) with respect to GridSLINK. The hGridSLINK algorithm is able to cluster 200 million data points in only 1317 s (less than 22 min). No existing parallel SLINK algorithm is capable of such efficient clustering of Big Data.
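For readers unfamiliar with the baseline, the following is a minimal sketch of the classical sequential SLINK algorithm (Sibson's pointer representation) together with a dendrogram cut. It is not the sGridSLINK or hGridSLINK algorithm described above: it uses no gridding and no parallelism, and it computes all pairwise distances. The function names `slink` and `cut`, the Euclidean metric, and the toy 1-D points are assumptions made purely for illustration.

```python
import math


def slink(points):
    """Sequential SLINK (illustrative sketch, quadratic time).

    Returns the pointer representation (pi, lam): lam[j] is the lowest level
    at which point j is no longer the highest-indexed point of its cluster,
    and pi[j] is the highest-indexed point of the cluster it joins then.
    """
    n = len(points)
    pi = [0] * n
    lam = [math.inf] * n
    for i in range(n):
        # Insert point i as a singleton cluster.
        pi[i] = i
        lam[i] = math.inf
        # Distances from the new point to every earlier point.
        m = [math.dist(points[i], points[j]) for j in range(i)]
        for j in range(i):
            if lam[j] >= m[j]:
                # Point i reaches j no later than j's old merge level.
                m[pi[j]] = min(m[pi[j]], lam[j])
                lam[j] = m[j]
                pi[j] = i
            else:
                m[pi[j]] = min(m[pi[j]], m[j])
        # Re-point entries whose target now merges earlier than they do.
        for j in range(i):
            if lam[j] >= lam[pi[j]]:
                pi[j] = i
    return pi, lam


def cut(pi, lam, height):
    """Cut the dendrogram at `height`; returns one cluster label per point.

    Pointers always lead to a higher index, so a single backward pass
    resolves every label to the highest-indexed point of its cluster.
    """
    n = len(pi)
    labels = list(range(n))
    for j in range(n - 1, -1, -1):
        if lam[j] <= height:
            labels[j] = labels[pi[j]]
    return labels


if __name__ == "__main__":
    # Hypothetical toy data: four points on a line.
    toy = [(0.0,), (1.0,), (3.0,), (7.0,)]
    pi, lam = slink(toy)
    print(pi, lam)
    print(cut(pi, lam, 2.5))
```

Cutting the toy dendrogram at height 2.5 groups the first three points together and leaves the last point on its own, which illustrates how one dendrogram yields different numbers of clusters at different cut levels.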

Acknowledgements

This work was supported by a research grant from the Department of Electronics and Information Technology (DeitY), Government of India.

Author information

Authors and Affiliations

ADAPT Lab, Birla Institute of Technology and Science, Pilani, India
Poonam Goyal, Sonal Kumari, Sumit Sharma, Sundar Balasubramaniam & Navneet Goyal
Department of Computer Science and Information Systems, Pilani Campus, Pilani, India
Poonam Goyal, Sonal Kumari, Sumit Sharma, Sundar Balasubramaniam & Navneet Goyal

Corresponding author

Correspondence to Poonam Goyal.

Ethics declarations

Conflict of interest

The authors declare that they have no conflict of interest.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Goyal, P., Kumari, S., Sharma, S. et al. Parallel SLINK for big data. Int J Data Sci Anal 9, 339–359 (2020). https://doi.org/10.1007/s41060-019-00188-y

Received: 29 January 2018
Accepted: 22 May 2019
Published: 11 June 2019
Issue Date: April 2020
DOI: https://doi.org/10.1007/s41060-019-00188-y
