DOI: 10.1145/3167918.3167929
research-article

Efficient hierarchical clustering for single-dimensional data using CUDA

Published: 29 January 2018

Abstract

Hierarchical clustering is a widely used and well-researched clustering technique. The classical algorithm for agglomerative hierarchical clustering is prohibitively expensive for large datasets. Numerous algorithms exist to improve the efficiency of hierarchical clustering for various linkage metrics and for large datasets. Recent research has proposed improving the efficiency of hierarchical clustering through parallelism. The newest approaches utilise GPGPU technologies, which facilitate massive parallelism on commodity consumer hardware. Existing GPGPU implementations fail to maximise the number of merges that can be performed in parallel, and have high memory usage. These limitations prevent existing implementations from achieving the full performance offered by GPGPU. In this paper, we propose a novel GPGPU algorithm for hierarchical clustering of one-dimensional data. Our proposed algorithm exploits the unique characteristics of one-dimensional data to maximise merge parallelism and significantly reduce memory requirements. Validation demonstrates that our proposed algorithm produces results equivalent to those of the classical algorithm for both the single-linkage and complete-linkage metrics. Benchmarking results show that our algorithm scales well to large datasets and offers a substantial speed-up over the classical algorithm. Future work will look to extend our proposed approach to larger datasets with higher dimensions.
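
The abstract does not spell out which characteristics of one-dimensional data are exploited. One well-known property is that, once the values are sorted, every cluster is a contiguous interval, and under both single and complete linkage the closest pair of clusters is always an adjacent pair. The sketch below is a sequential Python illustration of that property only, and is not the paper's GPU algorithm; the function names `adjacent_distance` and `cluster_1d` are hypothetical.

```python
from typing import List, Sequence, Tuple

Interval = Tuple[float, float]  # a cluster stored as (min value, max value)


def adjacent_distance(left: Interval, right: Interval, linkage: str) -> float:
    """Distance between two adjacent 1D clusters, with left below right."""
    if linkage == "single":
        # Closest pair of points: right's minimum minus left's maximum.
        return right[0] - left[1]
    if linkage == "complete":
        # Farthest pair of points: right's maximum minus left's minimum.
        return right[1] - left[0]
    raise ValueError("linkage must be 'single' or 'complete'")


def cluster_1d(
    values: Sequence[float], linkage: str = "single"
) -> List[Tuple[Interval, Interval, float]]:
    """Agglomerative clustering of 1D values, comparing adjacent clusters only.

    Returns the merge history as (left cluster, right cluster, distance).
    """
    # After sorting, each singleton is an interval of zero width.
    clusters: List[Interval] = [(v, v) for v in sorted(values)]
    merges: List[Tuple[Interval, Interval, float]] = []
    while len(clusters) > 1:
        # Only len(clusters) - 1 candidate pairs exist, so no full
        # pairwise distance matrix is needed.
        best = min(
            range(len(clusters) - 1),
            key=lambda i: adjacent_distance(clusters[i], clusters[i + 1], linkage),
        )
        left, right = clusters[best], clusters[best + 1]
        merges.append((left, right, adjacent_distance(left, right, linkage)))
        # Replace the two neighbours with the interval that spans both.
        clusters[best:best + 2] = [(left[0], right[1])]
    return merges
```

For example, `cluster_1d([8, 1, 4, 2], "complete")` merges 1 with 2 first, then that cluster with 4, then the result with 8. Because each step inspects only neighbouring intervals, the state is linear in the number of points, which is consistent with the reduced memory requirements the abstract claims, although the paper's actual data layout is not described here.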


Information & Contributors

Information

Published In

ACSW '18: Proceedings of the Australasian Computer Science Week Multiconference
January 2018
404 pages
ISBN:9781450354363
DOI:10.1145/3167918
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

  • CORE: Computing Research and Education

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. GPU acceleration
  2. agglomerative clustering
  3. parallel

Qualifiers

  • Research-article

Conference

ACSW 2018
Sponsor:
  • CORE
ACSW 2018: Australasian Computer Science Week 2018
January 29 - February 2, 2018
Brisbane, Queensland, Australia

Acceptance Rates

ACSW '18 Paper Acceptance Rate 49 of 96 submissions, 51%;
Overall Acceptance Rate 204 of 424 submissions, 48%

Bibliometrics & Citations

Bibliometrics

Article Metrics

  • Downloads (Last 12 months): 11
  • Downloads (Last 6 weeks): 0
Reflects downloads up to 25 Feb 2025

Citations

Cited By

  • (2024) A Complete Linkage Algorithm for Clustering Dynamic Datasets. Proceedings of the National Academy of Sciences, India Section A: Physical Sciences 94:5 (471-486). DOI: 10.1007/s40010-024-00894-8. Online publication date: 25-Sep-2024
  • (2023) An Efficient and Speedy approach for Hierarchical Clustering Using Complete Linkage method. 2023 Fifth International Conference on Electrical, Computer and Communication Technologies (ICECCT) (1-8). DOI: 10.1109/ICECCT56650.2023.10179708. Online publication date: 22-Feb-2023
  • (2021) Extending the Theory of Planned Behavior to Explore the Influence of Residents’ Dependence on Public Transport. IEEE Access 9 (137224-137240). DOI: 10.1109/ACCESS.2021.3117278. Online publication date: 2021
  • (2021) Accelerated Single Linkage Algorithm using the farthest neighbour principle. Sādhanā 46:1. DOI: 10.1007/s12046-020-01544-6. Online publication date: 26-Feb-2021
