
K-means tree: an optimal clustering tree for unsupervised learning

The Journal of Supercomputing

A Correction to this article was published on 16 November 2023


Abstract

Tree construction is a popular approach to supervised tasks in machine learning, but comparatively little effort has been devoted to using trees for unsupervised tasks. Traditional unsupervised trees recursively partition the space so that each resulting partition contains similar samples, where the notion of similarity depends on the model and the application. This paper addresses, for the first time, the problem of learning optimal oblique clustering trees and proposes a linear-time algorithm for training them. Optimizing the performance of infrastructures and optimizing energy consumption in the Internet of things are examples of applications of trees and of clustering, respectively. The motivation behind unsupervised tree models is to preserve the data manifold while keeping query time fast. Popular unsupervised tree models include k-d trees, random projection trees (RP trees), principal component analysis trees (PCA trees) and clustering trees. However, all existing methods for unsupervised trees are sub-optimal, existing clustering trees are limited to axis-aligned splits, and some of these methods, such as k-d trees, suffer from the curse of dimensionality. Despite these challenges, trees are fast at query time. On the other hand, a non-hierarchical clustering method such as k-means performs well in high-dimensional problems, is locally optimal and has an efficient learning algorithm, but it is not fast at query time. To address these issues, this paper proposes a novel k-means tree, a tree whose outputs are cluster centroids. Such a tree is fast at query time and learns cluster centroids as good as those of k-means. The learning problem is therefore to learn the centroids and the tree parameters jointly and optimally. In this paper, this problem is first cast as a constrained minimization problem and then solved with a quadratic penalty method. The method starts from the clusters learned by k-means and gradually adapts the centroids to the outputs of an optimal oblique tree. Alternating optimization is used, where the alternation steps consist of weighted k-means clustering and tree optimization. The training complexity of the proposed algorithm is efficient, and the algorithm is optimal in the sense that the clusters and the tree are learned jointly. The trees used in the k-means tree are oblique, and to our knowledge, this is the first time oblique trees have been applied to the task of clustering. As a by-product of the proposed method, sample reduction is explored and its merits are shown: the training complexity of KMT (K-means tree) as a sample reduction method is logarithmic in the size of the reduced training set, while that of K-means is linear in the size of the reduced dataset, making KMT faster to train for this task. Finally, the proposed method is compared with other tree-based clustering algorithms and shown to be superior in terms of reconstruction error, and its query complexity is compared with that of k-means.
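The alternating scheme described in the abstract can be pictured with the following minimal sketch; it is not the authors' reference implementation. Everything specific here is an illustrative assumption: the function name kmt_sketch, the schedule of penalty weights mu, the closed-form penalty-weighted centroid update, and the use of scikit-learn's axis-aligned DecisionTreeRegressor as a stand-in for the oblique-tree optimizer used in the paper.

```python
# Illustrative sketch only: axis-aligned DecisionTreeRegressor stands in for the
# paper's oblique trees, and the centroid update is an assumed closed-form
# compromise between the k-means update and the tree outputs.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor

def kmt_sketch(X, n_clusters=16, depth=4, mu_schedule=(0.1, 1.0, 10.0), n_inner=5):
    # Step 0: initialize the centroids with ordinary k-means.
    centroids = KMeans(n_clusters=n_clusters, n_init=10).fit(X).cluster_centers_
    tree = None
    for mu in mu_schedule:                    # quadratic-penalty continuation on mu
        for _ in range(n_inner):              # alternating optimization
            # Assignment step: nearest-centroid labels, as in standard k-means.
            dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
            labels = dists.argmin(axis=1)
            # Tree step: fit a tree that maps every sample to its current centroid
            # (multi-output regression; surrogate for the oblique-tree optimizer).
            tree = DecisionTreeRegressor(max_depth=depth).fit(X, centroids[labels])
            # Centroid step: penalty-weighted compromise between the cluster mean
            # (the k-means update) and the mean tree output of the cluster.
            tree_out = tree.predict(X)
            for k in range(n_clusters):
                members = labels == k
                if members.any():
                    centroids[k] = (X[members].mean(axis=0)
                                    + mu * tree_out[members].mean(axis=0)) / (1.0 + mu)
    return tree, centroids
```

At query time a centroid estimate is obtained as tree.predict(x.reshape(1, -1)), i.e. by routing the sample through at most depth splits, which is the fast-query property the abstract emphasizes.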





Acknowledgements

Peyman Tavallali’s research contribution to this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.

Author information


Corresponding author

Correspondence to Mukesh Singhal.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Performance on synthetic data

To further understand the different models visually, the K-means centroids of two synthetic datasets are investigated. Figure 6 shows a rotated checkerboard dataset. This dataset is not an easy one for trees because of its roughly symmetric shape. Red crosses are the centroids learned by the different trees, green crosses are the centroids learned by K-means, and blue dots are the training samples. K-means easily recognizes and clusters the different modes of the data and therefore learns globally optimal centroids for this problem. The global optimality of these centroids is an empirical observation: the synthetic data were designed with 16 modes, and a K-means with 16 centroids learned exactly these modes. Whereas the other unsupervised trees did not learn a proper clustering of the data, the proposed method (K-means tree, KMT) learned exactly the centroids found by K-means. Hierarchical K-means clustering was also able to learn the same or similar centroids as K-means, making it comparable to KMT; however, hierarchical K-means clustering is sub-optimal for this problem, which can cause issues on real datasets. For the comparison, each tree was first trained up to a depth of 4, and then a grouped K-means was run over its partitions of the data, as sketched below.
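The following minimal sketch illustrates one way this "grouped K-means over the tree partitions" step could look, under the assumption that, when K equals the number of leaves (16 leaves at depth 4), it amounts to taking one centroid per leaf, namely the mean of the training samples routed to that leaf. The helper name per_leaf_centroids is illustrative, and any fitted scikit-learn-style tree exposing apply() would work.

```python
import numpy as np

def per_leaf_centroids(tree, X):
    # tree.apply(X) returns the leaf index each sample falls into
    # (available on fitted scikit-learn trees); one centroid per leaf.
    leaves = tree.apply(X)
    return np.stack([X[leaves == leaf].mean(axis=0) for leaf in np.unique(leaves)])
```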

In Fig. 6, it can be observed that for small values of K, the various models learn centroids close to those of K-means. However, the performance of the conventional models drops for larger values of K. Among all models, only KMT and hierarchical K-means learn the centroids properly for all values of K.

Fig. 6

Synthetic dataset for demonstrating performance of various unsupervised trees. Blue dots show the dataset. Green symbols of \(\times\) show centroids learned by K-means, and red symbols of \(\times\) show the centroids learned by the proposed algorithm (KMT). KMT was able to learn the centroids similar to K-means, while other unsupervised tree methods could not achieve centroids as good as K-means (color figure online)

Figure 7 shows another, more complicated synthetic dataset, on which KMT performed better than the other models. The goal on this dataset is to find a cluster along each wing of the data; however, because of its symmetric shape, finding proper centroids is difficult in practice. None of the conventional tree models (including the hierarchical K-means model) could break the symmetry and learn proper centroids, whereas KMT learned centroids similar to those of K-means.

Fig. 7

Synthetic dataset for demonstrating performance of various unsupervised trees. Blue dots show the dataset. Green symbols of \(\times\) show centroids learned by K-means, and red symbols of \(\times\) show the centroids learned by the proposed algorithm (KMT). KMT was able to learn the centroids similar to K-means, while other unsupervised tree methods could not achieve centroids as good as K-means (color figure online)

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Tavallali, P., Tavallali, P. & Singhal, M. K-means tree: an optimal clustering tree for unsupervised learning. J Supercomput 77, 5239–5266 (2021). https://doi.org/10.1007/s11227-020-03436-2


  • DOI: https://doi.org/10.1007/s11227-020-03436-2
