Abstract
Tree construction is a popular approach to supervised tasks in machine learning, but comparatively little effort has gone into applying trees to unsupervised tasks. Traditional unsupervised trees recursively partition the space so that each resulting partition contains similar samples; the notion of similarity depends on the model and the application. This paper tackles, for the first time, the problem of learning optimal oblique clustering trees, and proposes a linear-time algorithm for training them. In the Internet of Things, optimizing infrastructure performance and energy consumption are, respectively, applications of trees and clustering. The motivation behind unsupervised tree models is to preserve the data manifold while keeping query time fast. Popular unsupervised tree models include k-d trees, random projection trees (RP trees), principal component analysis trees (PCA trees), and clustering trees. However, all existing methods for unsupervised trees are sub-optimal, and existing clustering trees are limited to axis-aligned splits. Further, some of these methods, such as k-d trees, suffer from the curse of dimensionality. Despite these challenges, trees are fast at query time. A non-hierarchical clustering method such as k-means, on the other hand, performs well in high-dimensional problems, is locally optimal, and has an efficient learning algorithm, but it is slow at query time. To address these issues, this paper proposes a novel k-means tree: a tree that outputs the centroids of clusters. Such a tree is fast at query time while learning cluster centroids as good as those of k-means. The problem of learning such trees is therefore to learn the centroids and the tree parameters jointly and optimally.
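As a schematic illustration of why such a tree is fast at query time (this is not the paper's implementation; the node structure and weights below are hypothetical), an oblique clustering tree routes a query through linear tests and returns the centroid stored at the reached leaf, so a lookup costs O(depth) comparisons rather than the O(K) distance computations of plain k-means:

```python
import numpy as np

class Leaf:
    """Leaf node: stores one cluster centroid."""
    def __init__(self, centroid):
        self.centroid = centroid

class ObliqueNode:
    """Internal node: routes x by the sign of the linear test w.x + b."""
    def __init__(self, w, b, left, right):
        self.w, self.b = w, b
        self.left, self.right = left, right

def query(node, x):
    """Return the centroid assigned to x in O(depth) time."""
    while isinstance(node, ObliqueNode):
        node = node.left if node.w @ x + node.b <= 0 else node.right
    return node.centroid

# Toy depth-1 tree: two centroids separated by the hyperplane x1 + x2 = 5.
tree = ObliqueNode(np.array([1.0, 1.0]), -5.0,
                   Leaf(np.array([0.0, 0.0])),
                   Leaf(np.array([5.0, 5.0])))
c = query(tree, np.array([0.1, 0.2]))  # routed to the left leaf
```

A query touches only one root-to-leaf path, which is the advantage over k-means that the abstract describes.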
In this paper, this problem is first cast as a constrained minimization problem and then solved with a quadratic penalty method. The method starts from clusters learned by k-means and gradually adapts the centroids to the outputs of an optimal oblique tree. Alternating optimization is used, with alternation steps consisting of weighted k-means clustering and tree optimization. The training complexity of the proposed algorithm is efficient, and the algorithm is optimal in the sense of learning the clusters and the tree jointly. The trees used in the k-means tree are oblique, and to the best of our knowledge, this is the first time oblique trees have been applied to clustering. As a by-product of the proposed method, sample reduction is explored and its merits are shown: training a KMT (k-means tree) as a sample reduction method is faster than training k-means for the same purpose, since the training complexity of the KMT sample reduction algorithm is logarithmic in the size of the reduced training set, whereas that of k-means is linear in the size of the reduced dataset. Finally, the proposed method is compared with other tree-based clustering algorithms, its superiority in terms of reconstruction error is shown, and its query complexity is compared with that of k-means.
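The quadratic-penalty alternation described above can be sketched as follows. This is a simplified, hypothetical rendering, not the paper's algorithm: it uses scikit-learn's axis-aligned `DecisionTreeRegressor` as a stand-in for the optimal oblique tree step, a fixed schedule of penalty weights `mu`, and a plain nearest-centroid assignment in place of the weighted k-means step. The centroid update is the closed-form minimizer of the penalized objective `sum_i ||x_i - c_{z_i}||^2 + mu * sum_i ||c_{z_i} - T(x_i)||^2`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.tree import DecisionTreeRegressor

def kmt_penalty_sketch(X, K, depth=4, mus=(0.1, 1.0, 10.0), inner_iters=5):
    """Schematic quadratic-penalty alternation: k-means centroids are
    gradually tied to the outputs of a (here axis-aligned) tree."""
    km = KMeans(n_clusters=K, n_init=10).fit(X)
    C = km.cluster_centers_.copy()   # centroids
    z = km.labels_.copy()            # cluster assignments
    for mu in mus:                   # increasing penalty weight
        for _ in range(inner_iters):
            # Tree step: fit a tree to reproduce each sample's centroid.
            tree = DecisionTreeRegressor(max_depth=depth).fit(X, C[z])
            T = tree.predict(X)
            # Centroid step: closed-form minimizer of the penalized objective.
            for k in range(K):
                mask = z == k
                if mask.any():
                    C[k] = (X[mask].sum(0) + mu * T[mask].sum(0)) \
                           / (mask.sum() * (1 + mu))
            # Assignment step: nearest centroid.
            d = ((X[:, None, :] - C[None, :, :]) ** 2).sum(-1)
            z = d.argmin(1)
    return tree, C
```

As `mu` grows, the centroids are pulled toward values the tree can represent, so the final tree's leaf outputs approach the learned centroids.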
Change history
16 November 2023
A Correction to this paper has been published: https://doi.org/10.1007/s11227-023-05723-0
Acknowledgements
Peyman Tavallali’s research contribution to this paper was carried out at the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration.
Appendix
1.1 Performance on synthetic data
To further understand the different models visually, the K-means centroids of two synthetic datasets are investigated. Figure 6 shows a rotated checkerboard dataset. This dataset is not an easy one for trees because of the nearly symmetric shape of the data. Red crosses are the centroids learned by the different trees, green crosses are the centroids learned by K-means, and blue dots are the training samples. K-means easily recognizes and clusters the different modes of the data, thus learning globally optimal centroids for this problem. The global optimality of the centroids is an empirical observation: the synthetic data were designed with 16 modes, and a K-means with 16 centroids learned exactly those modes. Whereas the other unsupervised trees did not learn a proper clustering of the data, the proposed method (K-means tree, KMT) learned exactly the K-means centroids. Hierarchical K-means clustering was also able to learn the same or similar centroids as K-means, making it comparable to KMT; however, hierarchical K-means clustering is sub-optimal for this problem, which can cause issues on real datasets. Each of the different trees was first trained up to a depth of 4, and then a grouped K-means was run over its partitions of the data.
In Fig. 6, it can be observed that for small values of K, various models learn centroids close to those of K-means. However, the conventional models' performance drops for larger values of K. Among all models, only KMT and hierarchical K-means learn proper centroids for all values of K.
Figure 7 shows another, more complicated synthetic dataset, on which KMT performed better than the other models. The goal for this dataset is to find clusters along each wing of the data. Because of its symmetric shape, however, finding proper centroids is difficult in practice. None of the conventional tree models (including the HK model) could break the symmetry to learn proper centroids, while KMT learned centroids similar to those of K-means.
About this article
Cite this article
Tavallali, P., Tavallali, P. & Singhal, M. K-means tree: an optimal clustering tree for unsupervised learning. J Supercomput 77, 5239–5266 (2021). https://doi.org/10.1007/s11227-020-03436-2