Clustering with a minimum spanning tree of scale-free-like structure
Introduction
The goal of this study was the clustering of real-world data using methods based on graph theory.
A minimum spanning tree (MST) of a weighted graph connects all the given data points at the lowest possible cost (Sedgewick, 1984). An MST can be used in clustering: if the weights of the edges represent the distances between the data points, removing edges from the MST leads to a collection of connected components which can be defined to be clusters. It might be possible to use some other kinds of networks in clustering; in this study, one network model was considered.
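The edge-removal idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure used in the study: the rule for choosing which edges to remove (every MST edge longer than a hypothetical `max_edge_length` threshold) and the function names are assumptions made for the example.

```python
import math

def minimum_spanning_tree(points):
    """Prim's algorithm on the complete Euclidean graph.

    Returns a list of (weight, u, v) edges, where u and v index `points`.
    """
    n = len(points)
    if n == 0:
        return []
    in_tree = [False] * n
    best = [math.inf] * n   # cheapest known connection cost to the tree
    parent = [-1] * n       # tree vertex that gives that cheapest cost
    best[0] = 0.0
    edges = []
    for _ in range(n):
        # Pick the vertex outside the tree that is cheapest to attach.
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            edges.append((best[u], parent[u], u))
        for v in range(n):
            if not in_tree[v]:
                d = math.dist(points[u], points[v])
                if d < best[v]:
                    best[v], parent[v] = d, u
    return edges

def mst_clusters(points, max_edge_length):
    """Drop MST edges longer than max_edge_length (an assumed cutting rule)
    and return one cluster label per point."""
    kept = [(u, v) for w, u, v in minimum_spanning_tree(points)
            if w <= max_edge_length]
    root = list(range(len(points)))  # union-find forest

    def find(i):
        while root[i] != i:
            root[i] = root[root[i]]
            i = root[i]
        return i

    for u, v in kept:
        root[find(u)] = find(v)
    return [find(i) for i in range(len(points))]
```

Points that remain connected after the cut share a label; the label values themselves carry no meaning beyond identifying the connected component.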
Different irregular network architectures have been proposed in the literature. One of the oldest is the random graph model of Erdős and Rényi. It has been used in different application fields as an idealized model along with networks with regular structure. Two newer models are small-world and scale-free networks. Small-world networks lie somewhere between regular and random networks, and the name derives by analogy from the small-world phenomenon (Watts and Strogatz, 1998). Whereas the probability that a vertex has k links follows a Poisson distribution in random networks, in scale-free networks it follows a power law P(k) ∼ k^(−γ). The exponent γ takes values between 2.1 and 2.4 in many real-world cases (Strogatz, 2001).
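To make the power-law statement concrete, the following sketch computes the link distribution P(k) of a graph and gives a rough estimate of γ. The function names are hypothetical, and the least-squares fit on log-log axes is an assumed, deliberately simple estimator; it is known to be crude for small or noisy samples and is not taken from the study.

```python
import math
from collections import Counter

def degree_distribution(edges):
    """Return {k: P(k)}: the fraction of vertices that have exactly k links.

    Only vertices that appear in at least one edge are counted.
    """
    degree = Counter()
    for u, v in edges:
        degree[u] += 1
        degree[v] += 1
    counts = Counter(degree.values())
    n = len(degree)
    return {k: c / n for k, c in counts.items()}

def estimate_gamma(p_of_k):
    """Negated least-squares slope of log P(k) against log k.

    Needs at least two distinct degrees with non-zero probability.
    """
    pairs = [(math.log(k), math.log(p))
             for k, p in p_of_k.items() if k > 0 and p > 0]
    xs = [x for x, _ in pairs]
    ys = [y for _, y in pairs]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return -sxy / sxx
```

For a distribution that follows k^(−γ) exactly, the points lie on a straight line in log-log coordinates and the fitted slope recovers γ; for real data the estimate should be read with caution.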
A scale-free structure emerges in a network when it grows by adding new vertices and the new vertices attach preferentially to vertices that are already highly connected (Barabási and Albert, 1999). Both of these ingredients are necessary if a scale-free structure is wanted (Barabási et al., 1999). The situation is slightly different if each vertex has some initial fitness that affects the connection-making process. Scale-free behavior can also emerge in this situation (Ergün and Rodgers, 2002).
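The two ingredients, growth and preferential attachment, can be illustrated with a short simulation. This is a sketch under stated assumptions rather than the construction used in the study: each new vertex makes exactly one link (so the result is a tree), the process starts from a single edge, and the function name is hypothetical.

```python
import random

def grow_preferential_tree(n_vertices, seed=0):
    """Grow a tree on n_vertices >= 2 vertices.

    Growth: vertices are added one at a time.
    Preferential attachment: each new vertex links to an existing vertex
    chosen with probability proportional to that vertex's current degree.
    """
    rng = random.Random(seed)
    edges = [(0, 1)]
    # Every vertex appears in this list once per link it has, so a uniform
    # draw from the list is a degree-proportional draw over vertices.
    endpoints = [0, 1]
    for new in range(2, n_vertices):
        target = rng.choice(endpoints)
        edges.append((target, new))
        endpoints.extend((target, new))
    return edges
```

Replacing `rng.choice(endpoints)` with a uniform draw over existing vertices removes the preferential ingredient while keeping growth, which gives a simple way to compare the resulting link distributions.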
Materials
The most important selection criterion for the data was the continuity of the attributes. In addition, all of the selected datasets had been studied before. The standardized deviates of the original attribute values were used in this study.
Three datasets from the UCI Machine Learning Repository (Blake and Merz, 1998) were used along with a dataset consisting of intracranial EEG measurements from rats.
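Standardized deviates (z-scores) rescale each attribute to zero mean and unit standard deviation, so that no attribute dominates a distance computation because of its units. A minimal sketch follows; the use of the sample standard deviation (n − 1 denominator) and the function name are assumptions, since the text does not specify them.

```python
import statistics

def standardize(rows):
    """Convert each attribute (column) of `rows` to standardized deviates.

    Assumes at least two rows and no constant column.
    """
    columns = list(zip(*rows))
    means = [statistics.fmean(col) for col in columns]
    # Sample standard deviation (n - 1 denominator): an assumption.
    stdevs = [statistics.stdev(col) for col in columns]
    return [
        tuple((x - m) / s for x, m, s in zip(row, means, stdevs))
        for row in rows
    ]
```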
Fisher’s iris plant dataset contains 150 instances and three (continuous) attributes measured from three
Methods
Clustering algorithms based on graph theory can detect clusters of different shapes and sizes, a feature that is not common among clustering methods. One example of this approach is minimum spanning tree (MST) clustering (see Algorithm 1). The clusters in the data must be well separated for MST clustering to recognize them. On the other hand, the method does not need any parameters such as the number of clusters or other a priori information about the underlying
Results
For each dataset, three different clustering methods were tested: SFMST, MST and k-means.
In both the MST and SFMST methods, Euclidean distance was used as the distance measure. A vertex was defined to be a hub if it had at least four links, and an SFMST cluster was defined to be a hub and all the vertices that connect directly to it. If two hubs were connected to each other or there was only one linking vertex between the hubs, they were defined to be in the same cluster. In addition, a branch was
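The hub rules stated in this snippet can be expressed as a short sketch. It covers only the rules that are fully given here: hubs have at least four links, a cluster is a hub plus its direct neighbors, and hubs that are adjacent or separated by a single linking vertex are merged. The rule for branches is cut off in the text, so vertices that are not adjacent to any hub are simply left unassigned. The function name and the `min_links` parameter are hypothetical.

```python
from collections import defaultdict

def hub_clusters(tree_edges, min_links=4):
    """Return {vertex: cluster_label} for hubs and their direct neighbors."""
    neighbors = defaultdict(set)
    for u, v in tree_edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    hubs = {v for v, nb in neighbors.items() if len(nb) >= min_links}

    root = {h: h for h in hubs}  # union-find over hubs only

    def find(h):
        while root[h] != h:
            root[h] = root[root[h]]
            h = root[h]
        return h

    for h in hubs:
        for mid in neighbors[h]:
            if mid in hubs:
                # Two hubs connected directly to each other.
                root[find(h)] = find(mid)
            else:
                # Exactly one linking vertex between two hubs.
                for other in neighbors[mid]:
                    if other in hubs and other != h:
                        root[find(h)] = find(other)

    labels = {}
    for h in hubs:
        labels[h] = find(h)
        for v in neighbors[h]:
            if v not in hubs:
                labels[v] = find(h)
    # Vertices missing from the dict are unassigned (branch rule not shown).
    return labels
```

A non-hub vertex adjacent to two hubs is by definition a single linking vertex, so both hubs end up in the same cluster and the vertex receives one unambiguous label.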
Discussion
One drawback of the presented procedure is that the algorithm is quite time-consuming; algorithm development and analysis aimed at faster computing times are clearly needed. Slow computing times restrict the number of data points that can be handled in practice. Consequently, it is debatable whether the link distribution of the produced trees really follows a power law.
The dependence on the distance function or similarity measure is an open question; maybe non-continuous attributes can be used along with the continuous ones if the
Acknowledgement
The author wishes to thank professors Tapio Grönfors and Seppo Lammi, from the Department of Computer Science, University of Kuopio, for their continuous encouragement and advice and for commenting on the manuscript. Thanks for providing the EEG dataset go to Jari Nissinen, Markku Penttonen and Asla Pitkänen from the A.I. Virtanen Institute for Molecular Sciences, University of Kuopio.
The network pictures in this document were created with Pajek—Program for Large Network Analysis (Batagelj and
References (12)
- et al., 1999. Mean-field theory for scale-free random networks. Physica A.
- et al., 2002. Growing random networks with fitness. Physica A.
- et al., 1995. Empirically defined regions of influence for clustering analyses. Pattern Recognition.
- et al., 1983. Data structures and algorithms.
- et al., 1999. Emergence of scaling in random networks. Science.
- Batagelj, V., Mrvar, A., 2004. Pajek—program for large network analysis....