Clustering with a minimum spanning tree of scale-free-like structure

https://doi.org/10.1016/j.patrec.2004.09.039

Abstract

In this study, a novel approach to graph-theoretic clustering is presented. A clustering algorithm based on a structure called the scale-free minimum spanning tree (SFMST) is described, and its performance is compared with standard minimum spanning tree clustering and the k-means method. The results show that the proposed method is a promising clustering procedure, although further analysis is still needed.

Introduction

The goal of this study was the clustering of real-world data using methods based on graph theory.

A minimum spanning tree (MST) of a weighted graph connects all the given data points at the lowest possible cost (Sedgewick, 1984). An MST can be used in clustering: if the weights of the edges represent the distances between the data points, removing edges from the MST leads to a collection of connected components which can be taken as clusters. Other kinds of networks might also be usable for clustering; in this study, one such network model was considered.
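As a point of reference for what follows, below is a minimal sketch of MST-based clustering in Python using SciPy. The edge-removal criterion used here (cut the k − 1 heaviest tree edges to obtain k components) is one common choice and is an assumption of this sketch, not necessarily the rule used in this study.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import minimum_spanning_tree, connected_components

def mst_clustering(points, n_clusters):
    """Cut the heaviest MST edges and return component labels.

    Cutting the (n_clusters - 1) heaviest edges is one common
    edge-removal criterion; the paper's exact rule may differ.
    """
    # Pairwise Euclidean distances as a dense matrix.
    dist = squareform(pdist(points))
    # Minimum spanning tree of the complete distance graph.
    mst = minimum_spanning_tree(csr_matrix(dist)).tocoo()
    # Keep all but the (n_clusters - 1) heaviest tree edges.
    order = np.argsort(mst.data)                      # ascending by edge weight
    keep = order[: len(mst.data) - (n_clusters - 1)]
    pruned = csr_matrix((mst.data[keep], (mst.row[keep], mst.col[keep])),
                        shape=mst.shape)
    # Each connected component of the pruned tree is a cluster.
    _, labels = connected_components(pruned, directed=False)
    return labels

# Example: two well-separated Gaussian blobs are split into two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
print(mst_clustering(X, 2))
```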

Different irregular network architectures have been proposed in the literature. One of the oldest is the random graph model of Erdős and Rényi, which has been used as an idealized model in different application fields alongside networks with regular structure. Two newer models are small-world and scale-free networks. Small-world networks lie somewhere between regular and random networks, and the name derives by analogy from the small-world phenomenon (Watts and Strogatz, 1998). Whereas the probability that a vertex has k links follows a Poisson distribution in random networks, in scale-free networks it follows a power law P(k) ~ k^(-γ). The exponent γ has taken values of about γ = 2.1–2.4 in many real-world cases (Strogatz, 2001).

A scale-free structure emerges in a network when it grows by adding new vertices and the new vertices are preferentially attached to vertices which are already highly connected (Barabási and Albert, 1999). Both of these ingredients are necessary if a scale-free structure is wanted (Barabási et al., 1999). The situation becomes a little different if each vertex has some initial fitness which affects the connection-making process; scale-free behavior can emerge in this situation as well (Ergün and Rodgers, 2002).
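To make the growth mechanism concrete, the sketch below grows a network by preferential attachment in the spirit of the Barabási–Albert model. It illustrates the general mechanism only, not the SFMST construction used later in this study; the function name and parameters are chosen here for illustration.

```python
import random
from collections import Counter

def preferential_attachment(n_vertices, m_links=2, seed=0):
    """Grow a network by adding vertices that attach preferentially
    to already highly connected vertices (Barabási–Albert style growth)."""
    rnd = random.Random(seed)
    edges = [(0, 1)]
    # Every edge endpoint appears once in this list, so drawing from it
    # picks an existing vertex with probability proportional to its degree.
    endpoints = [0, 1]
    for new in range(2, n_vertices):
        chosen = set()
        while len(chosen) < min(m_links, new):
            chosen.add(rnd.choice(endpoints))
        for old in chosen:
            edges.append((new, old))
            endpoints.extend((new, old))
    return edges

# The degree distribution of the grown network has a heavy tail:
edges = preferential_attachment(2000)
degree = Counter(v for e in edges for v in e)
print(sorted(Counter(degree.values()).items())[:8])   # (degree, count) pairs
```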


Materials

The most important selection criterion for the data was the continuity of the attributes. In addition, the selected datasets had all been studied before. The standardized deviates of the original attribute values were used in this study.

Three datasets from the UCI Machine Learning Repository (Blake and Merz, 1998) were used along with a dataset consisting of intracranial EEG measurements from rats.

Fisher’s iris plant dataset contains 150 instances and three (continuous) attributes measured from three

Methods

Clustering algorithms based on graph theory can be used to detect clusters of different shapes and sizes, a feature that is not common among clustering methods. An example of this approach is minimum spanning tree (MST) clustering (see Algorithm 1). The data must have well-separable clusters for them to be recognized with MST clustering. On the other hand, the method does not need any parameters such as the number of clusters or other a priori information about the underlying

Results

For each dataset three different clustering methods were tested: SFMST, MST and k-means.

In both the MST and SFMST methods, the Euclidean distance was used as the distance measure. A vertex was defined to be a hub if it had at least four links, and an SFMST cluster was defined to be a hub together with all the vertices that connect directly to it. If two hubs were connected to each other, or there was only one linking vertex between them, they were defined to be in the same cluster. In addition, a branch was
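The hub-based cluster definition above is concrete enough to sketch in code. The following illustration assumes the tree is given as an edge list; the hub threshold of four links and the merging of hubs that are adjacent or share one linking vertex follow the text, while the handling of branches (truncated above) is omitted. The function name and data layout are assumptions of this sketch.

```python
from collections import defaultdict

def sfmst_clusters(edges, hub_degree=4):
    """Group tree vertices into clusters around hubs.

    A hub is a vertex with at least `hub_degree` links; a cluster is a
    hub plus its direct neighbours, and hubs that are adjacent or share
    a single linking vertex belong to the same cluster.  The treatment
    of branches is truncated in the text above and omitted here.
    """
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    hubs = {v for v in adj if len(adj[v]) >= hub_degree}

    # Union-find over hubs, merging hubs at most two edges apart.
    parent = {h: h for h in hubs}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for h in hubs:
        for n in adj[h]:
            if n in hubs:                      # hubs directly connected
                parent[find(h)] = find(n)
            for n2 in adj[n]:
                if n2 in hubs and n2 != h:     # one linking vertex between hubs
                    parent[find(h)] = find(n2)

    # A cluster is a merged group of hubs plus their direct neighbours.
    clusters = defaultdict(set)
    for h in hubs:
        clusters[find(h)].update({h} | adj[h])
    return list(clusters.values())

# Example: a 5-leaf star around vertex 0 linked through vertex 6 to
# another star around vertex 7 -> the two hubs merge into one cluster.
tree = [(0, i) for i in range(1, 6)] + [(0, 6), (6, 7)] + [(7, i) for i in range(8, 12)]
print(sfmst_clusters(tree))
```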

Discussion

One drawback of the presented procedure is that the algorithm is quite time-consuming; further algorithm development and analysis are clearly needed to reach faster computing times. Slow computing times restrict the practical number of data points. Because of this, it can be questioned whether the link distribution of the produced trees really follows a power law.

The dependence on the distance function or similarity measure is an open question; maybe non-continuous attributes can be used along with the continuous ones if the

Acknowledgement

The author wishes to thank professors Tapio Grönfors and Seppo Lammi, from the Department of Computer Science, University of Kuopio, for continuous encouragement and advice and for commenting on the manuscript. Thanks for providing the EEG dataset go to Jari Nissinen, Markku Penttonen and Asla Pitkänen from the A.I. Virtanen Institute for Molecular Sciences, University of Kuopio.

The network pictures in this document were created with Pajek—Program for Large Network Analysis (Batagelj and


Cited by (75)

  • Minimum spanning tree hierarchical clustering algorithm: A new Pythagorean fuzzy similarity measure for the analysis of functional brain networks

    2022, Expert Systems with Applications
    Citation Excerpt:

    Clustering algorithms are commonly known and have been studied in various fields. In the introduction, we have explained a few research articles that broaden the applicability of clustering techniques to the ones with fuzzy information (Karunambigai et al., 2017; Päivinen, 2005; Xu, Chen, & Wu, 2008). PFSs have various advantages over FSs and IFSs due to the congenitally vague membership functions.

  • A graph-based clustering method with special focus on hyperspectral imaging

    2020, Analytica Chimica Acta
    Citation Excerpt:

    Here one tries to create disconnected subgraphs by removing edges with weights that differ significantly from others. Among many other graph types the Minimum Spanning Tree (MST) and its variants [2,3] are often used for this approach. Zahn [4] derived some basic criteria how such edges can be detected.

  • The research of constructing dynamic cognition model based on brain network

    2017, Saudi Journal of Biological Sciences
    Citation Excerpt:

    On the other hand, the method does not need any parameters like the number of clusters or some other a priori information about the underlying data. Highly connected vertices can be thought to be “cluster centers”, in this paper, the maximum degree is used to choose the cluster and centers, for example, the different cluster is in different colors as shown in Fig. 4 (Päivinen, 2005; Onnela et al., 2005). We used functional EEG and dynamic evolution modeling to firstly investigate the cortical dynamics among the region.
