On-line hierarchical clustering
Introduction
The objective of cluster analysis is to group a set of objects into clusters such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity.
Methods for clustering a data set into subsets can be divided into hierarchical and non-hierarchical (partitioning) methods. The general rationale behind partitioning methods is to choose some initial partition of the data set and then alter cluster memberships so as to obtain better partitions according to a predefined objective function.
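The partitioning rationale described above can be illustrated with a minimal k-means-style loop; this is a generic sketch of the idea, not an algorithm from this paper, and the function name and one-dimensional setting are illustrative assumptions.

```python
# Minimal sketch of the partitioning rationale: start from an initial
# partition, then alternately reassign memberships and update centres to
# reduce a predefined objective (here, within-cluster squared distance).
# Illustrative only; not an algorithm from the paper.
def kmeans_1d(points, k, iters=20):
    centers = points[:k]  # naive initial choice of cluster centres
    for _ in range(iters):
        # Assignment step: each point joins its nearest centre.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k), key=lambda c: (p - centers[c]) ** 2)
            clusters[j].append(p)
        # Update step: recompute centres to lower the objective.
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return centers

print(sorted(kmeans_1d([0.0, 0.5, 9.5, 10.0], k=2)))  # two well-separated groups
```

On this toy input the loop converges in two iterations to one centre per group.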
Hierarchical clustering procedures can be divided into agglomerative and divisive methods. Agglomerative methods progressively merge the objects according to some distance measure, in such a way that whenever two objects belong to the same cluster at some level, they remain together at all higher levels; divisive methods progressively subdivide the data set (Gowda and Krishna, 1978).
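The agglomerative scheme can be sketched as follows for the single-linkage case, where the distance between two clusters is the minimum pairwise distance between their members. This is a generic illustration of the classical off-line method (function and variable names are assumptions, not from the paper); note how two objects merged at one level stay together at all higher levels.

```python
# Sketch of classical (off-line) agglomerative single-linkage clustering
# on a precomputed distance matrix. Illustrative names, not the paper's.
import itertools

def single_linkage(dist):
    """dist: dict mapping frozenset({i, j}) -> distance between objects i, j."""
    objs = set(itertools.chain.from_iterable(dist))
    clusters = [frozenset([o]) for o in objs]  # start: every object alone
    merges = []
    while len(clusters) > 1:
        # Single linkage: cluster distance = minimum pairwise distance.
        (a, b), d = min(
            (((ca, cb),
              min(dist[frozenset({i, j})] for i in ca for j in cb))
             for ca, cb in itertools.combinations(clusters, 2)),
            key=lambda t: t[1])
        clusters = [c for c in clusters if c not in (a, b)] + [a | b]
        merges.append((set(a), set(b), d))  # record the dendrogram level
    return merges

d = {frozenset({0, 1}): 1.0, frozenset({0, 2}): 4.0, frozenset({1, 2}): 3.0}
print(single_linkage(d))
```

Each recorded merge carries its level, so objects 0 and 1, joined at level 1.0, remain in the same cluster at the higher level 3.0.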
Objects to be clustered usually come from an experimental study of some phenomenon and are described by a specific set of features selected by the data analyst. The feature values may be measured on different scales and can be continuous numeric, symbolic, or structured.
Continuous numeric data are well known as a classical data type, and many algorithms for clustering this type of data using partitioning or hierarchical techniques can be found in the literature (Jain and Dubes, 1988). Symbolic objects are an extension of classical data types. In conventional data sets, the objects are individualized, whereas symbolic objects are more unified by means of relationships. Based on their complexity, symbolic objects can be of Assertion, Hoard or Synthetic type (Gowda and Ravi, 1995).
References to clustering of symbolic objects can be found in Diday, 1988; Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Fisher, 1987; Cheng and Fu, 1985; Michalski and Stepp, 1983; Ichino, 1988; Gennari et al., 1989; Ralambondrainy, 1995, using different methodologies such as hierarchical clustering (Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Michalski and Stepp, 1983; Ichino, 1988), incremental clustering (Gennari et al., 1989), partitioning clustering (Ralambondrainy, 1995), and recently, fuzzy clustering (El-Sonbaty and Ismail, 1998). In the literature, research dealing with symbolic objects is less extensive than that dealing with numerical objects, owing to the nature of such objects, which are simple in construction but hard to process. Besides, the values taken by the features of symbolic objects may include one or more elementary objects, and the data set may have a variable number of features (Gowda and Diday, 1991).
Structured objects have higher complexity than continuous and symbolic objects because their structure is much more complex and their representation requires richer data structures that permit the description of relations between elementary object components and facilitate hierarchical object models describing how an object is built up from primitives.
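To make the notion of a dissimilarity between symbolic objects concrete, the sketch below computes a simple measure for interval-valued features in the spirit of Ichino's join and meet operators (span length minus overlap length per feature, summed across features). This is an illustrative assumption for exposition, not the specific measure used in this paper.

```python
# Illustrative dissimilarity for interval-valued symbolic features,
# in the spirit of Ichino-style join/meet operators; not the paper's measure.
def interval_dissim(a, b):
    """a, b: lists of (low, high) interval features of equal length."""
    total = 0.0
    for (al, ah), (bl, bh) in zip(a, b):
        join = max(ah, bh) - min(al, bl)            # length of the span (join)
        meet = max(0.0, min(ah, bh) - max(al, bl))  # length of the overlap (meet)
        total += join - meet
    return total

# Two objects, each described by two interval features.
print(interval_dissim([(0, 2), (5, 9)], [(1, 3), (6, 7)]))
```

Identical objects yield dissimilarity zero, and disjoint intervals contribute their full combined span, which matches the intuition that less overlap means less similarity.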
A survey of different representations and proximity measures of structured objects can be found in El-Sonbaty et al. (submitted).
Most of the hierarchical techniques introduced for clustering numeric or symbolic objects are off-line: they require all the objects, or the full distance matrix, to be available before any hierarchical clustering routine starts, which is impractical in some cases. The drawbacks of hierarchical techniques are well known in the field of data clustering: memory size, updating the membership matrix, the per-iteration complexity of evaluating the distance function, and the overall complexity of the algorithm are a few of the difficulties faced when using any hierarchy-based technique (Jain and Dubes, 1988).
The main contribution of this paper is to introduce an on-line agglomerative hierarchical technique based on the concept of the single-linkage method for clustering symbolic and numeric data. The new algorithm has computational complexity O(n²), which is lower than the O(n³) complexity of traditional hierarchical techniques reported in the literature (Jain and Dubes, 1988; Diday, 1988; Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Fisher, 1987; Cheng and Fu, 1985; Michalski and Stepp, 1983; Ichino, 1988; Gennari et al., 1989). The proposed algorithm also has lower memory requirements, which facilitates dealing with large data sets.
Section 2 describes the proposed algorithm. Applications and analysis of experimental results are given in Section 3 (Experimental results) and Section 4 (Discussions and conclusions).
Proposed algorithm
In this section, we introduce the new algorithm and discuss its computational complexity and required memory size.
Experimental results
In this section, the performance of the proposed algorithm is tested and evaluated using test data reported in the literature and some simulation experiments. The data sets used in these experiments are synthetic or real data whose classification is known from other clustering techniques (Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Ichino, 1988). Comparisons between results obtained from the proposed algorithm and other techniques are given. The simulation
Discussions and conclusions
In this paper, a new on-line algorithm for hierarchical clustering based on the concept of the single-linkage method was introduced. For each object, we calculate the k nearest objects to it; these k nearest objects are continuously updated on the arrival of each new object. By the arrival of the final object, we already have, for every object, its k nearest objects, which are sorted to generate a set of pairs constructing the hierarchical dendrogram. From experimental results and complexity analysis, the following
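The on-line scheme summarized above can be sketched as follows, under stated assumptions: each arriving object updates every stored object's list of its k nearest neighbours, and at the end the retained neighbour pairs are sorted by distance to give the candidate merge order of the dendrogram. Function and variable names are illustrative, not taken from the paper.

```python
# Sketch of the on-line nearest-k scheme: maintain, for each object seen
# so far, a bounded heap of its k nearest neighbours; update these heaps
# on each arrival, then sort the retained pairs by distance.
# Illustrative names; not the paper's exact algorithm.
import heapq

def online_nearest_k(stream, dist, k=3):
    objects = []   # objects seen so far
    nearest = []   # nearest[i]: heap of (-distance, index) of the k closest to i
    for x in stream:
        idx = len(objects)  # index the new arrival will take
        heap = []           # k nearest earlier objects to x
        for i, y in enumerate(objects):
            d = dist(x, y)
            # Update y's neighbour list with the new arrival ...
            heapq.heappush(nearest[i], (-d, idx))
            if len(nearest[i]) > k:
                heapq.heappop(nearest[i])  # evict the farthest of the k+1
            # ... and build x's list from all earlier objects.
            heapq.heappush(heap, (-d, i))
            if len(heap) > k:
                heapq.heappop(heap)
        objects.append(x)
        nearest.append(heap)
    # Sort all retained neighbour pairs by distance: candidate merge order.
    pairs = {(-nd, min(i, j), max(i, j))
             for i, h in enumerate(nearest) for nd, j in h}
    return sorted(pairs)

# Two natural groups on the line: {0, 1} and {10, 11}, interleaved arrival.
print(online_nearest_k([0.0, 10.0, 1.0, 11.0], lambda a, b: abs(a - b), k=2))
```

Each arrival costs one pass over the objects seen so far, giving O(n²) distance evaluations overall, consistent with the complexity claimed for the proposed algorithm, while only k neighbours per object are kept in memory.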
References (14)
Ralambondrainy, H., 1995. A conceptual version of the K-means algorithm. Pattern Recognition Letters.
Cheng, Y., Fu, K.S., 1985. Conceptual clustering in knowledge organization. IEEE Trans. PAMI.
Diday, E., 1988. In: Bock, H.H. (Ed.), The Symbolic Approach in Clustering, Classification and Related Methods of Data...
El-Sonbaty, Y., Ismail, M.A., 1998. Fuzzy clustering for symbolic data. IEEE Trans. on Fuzzy Systems.
El-Sonbaty, Y., Kamel, M.S., Ismail, M.A., submitted. Representations and proximity measures of structured...
Fisher, D.H., 1987. Knowledge acquisition via incremental conceptual clustering. Machine Learning.
Gennari, J.H., et al., 1989. Models of incremental concept formation. Artificial Intelligence.