Elsevier

Pattern Recognition Letters

Volume 19, Issue 14, December 1998, Pages 1285-1291

On-line hierarchical clustering

https://doi.org/10.1016/S0167-8655(98)00104-4

Abstract

Most of the techniques used in the literature for hierarchical clustering operate off-line. The main contribution of this paper is a new algorithm for on-line hierarchical clustering that finds the nearest k objects to each object introduced so far; these nearest k objects are continuously updated on the arrival of each new object. Once the final object has arrived, we have every object together with its nearest k objects, which are sorted to produce the hierarchical dendrogram. The results of applying the new algorithm to real and synthetic data, as well as simulation experiments, show that the new technique is quite efficient and, in many respects, superior to traditional off-line hierarchical methods.

Introduction

The objective of cluster analysis is to group a set of objects into clusters such that objects within the same cluster have a high degree of similarity, while objects belonging to different clusters have a high degree of dissimilarity.

The clustering of a data set into subsets can be divided into hierarchical and non-hierarchical (or partitioning) methods. The general rationale behind partitioning methods is to choose some initial partitioning of the data set and then alter cluster memberships so as to obtain better partitions according to a predefined objective function.
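As a concrete instance of that rationale (a standard textbook example, not this paper's method), a Lloyd-style k-means sketch: choose initial centers, then alternately reassign memberships and recompute centers so the within-cluster squared error decreases. The deterministic initialization below is only for reproducibility.

```python
def kmeans(points, k, iters=20):
    """Partitioning clustering: start from an initial set of centers and
    repeatedly alter cluster memberships to improve the squared-error
    objective (Lloyd's algorithm)."""
    centers = list(points[:k])  # deterministic init, for illustration only
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        # assignment step: each point joins its nearest center
        clusters = [[] for _ in range(k)]
        for p in points:
            i = min(range(k),
                    key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centers[c])))
            clusters[i].append(p)
        # update step: move each center to the mean of its cluster
        for i, cl in enumerate(clusters):
            if cl:
                centers[i] = tuple(sum(xs) / len(cl) for xs in zip(*cl))
    return centers, clusters
```

The alternation is guaranteed to reduce (or keep) the objective at every step, but only converges to a local optimum that depends on the initial partition.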

Hierarchical clustering procedures can be divided into agglomerative methods, which progressively merge the objects according to some distance measure in such a way that whenever two objects belong to the same cluster at some level, they remain together at all higher levels, and divisive methods, which progressively subdivide the data set (Gowda and Krishna, 1978).
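The agglomerative behaviour described above (whenever two objects join at some level, they remain together at all higher levels) can be sketched with a naive single-linkage procedure; this is the standard textbook form, not this paper's on-line algorithm:

```python
def single_linkage(points, dist):
    """Naive agglomerative single-linkage: repeatedly merge the two
    clusters whose closest members are nearest, recording each merge.
    Once merged, objects stay together at every higher level."""
    clusters = [[p] for p in points]
    merges = []  # (merge level, resulting cluster) history
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                # single linkage: cluster distance = closest pair of members
                d = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
        merges.append((d, list(clusters[i])))
    return merges
```

Scanning all cluster pairs at every merge is what makes the naive form expensive; the cost of this repeated search is one of the drawbacks the paper's on-line approach is designed to avoid.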

Objects to be clustered usually come from an experimental study of some phenomenon and are described by a specific set of features selected by the data analyst. The feature values may be measured on different scales and can be: continuous numeric, symbolic, or structured.

Continuous numeric data are the well-known classical data type, and many algorithms for clustering this type of data using partitioning or hierarchical techniques can be found in the literature (Jain and Dubes, 1988). Symbolic objects are an extension of classical data types: in conventional data sets the objects are individualized, whereas symbolic objects are unified by means of relationships. Based on their complexity, symbolic objects can be of Assertion, Hoard or Synthetic type (Gowda and Ravi, 1995). References to the clustering of symbolic objects can be found in Diday, 1988; Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Fisher, 1987; Cheng and Fu, 1985; Michalski and Stepp, 1983; Ichino, 1988; Gennari et al., 1989; Ralambondrainy, 1995, using different methodologies such as hierarchical clustering (Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Michalski and Stepp, 1983; Ichino, 1988), incremental clustering (Gennari et al., 1989), partitioning clustering (Ralambondrainy, 1995) and, recently, fuzzy clustering (El-Sonbaty and Ismail, 1998). In the literature, research dealing with symbolic objects is scarcer than that for numerical objects, owing to the nature of such objects, which are simple in construction but hard to process. Besides, the values taken by the features of symbolic objects may include one or more elementary objects, and the data set may have a variable number of features (Gowda and Diday, 1991). Structured objects have higher complexity than continuous and symbolic objects: their structure is much more complex, and their representation needs richer data structures to permit the description of relations between elementary object components and to facilitate hierarchical object models that describe how an object is built up from primitives.
A survey of different representations and proximity measures of structured objects can be found in El-Sonbaty et al., submitted.

Most of the hierarchical techniques introduced for clustering numeric or symbolic objects are off-line, meaning that they require all the objects, or the full distance matrix, to be available before any hierarchical clustering routine starts, which is impractical in some cases. The drawbacks of hierarchical techniques are well known in the field of data clustering: memory size, updating the membership matrix, the per-iteration cost of evaluating the distance function, and the overall complexity of the algorithm, to name a few of the difficulties faced when using any hierarchy-based technique (Jain and Dubes, 1988).

The main contribution of this paper is to introduce an on-line agglomerative hierarchical technique, based on the concept of the single-linkage method, for clustering symbolic and numeric data. The new algorithm has computational complexity O(n²), lower than the O(n³) complexity of the traditional hierarchical techniques reported in the literature (Jain and Dubes, 1988; Diday, 1988; Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Fisher, 1987; Cheng and Fu, 1985; Michalski and Stepp, 1983; Ichino, 1988; Gennari et al., 1989). The proposed algorithm also requires less memory, which facilitates dealing with large data sets.
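A rough sketch of the on-line bookkeeping described in the abstract (the actual algorithm is given in Section 2; the class name, Euclidean distance, and sorted-list representation below are illustrative assumptions): each arriving object updates the nearest-k lists of all earlier objects in O(n) work, giving O(n²) over n arrivals.

```python
import math

class OnlineNN:
    """Keep, for every object seen so far, its k nearest neighbours,
    updated incrementally as each new object arrives: O(n) work per
    arrival, hence O(n^2) for n objects overall."""

    def __init__(self, k, dist=math.dist):
        self.k, self.dist = k, dist
        self.objects = []
        self.nearest = []  # nearest[i] is a sorted list of (distance, j)

    def add(self, obj):
        i = len(self.objects)
        mine = []
        for j, other in enumerate(self.objects):
            d = self.dist(obj, other)
            # the newcomer may displace an entry in an old object's list
            self.nearest[j].append((d, i))
            self.nearest[j].sort()
            del self.nearest[j][self.k:]
            mine.append((d, j))
        mine.sort()
        self.objects.append(obj)
        self.nearest.append(mine[:self.k])
```

Memory stays at O(nk) rather than the O(n²) of a full distance matrix, which is consistent with the lower memory requirement claimed above.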

Section 2 describes the proposed algorithm. Applications and analysis of experimental results are given in Section 3 (Experimental results) and Section 4 (Discussions and conclusions).

Section snippets

Proposed algorithm

In this section, we introduce the new algorithm and discuss its computational complexity and required memory size.

Experimental results

In this section the performance of the proposed algorithm is tested and evaluated using test data reported in the literature and some simulation experiments. The data sets used in these experiments are synthetic or real, and their classification is known from other clustering techniques (Gowda and Diday, 1991; Gowda and Diday, 1992; Gowda and Ravi, 1995; Ichino, 1988). Comparisons between the results obtained from the proposed algorithm and other techniques are given. The simulation …

Discussions and conclusions

In this paper, a new on-line algorithm for hierarchical clustering, based on the concept of the single-linkage method, was introduced. For each object, we calculate its nearest k objects; these nearest k objects are continuously updated on the arrival of each new object. Once the final object has arrived, we already have every object and its nearest k objects, which are sorted to generate the set of pairs constructing the hierarchical dendrogram. From the experimental results and complexity analysis, the following …
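One plausible reading of that final sorting step (the snippet does not spell out the merge bookkeeping; the Kruskal-style union-find below is an assumption, not necessarily the paper's exact procedure): sort the stored (distance, i, j) pairs and merge, each successful union giving one joining level of the dendrogram.

```python
def hierarchy_from_pairs(n, pairs):
    """Given candidate (distance, i, j) pairs for n objects, sort them
    and merge with union-find; each successful union is one joining
    level of the single-linkage dendrogram (Kruskal-style sketch)."""
    parent = list(range(n))

    def find(x):
        # find the cluster representative, with path compression
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    merges = []
    for d, i, j in sorted(pairs):
        ri, rj = find(i), find(j)
        if ri != rj:            # skip pairs already in the same cluster
            parent[rj] = ri
            merges.append((d, i, j))
    return merges
```

Since each of the n objects contributes at most k candidate pairs, the sort costs O(nk log(nk)), so the O(n²) of the incremental distance updates dominates the total running time.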

References (14)

  • Ralambondrainy, H., 1995. A conceptual version of the K-means algorithm. Pattern Recognition Letters.
  • Cheng, Y., Fu, K.S., 1985. Conceptual clustering in knowledge organization. IEEE Trans. PAMI.
  • Diday, E., 1988. In: Bock, H.H. (Ed.), The Symbolic Approach in Clustering, Classification and Related Methods of Data...
  • El-Sonbaty, Y., Ismail, M.A., 1998. Fuzzy clustering for symbolic data. IEEE Trans. on Fuzzy Systems.
  • El-Sonbaty, Y., Kamel, M.S., Ismail, M.A., submitted. Representations and proximity measures of structured...
  • Fisher, D.H., 1987. Knowledge acquisition via incremental conceptual clustering. Machine Learning.
  • Gennari, J. et al., 1989. Models of incremental concept formation. Artificial Intelligence.
There are more references available in the full text version of this article.

Cited by (30)

  • Scatter/Gather browsing of web service QoS data

    2012, Future Generation Computer Systems

    Besides the partitioning algorithms discussed in the above two papers, hierarchical clustering can also be used for symbolic data. In [19], an online agglomerative hierarchical clustering algorithm based on the single-linkage method is used to cluster both symbolic and numerical data. There are also various other approaches available for interval clustering, using different mechanisms such as rough sets, genetic algorithms, belief functions, and neural networks.

  • Clustering constrained symbolic data

    2009, Pattern Recognition Letters
  • Partitional clustering algorithms for symbolic interval data based on single adaptive distances

    2009, Pattern Recognition

    A divisive method starts with all items in a single cluster and performs a splitting procedure until a stopping criterion is met (usually upon obtaining a partition of singleton clusters). SDA has provided hierarchical clustering methods for symbolic data [18,19,26,20,21,14,22,28,24,23], including a divisive [6] hierarchical method that performs a split of a cluster according to a suitable dispersion criterion. As mentioned above, SDA has provided partitional clustering methods for symbolic interval data.

  • Symbolic approach to reduced bio-basis

    2018, International Journal of Data Mining and Bioinformatics