An ISODATA clustering procedure for symbolic objects using a distributed genetic algorithm

https://doi.org/10.1016/S0167-8655(99)00027-6

Abstract

A novel ISODATA clustering procedure for symbolic objects is presented using distributed genetic algorithms, wherein a structured organisation is introduced in the distribution of the population, and selection and mating are made within locally distributed subgroups of individuals rather than within the whole population.

Introduction

In conventional data analysis, the objects are numerical vectors. The clustering of such objects is achieved by minimizing intracluster dissimilarity and maximizing intercluster dissimilarity. A good survey of cluster analysis can be found in the literature (Diday and Simon, 1976; Diday et al., 1987; Bock, 1987; Jain and Dubes, 1988; Diday, 1989; Duda and Hart, 1973).

Symbolic objects are extensions of classical data types. In conventional data sets, the objects are “individualised”, whereas in symbolic data sets they are more “unified” by means of relationships. Based on their complexity, symbolic objects can be of assertion, hoard or synthetic type. References to symbolic objects can be found in (Diday, 1990; Diday et al., 1980; Michalski et al., 1981; Cheng and Fu, 1983; Lebovitz, 1987; Fisher, 1987; Kodratoff and Tecuci, 1988; Hwang, 1989; Gowda and Diday, 1991a; Gowda and Diday, 1992; Gowda and Diday, 1991b; Gowda and Ravi, 1995a; Gowda and Ravi, 1995b).

Ichino (1988) defines general distance functions for mixed feature variables and presents dendrograms by the single linkage and complete linkage methods for data sets containing numeric and symbolic feature values.

Gowda and Diday (1991a, 1992) have proposed a new similarity and a new dissimilarity measure for symbolic objects. They present hierarchical symbolic clustering algorithms using the newly defined similarity and dissimilarity measures. They form composite symbolic objects whenever mutual pairs of symbolic objects are selected for agglomeration based on either similarity or dissimilarity.

Gowda and Ravi (1995a, 1995b) have proposed modified similarity and dissimilarity measures. They have proposed an agglomerative clustering algorithm (Gowda and Ravi, 1995b) which makes use of both similarity and dissimilarity. They also present a divisive symbolic clustering algorithm (Gowda and Ravi, 1995a), which starts with the entire set of samples in one group and forms subgroups by first dividing the entire set into two subsets according to some similarity and dissimilarity measure. These subsets are further subdivided until a stopping rule arrests further subdivision.

Our aim in this paper is to present a symbolic clustering algorithm that makes use of distributed genetic algorithms.

The basic principles of genetic algorithms were first laid down by Holland (1975) and are well described by Davis (1987, 1991), Goldberg (1989) and Michalewicz (1992).

Genetic algorithms are inspired by the evolution of biological organisms. Over many generations, a population of solutions evolves according to one of the principles of natural selection, the 'survival of the fittest', first clearly stated by Charles Darwin in The Origin of Species.

Standard genetic algorithms deal with a single population to which a global selection is applied and in which each chromosome can potentially mate with any other one. As an alternative, Andrey and Tarroux (1994) have introduced a structured organisation in the distribution of the population. In such a case, selection and mating take place within locally distributed subgroups of individuals rather than within the whole population. These are referred to as distributed genetic algorithms. They can be run independently and in parallel on each subgroup.
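The idea of local selection and mating can be illustrated with a minimal Python sketch. It is not the scheme of Andrey and Tarroux (1994): the fixed partition into subgroups, the binary tournament, one-point crossover, bit-flip mutation, elitism and the toy bit-counting fitness are all assumptions made for illustration, and every function name is hypothetical.

```python
import random


def fitness(chromosome):
    # Toy objective (an assumption): the number of 1-bits in the chromosome.
    return sum(chromosome)


def evolve_subgroup(subgroup, rng, mutation_rate=0.02):
    """Return the next generation of ONE subgroup; parents come only from it."""

    def pick_parent():
        # Binary tournament restricted to this subgroup (local selection).
        a, b = rng.sample(subgroup, 2)
        return a if fitness(a) >= fitness(b) else b

    best = max(subgroup, key=fitness)
    children = [best[:]]  # elitism: the subgroup's best survives unchanged
    while len(children) < len(subgroup):
        p1, p2 = pick_parent(), pick_parent()
        cut = rng.randrange(1, len(p1))  # one-point crossover position
        child = p1[:cut] + p2[cut:]
        # Bit-flip mutation applied independently to every gene.
        child = [bit ^ 1 if rng.random() < mutation_rate else bit for bit in child]
        children.append(child)
    return children


def distributed_ga(n_subgroups=4, subgroup_size=10, length=20,
                   generations=30, seed=0):
    rng = random.Random(seed)
    subgroups = [
        [[rng.randint(0, 1) for _ in range(length)] for _ in range(subgroup_size)]
        for _ in range(n_subgroups)
    ]
    for _ in range(generations):
        # No subgroup reads another subgroup's individuals, so these
        # updates are independent and could be executed in parallel.
        subgroups = [evolve_subgroup(group, rng) for group in subgroups]
    return subgroups
```

Calling `distributed_ga()` returns the final subgroups; because each call to `evolve_subgroup` touches only its own individuals, the per-generation loop is the part that a parallel implementation would distribute.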

The main aim of this paper is to present an ISODATA clustering procedure for symbolic objects that makes use of the principles of distributed genetic algorithms. The proposed procedure is expected to perform better than the conventional procedures and to converge faster, because selection and mating are done in locally distributed groups rather than on the whole population.

Similarity between two symbolic objects

Symbolic objects are defined by a logical conjunction of events linking values and variables in which the variables can take one (including none) or more values and all the objects need not be defined on the same variables.

Two symbolic objects $A$ and $B$ can be represented as the Cartesian product of features $A_k$ and $B_k$:

$$A = A_1 \times A_2 \times \cdots \times A_d, \qquad B = B_1 \times B_2 \times \cdots \times B_d.$$

Let $U_k$ denote the domain of the $k$th feature. Then the feature space can be written as a Cartesian product,

$$U^{(d)} = U_1 \times U_2 \times \cdots \times U_d.$$

The feature values may be
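As an illustration of the Cartesian product representation, a symbolic object can be held as a tuple with one component per feature. The sketch below assumes, since this excerpt does not say so, that quantitative features take closed intervals and qualitative features take finite sets of labels; the names `Interval`, `component_in_domain` and `in_feature_space` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple, Union


@dataclass(frozen=True)
class Interval:
    """Closed interval [low, high] used for a quantitative feature (assumed)."""
    low: float
    high: float

    def contains(self, other: "Interval") -> bool:
        return self.low <= other.low and other.high <= self.high


FeatureValue = Union[Interval, FrozenSet[str]]
SymbolicObject = Tuple[FeatureValue, ...]  # A = A_1 x A_2 x ... x A_d


def component_in_domain(value: FeatureValue, domain: FeatureValue) -> bool:
    # A_k must be a sub-interval or a subset of its domain U_k.
    if isinstance(value, Interval) and isinstance(domain, Interval):
        return domain.contains(value)
    if isinstance(value, frozenset) and isinstance(domain, frozenset):
        return value <= domain
    return False  # mismatched feature types


def in_feature_space(obj: SymbolicObject, space: SymbolicObject) -> bool:
    # The object lies in U^(d) when every component lies in its own domain.
    return len(obj) == len(space) and all(
        component_in_domain(a, u) for a, u in zip(obj, space)
    )


# Feature space with d = 2: one quantitative and one qualitative feature.
U = (Interval(0.0, 100.0), frozenset({"red", "green", "blue"}))
A = (Interval(10.0, 20.0), frozenset({"red"}))
B = (Interval(15.0, 120.0), frozenset({"red", "blue"}))
```

Here `A` lies inside the feature space `U`, while `B` does not, because its interval extends beyond the domain of the first feature.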

Composite symbolic object formation

In conventional data analysis, whenever two samples that are merged are to be represented by a single sample, one of the frequently used methods is to use the mean of the two as a single representative. In symbolic data analysis, the concept of composite symbolic object (Gowda and Diday, 1991a; Bonner, 1969; Forgy, 1965; Macqueen, 1967; Lance and Williams, 1967; Ball and Hall, 1965; Wolf, 1970; Gowda and Diday, 1992) is used. A new method of forming composite symbolic object is proposed which is
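To make the contrast with the numerical mean concrete, the Python sketch below merges two symbolic objects feature by feature. It is only an assumed convention, not the new method proposed in the paper (whose description is cut off in this excerpt): intervals, written here as `(low, high)` tuples, are merged into the smallest interval covering both, and sets of labels are merged by union. The function names are hypothetical.

```python
def join_feature(a, b):
    """Merge one feature of two symbolic objects (assumed convention)."""
    if isinstance(a, tuple) and isinstance(b, tuple):
        # Quantitative feature: smallest interval covering both intervals.
        return (min(a[0], b[0]), max(a[1], b[1]))
    if isinstance(a, frozenset) and isinstance(b, frozenset):
        # Qualitative feature: union of the two sets of labels.
        return a | b
    raise TypeError("both values of a feature must be of the same kind")


def composite(obj_a, obj_b):
    """Form a single representative object from two symbolic objects."""
    if len(obj_a) != len(obj_b):
        raise ValueError("objects must be defined on the same number of features")
    return tuple(join_feature(a, b) for a, b in zip(obj_a, obj_b))
```

For example, merging `((1.0, 3.0), frozenset({"x"}))` with `((2.0, 5.0), frozenset({"y"}))` under this convention gives the interval `(1.0, 5.0)` for the first feature and the set containing `"x"` and `"y"` for the second.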

Symbolic genetic ISODATA algorithm

The central idea in most of the nonhierarchical methods of clustering is to select a set of seed points around which clusters may be grown and to alter cluster memberships so as to get a better partition. Different methods for selecting seed points are suggested by Hyvarinen (1963), Bonner (1969), Forgy (1965), Macqueen (1967), Lance and Williams (1967), Ball and Hall (1965), Wolf (1970). Forgy (1965) suggested an iterative method of computing the seed points as the centroids of a set of
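The seed-point idea can be sketched for ordinary numerical vectors as follows. This is a generic centroid iteration in the spirit of the method attributed to Forgy (1965), written under assumptions (squared Euclidean distance, seeds supplied by the caller, an empty cluster keeps its old seed); it is not the symbolic genetic ISODATA algorithm of the paper, and the function names are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def iterate_seeds(samples, seeds, max_iter=100):
    """Alternate between assigning samples to seeds and recomputing seeds."""
    seeds = [tuple(s) for s in seeds]
    labels = None
    for _ in range(max_iter):
        # Assign every sample to its nearest seed point.
        new_labels = [
            min(range(len(seeds)), key=lambda k: squared_distance(x, seeds[k]))
            for x in samples
        ]
        if new_labels == labels:
            break  # memberships no longer change: the partition is stable
        labels = new_labels
        # Recompute each seed as the centroid of the samples assigned to it.
        for k in range(len(seeds)):
            members = [x for x, lab in zip(samples, labels) if lab == k]
            if members:  # an empty cluster keeps its previous seed
                seeds[k] = centroid(members)
    return seeds, labels
```

With samples `[(0, 0), (0, 1), (10, 10), (10, 11)]` and initial seeds `[(0, 0), (10, 10)]`, the first pass assigns the first two samples to one cluster and the last two to the other, and the second pass confirms that the memberships are stable.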

Results of simulation

To demonstrate the efficacy of the proposed algorithm, several simulation studies were carried out, the results of which are given below. For comparison, the conventional ISODATA algorithm proposed by Duda and Hart (1973) was applied to the same data sets. The initial seed points of the classes were selected randomly. The results obtained using the proposed algorithm were examined for their validity using Hubert's τ statistics approach (Jain and Dubes,

Conclusion

A novel ISODATA clustering procedure for symbolic objects is developed using a distributed genetic algorithm. In contrast to the standard genetic algorithm, selection and mating are made within locally distributed subgroups of individuals rather than within the whole population. The efficacy of the proposed procedure is brought out by applying it to various data sets. A comparison is made with the conventional ISODATA procedure and the results are presented.

References (34)

  • Andrey, P., Tarroux, P., 1994. Unsupervised image segmentation using a distributed genetic algorithm. Pattern...
  • Ball, G.H., Hall, D.I., 1965. ISODATA – A novel method of data analysis and classification. Stanford Res. Inst.,...
  • Bock, H.H. (Ed.), 1987. Classification and Related Methods of Data Analysis. Amsterdam, The...
  • Bonner, R.E., 1969. On some clustering techniques. IBM Journal 8,...
  • Cheng, Y., Fu, K.S., 1983. Conceptual clustering in knowledge organisation. IEEE Transactions Pattern Analysis and...
  • Davis, L., 1987. Genetic Algorithms and Simulated Annealing....
  • Davis, L., 1991. Handbook of Genetic Algorithms. Van Nostrand...
  • Diday, E., 1990. From numerical to symbolic clustering. In: COMPSTAT- Dubrovnik, Yugoslavia, 9–15 September...
  • Diday, E. (Ed.), 1989. Data Analysis, Learning Symbolic and Numeric Knowledge. Nova Science Publishers, Antibes,...
  • Diday, E., Simon, J.C., 1976. Clustering Analysis: Communication and Cybernetics, Vol. 10. Springer, New York, pp....
  • Diday, E., Govaert, G., Lechevallier, Y., Siddi, J., 1980. Clustering in pattern recognition. In: Proceedings of the...
  • Diday, E., Hayashi, C., Jambu, M., Ohsumi, N. (Eds.), 1987. Recent Developments in Clustering and Data Analysis....
  • Duda, R.O., Hart, P.E., 1973. Pattern Classification and Scene Analysis. Wiley Interscience, New...
  • Fisher, D.H., 1987. Knowledge acquisition via incremental conceptual clustering. Machine Learning 2,...
  • Forgy, E.W., 1965. Cluster analysis of multivariate data. AAAS, Biometric Soc (WNAR),...
  • Goldberg, D.E., 1989. Genetic Algorithms in Search, Optimization, and Machine Learning. Addison-Wesley, New...
  • Gowda, K.C., Diday, E., 1991a. Symbolic clustering using a new dissimilarity measure. Pattern Recognition 24 (6),...