Investigating diversity of clustering methods: An empirical comparison

https://doi.org/10.1016/j.datak.2007.01.002

Abstract

The paper aims to shed some light on the question of why clustering algorithms, despite being quantitative and hence supposedly objective in nature, yield different and varied results. To do that, we took 10 common clustering algorithms and tested them on four known datasets, used in the literature as baselines with agreed-upon clusters. One additional method, Binary-Positive, developed by our team, was added to the analysis. The results affirm the unpredictable nature of the clustering process and point to the different assumptions made by different methods. One conclusion of the study is that the appropriate clustering method must be chosen carefully for any given application.

Introduction

The data mining field is advancing rapidly both in academia and in industry, and is at the technological forefront of data resource usage and management in organizations. Data mining deals with identifying interesting and useful patterns and associations within existing data, to reach new insights, which will add to the organization’s knowledge base. A variety of models and algorithms from different fields are being used, originating from statistics, artificial intelligence, neural networks, and databases, all coming under a general term – knowledge discovery and data mining (KDD).

Among the many techniques and models utilized in data mining are clustering and classification – two principal techniques which are the focus of empirical studies in this field. The goal of clustering is to organize the entities or objects into groups in a manner that results in minimal distances within each group and maximal distances between the various groups [6]. Cluster analysis is thus the division of a collection of records into groups, and it is based on similarity: intuitively, the similarity between records within a valid cluster is greater than the similarity between those records and records belonging to a different cluster. The variety of techniques for representing data, measuring proximity (similarity) between records, and grouping records (algorithms) has produced a rich and sometimes confusing assortment of methods. A clear distinction is thus needed between classification and clustering before we dwell on the latter.
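To make the distance criterion concrete, the sketch below computes the mean within-cluster distance and the mean between-centroid distance for a given partition; a good clustering keeps the former small relative to the latter. NumPy and SciPy are assumed here purely for illustration and are not part of the original study.

```python
import numpy as np
from scipy.spatial.distance import pdist

def within_between(X, labels):
    """Mean pairwise distance inside clusters vs. mean distance between centroids."""
    within, centroids = [], []
    for c in np.unique(labels):
        members = X[labels == c]
        if len(members) > 1:
            within.append(pdist(members).mean())   # intra-cluster distances
        centroids.append(members.mean(axis=0))
    between = pdist(np.vstack(centroids)).mean()    # inter-centroid distances
    return np.mean(within), between

# Toy data: two well-separated blobs -> small "within", large "between".
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(5, 0.5, (20, 2))])
labels = np.array([0] * 20 + [1] * 20)
print(within_between(X, labels))
```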

Classification is a supervised process – we are provided with a collection of pre-classified (labeled) records; the problem is to label the records that are as yet unlabeled. The labeled or “training” records are used to learn the attributes of a group, which in turn are used to label new records. Clustering, on the other hand, is an unsupervised process. The problem is to group a given collection of unlabeled records into meaningful clusters. While the categories in classification are set externally, categories in clustering are determined by the clustering process itself, i.e., the various categories are derived from the dataset itself.
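The contrast can be illustrated with a short sketch (scikit-learn is assumed here; it is not part of the original study): the classifier is trained on externally supplied labels, while the clustering algorithm receives only the records and derives the groups itself.

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier
from sklearn.cluster import KMeans

X, y = load_iris(return_X_y=True)

# Classification (supervised): the labels y are provided and used for training.
clf = KNeighborsClassifier().fit(X, y)
predicted_classes = clf.predict(X)

# Clustering (unsupervised): no labels are given; the groups emerge from the data.
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)
derived_clusters = km.labels_
```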

Clustering is useful in several exploratory pattern-analysis, grouping, decision-making, and machine learning situations. In many such problems, however, there is little prior information (e.g. statistical models) available about the data, and the decision maker has to make as few assumptions about the data as possible [13].

Despite being one of the central techniques in data mining, and despite its obvious advantages, clustering is used mainly in academic research and data analysis, and has yet to gain wide use in commerce. There are several reasons for the clustering technique being accepted so slowly. Firstly, there is a standardization problem: different clustering algorithms produce different clusters, and there is no clear-cut and standard method to compare them. Secondly, the interpretation of the various clusters formed, and their implementation in the original environment, are not well defined. Managers and business users can hardly tell what value they would attain by performing clustering with any given technique. Indeed, the clustering process is unpredictable and sometimes even inconsistent; different programs generally divide the same dataset differently. This diversity makes the clustering process difficult to rely on. Furthermore, there is no clear way to measure and evaluate the quality of a clustering algorithm [7].

A large number of clustering algorithms are available and reported in the literature. This can easily confound a user trying to choose an algorithm that suits his or her problem. Clustering techniques, while purely quantitative in nature, turn out to be quite unpredictable in their approach as well as their outcome. Different clustering techniques rarely produce similar clusters from the same data, a puzzling phenomenon. This is possibly because clustering algorithms are driven by different underlying assumptions, cluster formats, or data representations. Most of all, the outcome depends on the similarity measures and clustering criteria used in the process. This paper aims to shed some light on these unpredictable processes by taking a range of algorithms (11 of them) and applying them to the same datasets (4 such sets). In all we ran 44 clustering experiments. Such experiments may yield some conclusions about the clustering process itself.
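The phenomenon is easy to reproduce with modern tools (a hedged sketch using scikit-learn, which is not part of the original study): clustering the same data with different algorithms, or merely with different linkage choices, yields partitions that agree with the known classes to very different degrees.

```python
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans, AgglomerativeClustering
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

partitions = {
    "k-means":         KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X),
    "average linkage": AgglomerativeClustering(n_clusters=3, linkage="average").fit_predict(X),
    "single linkage":  AgglomerativeClustering(n_clusters=3, linkage="single").fit_predict(X),
}
for name, labels in partitions.items():
    # Agreement with the known Iris classes (1.0 = identical partition, ~0 = chance).
    print(f"{name:16s} ARI vs. known classes: {adjusted_rand_score(y, labels):.2f}")
```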

Section snippets

Study goals

The goal of this study is to compare different clustering methods using several datasets, to try to characterize the “better” methods (those whose results are closer to the known baseline results), and to try to analyze and explain the source of the differences between them. As mentioned above, different clustering algorithms produce different groups, and there is no clear-cut and standard method with which to compare them.

In addition to the standard clustering algorithms used in this study, we

Clustering algorithms

Below is a description of the various methods used in this study [24], [25]. The algorithms can be divided into two groups: standard algorithms, which work using the hierarchical method, and other algorithms that took part in this study, such as divisive methods, neural networks, and the Binary-Positive method. The following section includes a general description of each method and specifies the manner in which the records are clustered.
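As a minimal illustration of the standard hierarchical family mentioned above (SciPy is assumed here; the original study does not specify an implementation), the agglomerative variants differ mainly in the distance metric and the linkage criterion used to merge groups:

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (15, 4)), rng.normal(4, 1, (15, 4))])  # toy records

distances = pdist(X, metric="euclidean")            # pairwise proximity matrix
tree = linkage(distances, method="complete")        # e.g. complete-linkage merging
labels = fcluster(tree, t=2, criterion="maxclust")  # cut the tree into 2 clusters
```

Swapping `method="complete"` for `"single"`, `"average"`, or `"ward"` changes the merging rule and, typically, the resulting partition.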

Dataset descriptions

The following is a short description of the four datasets used in this study. The data come from a variety of fields and are familiar in the literature as baselines for clustering. They also present non-dichotomous problems, where each set can be divided into more than two clusters. As the clusters of these datasets are known, they provide a measure for evaluating the various clusters produced by the 11 tested methods described above.
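For instance, the “Iris” dataset named in the discussion below is widely distributed with analysis libraries; a sketch of treating its known labels as the evaluation baseline (scikit-learn assumed, not part of the original study):

```python
from sklearn.datasets import load_iris

X, known_labels = load_iris(return_X_y=True)   # 150 records, 3 known clusters
n_clusters = len(set(known_labels))            # target number of groups for each algorithm
```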

Implementation description

The comparison was made by applying the 11 clustering methods included in this study to the 4 datasets described above. For each algorithm on each dataset we performed the following steps (an illustrative sketch follows the list):

  1. Applying the algorithm to the dataset.

  2. Presenting the results in a cross-tab table.

  3. Converting the table into a score.
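A hedged sketch of steps 2 and 3 follows. The snippet above does not give the paper’s exact scoring rule, so the conversion here is an illustrative choice: each produced cluster is matched one-to-one to a known cluster so as to maximize agreement (pandas, SciPy, and scikit-learn are assumed).

```python
import pandas as pd
from scipy.optimize import linear_sum_assignment
from sklearn.datasets import load_iris
from sklearn.cluster import KMeans

def crosstab_score(known, produced):
    table = pd.crosstab(known, produced)               # step 2: cross-tab of known vs. produced clusters
    rows, cols = linear_sum_assignment(-table.values)  # best one-to-one cluster/class matching
    matched = table.values[rows, cols].sum()
    return table, matched / len(known)                 # step 3: fraction of records matched

# Example: score k-means output against the known Iris classes.
X, y = load_iris(return_X_y=True)
table, score = crosstab_score(y, KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X))
print(table, f"score = {score:.2f}", sep="\n")
```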

Graphic representation

The following figures illustrate the normalized results displayed in Table 3, using two graphs (a plotting sketch follows the list):

  1. Dataset score distribution: a graphical representation of the table rows (one line per method).

  2. Algorithm score distribution: a graphical representation of the table columns (one line per dataset).
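Assuming the normalized scores of Table 3 are held in a table with one row per method and one column per dataset, the two views can be produced as below (pandas and matplotlib assumed; the values shown are placeholders, not the scores reported in the paper):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Placeholder values for illustration only; they are NOT the scores of Table 3.
scores = pd.DataFrame(
    {"Iris": [0.9, 0.8], "Balance": [0.4, 0.3]},
    index=["method A", "method B"],
)

scores.T.plot(marker="o", title="Dataset score distribution (one line per method)")
scores.plot(marker="o", title="Algorithm score distribution (one line per dataset)")
plt.show()
```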

Discussion

There are obvious differences in the algorithms’ ability to cluster each dataset and to generate clusters similar to those of the original data. Indeed, the results of clustering the “Iris” data are much higher than those of any other dataset, affirming its popularity as a benchmark dataset for pattern recognition.

On the other hand, the results of clustering the “Balance” data are the lowest: all algorithms fail to produce clusters similar to those originally known. This can be explained by the way data

Summary and conclusions

In this study we performed 44 clustering runs to compare 11 different clustering algorithms on four known datasets. We observed that different clustering algorithms generate different groups, suggesting the unpredictable nature of the clustering process when attempting to match “known” and accepted clusters. We presented the results in a unified way, in both tables and graphs, thus adding some standardization to this area.

The study attempts to construct an objective method for results


References (25)

  • R. Gelbard et al., Hempel’s raven paradox: a positive approach to cluster analysis, Computers and Operations Research (2000).
  • D. Klahr et al., The representation of children’s knowledge.
  • R.S. Siegler, Three aspects of cognitive development, Cognitive Psychology (1976).
  • I. Spiegler et al., Storage and retrieval considerations of binary data bases, Information Processing and Management (1985).
  • S. Aeberhard, D. Coomans, O. de Vel, Comparison of classifiers in high dimensional settings, Technical Report no. ...
  • S. Aeberhard, D. Coomans, O. Vel, The classification performance of RDA, Technical Report no. 92-01, Dept. of Computer ...
  • Cheeseman et al.’s AUTOCLASS II, in: MLC Proceedings, 1988, pp. ...
  • B.V. Dasarathy, Nosing around the neighborhood: a new system structure and classification rule for recognition in partially exposed environments, IEEE Transactions on Pattern Analysis and Machine Intelligence (1980).
  • R.O. Duda et al., Pattern Classification and Scene Analysis (1973).
  • Z. Erlich et al., A Binary Positive Model for clustering and data mining, Information Systems Frontiers (2002).
  • Z. Erlich et al., Evaluating a positive attribute clustering model for data mining, Journal of Computer Information Systems (2003).
  • R.A. Fisher, The use of multiple measurements in taxonomic problems, Annual Eugenics (1936).

Roy Gelbard is a faculty member of the Information System Program at the Graduate School of Business Administration, Bar-Ilan University. He received his Ph.D. and M.Sc. degrees in Information Systems from Tel-Aviv University. He also holds degrees in Biology and Philosophy. His work involves data mining, data and knowledge representation, software engineering and software project management.

Orit Goldman is a doctoral student at the Graduate School of Management, Tel-Aviv University. She holds a B.Sc. in Computer Science and an MBA, both from Bar-Ilan University. She has worked at SPSS as a research methods consultant, dealing with the empirical applications of data mining and knowledge discovery. She is an experienced analyst in DM software such as Clementine.

Israel Spiegler is the Mexico Professor of Management Information Systems and Vice Dean of the Graduate School of Management, Tel-Aviv University. He holds M.Sc. and Ph.D. degrees in Computers and Information Systems from UCLA. He has held faculty positions at Boston University, the School of Information Science at Claremont Graduate University, and UCLA (visiting). He is currently a Visiting Scholar at Harvard University. His main areas of interest are data modeling, databases, artificial intelligence and knowledge management.
