Pattern Recognition

Volume 96, December 2019, 106969

A study of divisive clustering with Hausdorff distances for interval data

https://doi.org/10.1016/j.patcog.2019.106969

Highlights

  • Hausdorff, Gowda–Diday and Ichino–Yaguchi distances for intervals are compared.

  • Euclidean counterparts and their normalizations are included.

  • Advantages and disadvantages of the respective distances are summarized on the basis of simulation studies.

  • The simulation study shows local normalizations outperform global normalizations.

Abstract

Clustering methods are becoming key as analysts try to understand what knowledge is buried inside contemporary large data sets. This article analyzes the impact of six different Hausdorff distances on sets of multivariate interval data (where, for each dimension, an interval is an observation [a, b] with a ≤ b and with a and b taking values on the real line R1), used as the basis for Chavent’s [15, 16] divisive clustering algorithm. Advantages and disadvantages are summarized for each distance. Comparisons with two other distances for interval data, the Gowda–Diday and Ichino–Yaguchi measures, are included. All have specific strengths depending on the type of data present. Global normalization of a distance is not recommended, and care needs to be taken when using local normalizations to ensure the features of the underlying data sets are revealed. The study is based on sets of simulated data, and on a real data set.

Introduction

This work focuses on clustering for multivariate interval-valued data, where for each dimension interval observations assume the form X = [a, b], with a ≤ b and with a and b taking values on the real line R1. Such observations are examples of what Diday [24] introduced as symbolic data (observations broadly defined as hypercubes or Cartesian products of distributions in Rp, e.g., lists, intervals, histograms, and the like); see, e.g., Bock and Diday [10], Billard and Diday [7], Diday and Noirhomme-Fraiture [26], and [52], with a non-technical introduction in [6]. With the advent of modern computer capabilities and the attendant proliferation of massively large data sets, it is imperative that methodologies for symbolic data be developed. Indeed, Goodman [29] opined that symbolic data analysis was one of the two most important new avenues appearing over the horizon for the future of statistical science.

Hierarchical clustering can be either divisive or agglomerative; see, e.g., Anderberg [3]. Divisive clustering starts with all the observations in one cluster Ω and successively divides a cluster at each stage until a requisite number of clusters is achieved (with one possibility being that the final stage consists entirely of single element clusters). In contrast, in agglomerative clustering methods, the original clusters are merged two at a time until all the observations belong to a final cluster Ω. Except for pyramidal clustering (e.g., [25] and [11], but not covered in this work), clusters are non-overlapping and exhaustive. Typically, the divisive/merging criteria are based on dissimilarity or distance measures between observations or clusters. Our focus is on divisive clustering. When performing divisive clustering, we can partition the data according to either one single variable (monothetic method) or all variables simultaneously (polythetic method).
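To make the top-level divisive loop concrete, here is a minimal Python sketch of the skeleton described above. It is our own illustration, not Chavent’s algorithm: the choice of which cluster to split and the split rule itself (monothetic or polythetic) are placeholders supplied by the caller.

```python
# Minimal sketch of a generic divisive hierarchy (illustration only, not Chavent's criterion).
# A cluster is a list of observation indices; split_cluster is a user-supplied rule that
# returns two non-empty sub-clusters of the chosen cluster.

def divisive_clustering(observations, split_cluster, n_clusters):
    clusters = [list(range(len(observations)))]   # stage 1: all observations in one cluster Omega
    while len(clusters) < n_clusters:
        # a real criterion would pick the cluster whose split most reduces within-cluster
        # heterogeneity; picking the largest cluster here is only a stand-in for that choice
        target = max(clusters, key=len)
        if len(target) < 2:
            break                                  # nothing left to split
        left, right = split_cluster(observations, target)
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters
```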

Clustering methodology has seen extensive activity for classical data sets in recent decades; good coverage and reviews can be found in, e.g., [3], [9], [27], [30]. Much of the literature deals with agglomerative clustering, with some nice overall reviews in [38], [39], [49], [51], [59]. Other important contributions include [1], [2], [28], [47], [50], [54], [55], [56], [57], [60], [61], among others. Some papers look at questions revolving around initial seeds in k-means methods, e.g., [13], [48], [53], [58]. Indhu and Porkodi [35] conclude that the hierarchical method has greater accuracy than the k-means, density-based, or EM algorithms. Chang [14] uses principal components on a mixture of two normal distributions. These are for classical data.

Unfortunately, there are relatively few works on divisive clustering for symbolic data. Indeed, it was not until Chavent [15], Chavent [16] that the first divisive clustering method for interval data was introduced; this was a monothetic method for interval distances adapted from the original Hausdorff [33] distances between point observations. Yet, given the increasing prevalence of symbolic data sets, especially interval-valued data arising from aggregation of the massively large data sets produced in modern computing environments, it is becoming increasingly important that methods be developed to handle such data sets. Even without aggregation, examples abound; e.g., temperatures are recorded as daily minimum and maximum values, and stock market values are recorded as daily (low, high) or (opening, closing) values. Thus, a clustering algorithm applied to interval stock prices might identify which kinds of stocks are similar to each other (such as manufacturing, banking, media, and so on). It is known that analyses based on interval midpoints or means lose critical information (see, e.g., [5], [6]); therefore, answers are not necessarily correct when using classical point surrogates. While [42], [43] have introduced algorithms for histogram data (a polythetic and a double monothetic algorithm, respectively), the original [15] algorithm remains the prime algorithm for interval data, and that algorithm itself was restricted to the Hausdorff distance only. Therefore, the aim herein is to compare different distances (for which no comparisons currently exist) for this well-known and important algorithm.
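As a hypothetical illustration of how interval-valued observations arise by aggregation, the short sketch below builds (min, max) temperature intervals per station from point-valued daily readings; the data frame and column names are invented for the example.

```python
import pandas as pd

# Hypothetical example: aggregate point-valued daily temperatures into
# interval-valued (min, max) observations, one interval per station and month.
daily = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "month":   [1, 1, 1, 1, 1, 1],
    "temp":    [-3.2, 0.5, 2.1, 10.4, 12.8, 9.9],
})
intervals = daily.groupby(["station", "month"])["temp"].agg(["min", "max"])
print(intervals)   # each row is an interval observation [min, max]
```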

In contrast, many methods for obtaining partitions of interval data have been developed; see, e.g., [4], [17], [18], [20], [21], [22], [23], [36], [37], among others. More recently, partitioning algorithms for histogram-type data were introduced by, e.g., [46], and [44], [45]. Some of these are also based on the Hausdorff distance, some transform the interval to its corresponding center and/or end-points, some use Mallows’ L2 distances, and so on.

In our article, Chavent’s hierarchical divisive monothetic method is applied to interval-valued data with the implementation based on six different types of Hausdorff distances: the basic Hausdorff distance (of [15]), the Euclidean Hausdorff distance, the Global Span Normalized Hausdorff distance, the Local Span Normalized Hausdorff distance, the Global Normalized Hausdorff distance, and the Local Normalized Hausdorff distance; see Section 2.1.1. To date, the global normalized distances have not been applied to symbolic data. Nor have the advantages and disadvantages between these choices been discussed in the literature. As part of this comparison, we apply Chavent’s hierarchical divisive monothetic method and the different distances to seven different types of simulated data, in Section 3. It is noted that Hausdorff distances dominate clustering methodologies for interval data to date, primarily because of their simplicity and intuitive appeal.
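For orientation, the sketch below computes the coordinate-wise Hausdorff distance between two intervals, which equals the larger of the two endpoint differences, and aggregates it over the p variables in two ways (city-block and Euclidean). The span-normalized and normalized variants rescale each coordinate’s term; their exact factors are defined in Section 2.1.1 and are only stubbed here as a generic weights argument, so this is a sketch under that assumption rather than a reproduction of the paper’s formulas.

```python
import numpy as np

def hausdorff_coord(x, y):
    """Hausdorff distance between two intervals x = [a1, b1] and y = [a2, b2] in R."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

def basic_hausdorff(X, Y):
    """City-block aggregation of the coordinate-wise distances (the basic Hausdorff distance)."""
    return sum(hausdorff_coord(x, y) for x, y in zip(X, Y))

def euclidean_hausdorff(X, Y, weights=None):
    """L2 aggregation; 'weights' stands in for the (global or local) normalization
    factors of Section 2.1.1, whose exact definitions are not reproduced here."""
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    terms = [hausdorff_coord(x, y) / wj for (x, y), wj in zip(zip(X, Y), w)]
    return float(np.sqrt(np.sum(np.square(terms))))

# Example: two 2-dimensional interval observations
X = [(1.0, 3.0), (10.0, 14.0)]
Y = [(2.0, 5.0), (11.0, 13.0)]
print(basic_hausdorff(X, Y), euclidean_hausdorff(X, Y))   # 3.0 and sqrt(5) ~ 2.236
```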

There are other distances/dissimilarities that could be used for interval data, such as the Gowda–Diday [31], [32] dissimilarities, or the Ichino–Yaguchi [34] distances; these can also be used in a divisive clustering method. In our simulation study, these distances are compared with the Hausdorff distances. It is observed that they require longer computing times than the Hausdorff distances, primarily because of the complexity of their definitions. The Hausdorff distance, on the other hand, is easy to understand and to calculate and, more importantly, has lower computing costs than its counterparts. Hence, the primary focus of this paper is on the Hausdorff distance.
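To indicate why these measures carry higher computing costs, the following sketch shows one commonly cited form of the per-coordinate Ichino–Yaguchi dissimilarity between intervals. It is stated here only as an assumption for illustration (the authoritative definition is in [34]); note that it already requires join and meet lengths and a mixing parameter, rather than a single maximum of endpoint differences.

```python
def ichino_yaguchi_coord(x, y, gamma=0.5):
    """One commonly cited form of the per-coordinate Ichino-Yaguchi dissimilarity
    between intervals x = [a1, b1] and y = [a2, b2]; stated as an assumption for
    illustration -- see [34] for the authoritative definition."""
    a1, b1 = x
    a2, b2 = y
    join = max(b1, b2) - min(a1, a2)             # length of the smallest interval covering both
    meet = max(0.0, min(b1, b2) - max(a1, a2))   # length of the overlap (0 if disjoint)
    return join - meet + gamma * (2.0 * meet - (b1 - a1) - (b2 - a2))
```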

Our aim is to provide a comparative study of currently available distances and their extensions for interval-valued data sets, so as to obtain an intuitive sense of what distances may be more appropriate for particular settings. After the various Hausdorff distances are defined in Section 2.1.1 and the Gowda–Diday and Ichino–Yaguchi distances in Section 2.1.2, Section 2.2 gives the detailed algorithm for applying Chavent’s divisive clustering method to interval-valued data. Simulations are run in Section 3 in order to compare the different distances and to learn their respective advantages and disadvantages. While the algorithm is applicable for any dimension (p), for ease of presentation and illustration, these data sets are p=2-dimensional. Simulations are also run for p=10-dimensional observations for each of the different types of data sets studied. Further, for each of these, uncorrelated and correlated observations are considered. A real data set with p=13 variables is considered in Section 4. Some concluding remarks are in Section 5. Finally, an alternative aspect that needs attention when calculating local normalizations is discussed in the Supplementary Materials (Section S3).

Section snippets

Divisive clustering and Hausdorff distances

Suppose we have a domain Ω = Ω1 × ⋯ × Ωp ⊆ Rp and a set of n interval-valued observations, measured on p random variables, described by X(i) = (Xi1, …, Xip), i = 1, …, n, with Xij = [aij, bij], aij ≤ bij, j = 1, …, p. The goal is to divide Ω into R non-overlapping and exhaustive clusters C1, …, CR, with Cu containing nu observations. Typically, the clustering process is based on dissimilarities or distances between observations. In Section 2.1.1, the basic Hausdorff [33] distance between interval observations is
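The snippet is cut off at the definition itself. For reference, the Hausdorff distance between two real intervals reduces to the larger of the two endpoint differences; the city-block aggregation over the p variables shown below is written only as an assumed form of the basic distance of Section 2.1.1, in the notation above.

```latex
% Coordinate-wise Hausdorff distance between interval observations X(i) and X(i')
% on variable j (a standard identity for real intervals), together with an assumed
% city-block aggregation over the p variables.
\[
  d_j\bigl(X(i), X(i')\bigr) = \max\bigl(|a_{ij} - a_{i'j}|,\; |b_{ij} - b_{i'j}|\bigr),
  \qquad
  d\bigl(X(i), X(i')\bigr) = \sum_{j=1}^{p} d_j\bigl(X(i), X(i')\bigr).
\]
```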

Simulations

The different distances (of Section 2.1) are used with the Chavent divisive algorithm to cluster each of six different sets of simulated data, designed to illustrate six differing types of data sets. Since, as is seen from the relevant tables, the Gowda–Diday distances and the Ichino–Yaguchi distances are considerably slower than the Hausdorff distances, for conciseness this discussion will tend to focus on the Hausdorff distances per se. However, CER and computing time results

Application

Table 22 shows temperature data for 60 weather stations in China in 1988. [Tables 19–22 and Fig. 13 are in the Supplementary Materials, Section S2.] The data set consists of minimum and maximum temperatures for each month, with variables X1–X12 representing January through December, and X13 the elevation. Temperatures are in degrees Celsius. Each observation is of equal weight here. These data are extracted from a larger data set which contains observations for many more stations, more variables

Conclusion

Chavent [15] introduced a divisive clustering method for interval-valued data, based on the basic Hausdorff distance, universally accepted to date for interval data. While distances can be adapted in various ways, such as normalizing or not (via a variety of possible normalizations), global or local weighting options, and so forth, there were no definitive studies to compare these various options. This article addresses this deficiency by comparing six different Hausdorff distances applied to

References (61)

  • J.F. Lu et al., Hierarchical initialization approach for K-means clustering, Pattern Recognit. Lett. (2008)
  • T.S. Madhulatha, An overview of clustering methods, IOSR J. Eng. (2012)
  • P. Praveen et al., A study on monothetic divisive hierarchical clustering method, Int. J. Adv. Scientif. Technol. Eng. Manag. Sci. (2017)
  • D. Steinley, k-means clustering: a half-century synthesis, Br. J. Math. Stat. Psychol. (2006)
  • Z. Abdullah et al., Hierarchical clustering algorithms in data mining, Int. J. Comput. Inf. Eng. (2015)
  • M.R. Anderberg, Cluster Analysis for Applications (1973)
  • V. Batagelj, Generalized Ward and related clustering problems
  • L. Billard et al., Sample covariance functions for complex quantitative data, Proceedings World Congress, International Association of Statistical Computing (2008)
  • L. Billard, Brief overview of symbolic data and analytic issues, Stat. Anal. Data Min. (2011)
  • L. Billard et al., Symbolic Data Analysis: Conceptual Statistics and Data Mining (2006)
  • L. Billard et al., Principal component analysis for interval data, Wiley Interdiscip. Rev. (2012)
  • H.H. Bock, Clustering methods: a history of k-means algorithms
  • H.H. Bock et al., Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (2000)
  • M.P. Brito et al., Pyramidal representation of symbolic objects
  • V. Cariou et al., Generalization method when manipulating relational databases, Revue des Nouvelles Technologies de l’Information (2015)
  • M.E. Celebi et al., A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl. (2013)
  • M. Chavent, A monothetic clustering method, Pattern Recognit. Lett. (1998)
  • M. Chavent et al., New clustering methods for interval data, Comput. Stat. (2006)
  • M. Chavent et al., Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance, Classification, Clustering, and Data Analysis (2002)
  • Y. Chen, Symbolic Data Regression and Clustering (2014)