A study of divisive clustering with Hausdorff distances for interval data
Introduction
This work focuses on clustering for multivariate interval-valued data, where for each dimension interval observations take the form [a, b] with a ≤ b, and with a and b taking values on the real line. Such observations are examples of what Diday [24] introduced as symbolic data (observations broadly defined as hypercubes or Cartesian products of distributions, e.g., lists, intervals, histograms, and the like); see, e.g., Bock and Diday [10], Billard and Diday [7], Diday and Noirhomme-Fraiture [26], and [52], with a non-technical introduction in [6]. With the advent of modern computer capabilities and the attendant proliferation of massively large data sets, it is imperative that methodologies for symbolic data be developed. Indeed, Goodman [29] opined that symbolic data analysis was one of the two most important new avenues appearing over the horizon for the future of statistical science.
Hierarchical clustering can be either divisive or agglomerative; see, e.g., Anderberg [3]. Divisive clustering starts with all the observations in one cluster Ω and successively divides a cluster at each stage until a requisite number of clusters is achieved (with one possibility being that the final stage consists entirely of single-element clusters). In contrast, in agglomerative clustering methods, the original clusters are merged two at a time until all the observations belong to a final cluster Ω. Except for pyramidal clustering (e.g., [25] and [11], but not covered in this work), clusters are non-overlapping and exhaustive. Typically, the division/merging criteria are based on dissimilarity or distance measures between observations or clusters. Our focus is on divisive clustering. When performing divisive clustering, we can partition the data according to either one single variable (monothetic method) or all variables simultaneously (polythetic method).
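To make the monothetic idea concrete, the sketch below performs one divisive split by searching over every (variable, cut point) question "X_j ≤ c". This is only an illustration on classical point data using a within-cluster inertia criterion; Chavent's actual method, described in Section 2.2, operates on interval data with a Hausdorff-based criterion, and the function names here are hypothetical.

```python
import numpy as np

def within_inertia(X):
    """Sum of squared deviations from the cluster mean (illustrative criterion)."""
    return ((X - X.mean(axis=0)) ** 2).sum()

def best_monothetic_split(X):
    """Try every monothetic question 'X_j <= c'; keep the split that
    minimizes the summed within-cluster inertia of the two children."""
    best = (np.inf, None, None)  # (score, variable index, cut point)
    n, p = X.shape
    for j in range(p):
        # candidate cuts: all observed values except the maximum,
        # so both children are always non-empty
        for c in np.unique(X[:, j])[:-1]:
            mask = X[:, j] <= c
            score = within_inertia(X[mask]) + within_inertia(X[~mask])
            if score < best[0]:
                best = (score, j, c)
    return best
```

Applying this recursively to the child clusters, always re-splitting the cluster whose split most reduces the criterion, yields the divisive hierarchy described above.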
Clustering methodology has seen extensive activity for classical data sets in recent decades; good coverage and reviews can be found in, e.g., [3], [9], [27], [30]. Much of the literature deals with agglomerative clustering, with some nice overall reviews in [38], [39], [49], [51], [59]. Other important contributions include [1], [2], [28], [47], [50], [54], [55], [56], [57], [60], [61], among others. Some papers look at questions revolving around initial seeds in k-means methods, e.g., [13], [48], [53], [58]. Indhu and Porkodi [35] conclude that the hierarchical method has greater accuracy than the k-means, density-based or EM algorithms. Chang [14] uses principal components on a mixture of two normal distributions. These are for classical data.
Unfortunately, there are relatively few works on divisive clustering for symbolic data. Indeed, it was not until Chavent [15], Chavent [16] that the first divisive clustering method for interval data was introduced; this was a monothetic method for interval distances adapted from the original Hausdorff [33] distances between point observations. Yet, given the increasing prevalence of symbolic data sets, especially interval-valued data arising from aggregation of the massively large data sets of modern computer environments, it is becoming increasingly important that methods be developed to handle such data sets. Even without aggregation, examples abound; e.g., temperatures are recorded as daily minimum and maximum values, and stock market values are recorded as daily (low, high) or (opening, closing) values. Thus, a clustering algorithm applied to interval stock prices might identify which kinds of stocks are similar to each other (such as manufacturing, banking, media, and so on). It is known that analyses based on interval midpoints or means lose critical information (see, e.g., [5], [6]); therefore, answers are not necessarily correct when using classical point surrogates. While [42], [43] have introduced algorithms for histogram data (a polythetic and a double monothetic algorithm, respectively), the original [15] algorithm remains the prime algorithm for interval data, and that itself was restricted to the Hausdorff distance only. Therefore, the aim herein is to compare different distances (for which no comparisons currently exist) for this well-known and important algorithm.
In contrast, many methods for obtaining partitions of interval data have been developed; see, e.g., [4], [17], [18], [20], [21], [22], [23], [36], [37], among others. More recently, partitioning algorithms for histogram-type data were introduced by, e.g., [46], and [44], [45]. Some of these are also based on the Hausdorff distance, some transform the interval to its corresponding center and/or end-points, some use Mallows’ L2 distances, and so on.
In our article, Chavent’s hierarchical divisive monothetic method is applied to interval-valued data with the implementation based on six different types of Hausdorff distances: the basic Hausdorff distance (of [15]), the Euclidean Hausdorff distance, the Global Span Normalized Hausdorff distance, the Local Span Normalized Hausdorff distance, the Global Normalized Hausdorff distance, and the Local Normalized Hausdorff distance; see Section 2.1.1. To date, the global normalized distances have not been applied to symbolic data, nor have the advantages and disadvantages of these choices been discussed in the literature. As part of this comparison, we apply Chavent’s hierarchical divisive monothetic method and the different distances to seven different types of simulated data, in Section 3. It is noted that Hausdorff distances dominate clustering methodologies for interval data to date, primarily because of their simplicity and intuitive appeal.
There are other distances/dissimilarities that could be used for interval data, such as the Gowda–Diday [31], [32] dissimilarities, or the Ichino–Yaguchi [34] distances; these can also be used in a divisive clustering method. In our simulation study, these distances are compared with the Hausdorff distances. It is observed that they require longer computing times than the Hausdorff distances, primarily due to the complexity of their definitions. The Hausdorff distance, on the other hand, is easy to understand and to calculate, and, more importantly, has lower computing costs than its counterparts. Hence, the primary focus of this paper is on the Hausdorff distance.
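As a hedged sketch of why these definitions are more involved, the following implements the Ichino–Yaguchi distance as it is commonly stated for intervals, built from the join (span of the union) and meet (overlap) of each pair of intervals, with a tuning parameter γ ∈ [0, 0.5]; the function name and input layout are our own, not from [34].

```python
def ichino_yaguchi(x, y, gamma=0.5):
    """Ichino-Yaguchi distance between two p-dimensional interval
    observations, each a sequence of (a, b) endpoint pairs, summed
    over the p dimensions (standard-form sketch, names hypothetical)."""
    total = 0.0
    for (a1, b1), (a2, b2) in zip(x, y):
        join = max(b1, b2) - min(a1, a2)            # length of A (+) B
        meet = max(0.0, min(b1, b2) - max(a1, a2))  # length of A (x) B
        total += join - meet + gamma * (2 * meet - (b1 - a1) - (b2 - a2))
    return total
```

Each dimension requires computing both a join and a meet (and, for the Gowda–Diday components, additional span and position terms), which is the source of the extra computing cost relative to the single max-of-endpoint-differences needed for the Hausdorff distance.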
Our aim is to provide a comparative study of currently available distances and their extensions for interval-valued data sets, so as to obtain an intuitive sense of which distances may be more appropriate for particular settings. After the various Hausdorff distances are defined in Section 2.1.1 and the Gowda–Diday and Ichino–Yaguchi distances in Section 2.1.2, Section 2.2 gives the detailed algorithm for applying Chavent’s method to interval-valued data. Simulations are run in Section 3 in order to compare the different distances and to learn their respective advantages and disadvantages. While the algorithm is applicable for any dimension (p), for ease of presentation and illustration, these data sets are -dimensional. Simulations are also run for -dimensional observations for each of the different types of data sets studied. Further, for each of these, uncorrelated and correlated observations are considered. A real data set with variables is considered in Section 4. Some concluding remarks are in Section 5. Finally, an alternative aspect that needs attention when calculating local normalizations is discussed in the Supplementary Materials (Section S3).
Section snippets
Divisive clustering and Hausdorff distances
Suppose we have a domain Ω and a set of n interval-valued observations, measured on p random variables, described by Yi = ([ai1, bi1], …, [aip, bip]) with aij ≤ bij, j = 1, …, p, i = 1, …, n. The goal is to divide Ω into R non-overlapping and exhaustive clusters C1, …, CR, with Cu containing nu observations. Typically, the clustering process is based on dissimilarities or distances between observations. In Section 2.1.1, the basic Hausdorff [33] distance between interval observations is
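For intuition, a minimal sketch of the per-dimension Hausdorff distance between two such observations follows, assuming each observation is stored as a (p, 2) array of endpoint pairs [aij, bij], and assuming the basic multivariate form sums the per-dimension distances while the Euclidean variant takes the square root of their sum of squares (function names are illustrative, not from [15]):

```python
import numpy as np

def hausdorff_interval(x, y):
    """Basic Hausdorff distance between two p-dimensional interval
    observations: sum over j of max(|a_xj - a_yj|, |b_xj - b_yj|)."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    per_dim = np.maximum(np.abs(x[:, 0] - y[:, 0]),
                         np.abs(x[:, 1] - y[:, 1]))
    return per_dim.sum()

def euclidean_hausdorff_interval(x, y):
    """Euclidean variant: root of the summed squared per-dimension
    Hausdorff distances."""
    x, y = np.asarray(x, float), np.asarray(y, float)
    per_dim = np.maximum(np.abs(x[:, 0] - y[:, 0]),
                         np.abs(x[:, 1] - y[:, 1]))
    return np.sqrt((per_dim ** 2).sum())
```

The normalized variants of Section 2.1.1 rescale these per-dimension terms by global or local span/dispersion factors before aggregating.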
Simulations
The different distances (of Section 2.1) are used to cluster data by applying the Chavent divisive algorithm to each of six sets of simulated data designed to illustrate six differing types of data sets. Since, as is seen from the relevant tables, the Gowda–Diday distances and the Ichino–Yaguchi distances are considerably slower than the Hausdorff distances, for conciseness, this discussion will tend to focus on the Hausdorff distances per se. However, CER and computing time results
Application
Table 22 shows temperature data for 60 weather stations in China in 1988. [Tables 19–22 and Fig. 13 are in the Supplementary Materials, Section S2.] The data set consists of minimum and maximum temperatures for each month, with variables X1, …, X12 representing January through December, and X13 is the elevation. Temperatures are in degrees Celsius. Each observation is of equal weight here. These data are extracted from a larger data set which contains observations for many more stations, more variables
Conclusion
Chavent [15] introduced a divisive clustering method for interval-valued data, based on the basic Hausdorff distance, which has been universally accepted to date for interval data. While distances can be adapted in various ways, such as normalizing or not (via a variety of possible normalizations), global or local weighting options, and so forth, there were no definitive studies comparing these various options. This article addresses this deficiency by comparing six different Hausdorff distances applied to
References (61)
- Rapid and brief communication: efficient clustering of large data sets, Pattern Recognit. (2001)
- On using principal components before separating a mixture of two multivariate normal distributions, Appl. Stat. (1983)
- Criterion-based divisive clustering for symbolic data
- Dynamic cluster methods for interval data based on Mahalanobis distances, Classification, Clustering, and Data Mining Applications (2004)
- Symbolic Data Analysis and the SODAS Software (2008)
- Pattern Recognition (2001)
- Dynamic clustering of histogram data based on adaptive squared Wasserstein distances, Expert Syst. Appl. (2014)
- Data clustering: 50 years beyond K-means, Pattern Recognit. Lett. (2010)
- Data clustering: a review, ACM Comput. Surv. (1999)
- Double monothetic clustering for histogram-valued data, Commun. Stat. Appl. Methods (2018)
- Hierarchical initialization approach for K-means clustering, Pattern Recognit. Lett.
- An overview of clustering methods, IOSR J. Eng.
- A study on monothetic divisive hierarchical clustering method, Int. J. Adv. Scientif. Technol. Eng. Manag. Sci.
- k-means clustering: a half-century synthesis, Br. J. Math. Stat. Psychol.
- Hierarchical clustering algorithms in data mining, Int. J. Comput. Inf. Eng.
- Cluster Analysis for Applications
- Generalized Ward and related clustering problems
- Sample covariance functions for complex quantitative data, Proceedings World Congress, International Association of Statistical Computing
- Brief overview of symbolic data and analytic issues, Stat. Anal. Data Min.
- Symbolic Data Analysis: Conceptual Statistics and Data Mining
- Principal component analysis for interval data, Wiley Interdiscip. Rev.
- Clustering methods: a history of k-means algorithms
- Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data
- Pyramidal representation of symbolic objects
- Generalization method when manipulating relational databases, Revue des Nouvelles Technologies de l’Information
- A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl.
- A monothetic clustering method, Pattern Recognit. Lett.
- New clustering methods for interval data, Comput. Stat.
- Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance, Classification, Clustering, and Data Analysis
- Symbolic Data Regression and Clustering
Cited by (11)
- Ordinal classification for interval-valued data and interval-valued functional data, Expert Systems with Applications (2024)
- Mixed data clustering based on a number of similar features, Pattern Recognition (2023)
- AGURF: An adaptive general unified representation frame for imbalanced interval-valued data, Information Sciences (2023)
- Soft subspace clustering of interval-valued data with regularizations, Knowledge-Based Systems (2021)