Pattern Recognition

Volume 96, December 2019, 106969

A study of divisive clustering with Hausdorff distances for interval data

https://doi.org/10.1016/j.patcog.2019.106969

Highlights

  • Hausdorff, Gowda–Diday and Ichino–Yaguchi distances for intervals are compared.

  • Euclidean counterparts and their normalizations are included.

  • Advantages and disadvantages of the respective distances are summarized on the basis of simulation studies.

  • The simulation study shows local normalizations outperform global normalizations.

Abstract

Clustering methods are becoming key as analysts try to understand what knowledge is buried inside contemporary large data sets. This article analyzes the impact of six different Hausdorff distances on sets of multivariate interval data (where, for each dimension, an interval is an observation [a, b] with a ≤ b and with a and b taking values on the real line R1), used as the basis for Chavent’s [15, 16] divisive clustering algorithm. Advantages and disadvantages are summarized for each distance. Comparisons with two other distances for interval data, the Gowda–Diday and Ichino–Yaguchi measures, are included. All have specific strengths depending on the type of data present. Global normalization of a distance is not recommended, and care needs to be taken when using local normalizations to ensure the features of the underlying data sets are revealed. The study is based on sets of simulated data, and on a real data set.

Introduction

This work focuses on clustering for multivariate interval-valued data, where for each dimension interval observations assume the form X = [a, b], with a ≤ b and with a and b taking values on the real line R1. Such observations are examples of what Diday [24] introduced as symbolic data (observations broadly defined as hypercubes or Cartesian products of distributions in Rp, e.g., lists, intervals, histograms, and the like); see, e.g., Bock and Diday [10], Billard and Diday [7], Diday and Noirhomme-Fraiture [26], and [52], with a non-technical introduction in [6]. With the advent of modern computer capabilities and the attendant proliferation of massively large data sets, it is imperative that methodologies for symbolic data be developed. Indeed, Goodman [29] opined that symbolic data analysis was one of the two most important new avenues appearing over the horizon for the future of statistical science.

Hierarchical clustering can be either divisive or agglomerative; see, e.g., Anderberg [3]. Divisive clustering starts with all the observations in one cluster Ω and successively divides a cluster at each stage until a requisite number of clusters is achieved (with one possibility being that the final stage consists entirely of single element clusters). In contrast, in agglomerative clustering methods, the original clusters are merged two at a time until all the observations belong to a final cluster Ω. Except for pyramidal clustering (e.g., [25] and [11], but not covered in this work), clusters are non-overlapping and exhaustive. Typically, the divisive/merging criteria are based on dissimilarity or distance measures between observations or clusters. Our focus is on divisive clustering. When performing divisive clustering, we can partition the data according to either one single variable (monothetic method) or all variables simultaneously (polythetic method).
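To make the top-level divisive loop concrete, here is a minimal Python sketch of the skeleton described above. It is our own illustration, not Chavent’s algorithm: the choice of which cluster to split and the split rule itself (monothetic or polythetic) are placeholders supplied by the caller.

```python
# Minimal sketch of a generic divisive hierarchy (illustration only, not Chavent's criterion).
# A cluster is a list of observation indices; split_cluster is a user-supplied rule that
# returns two non-empty sub-clusters of the chosen cluster.

def divisive_clustering(observations, split_cluster, n_clusters):
    clusters = [list(range(len(observations)))]   # stage 1: all observations in one cluster Omega
    while len(clusters) < n_clusters:
        # a real criterion would pick the cluster whose split most reduces within-cluster
        # heterogeneity; picking the largest cluster here is only a stand-in for that choice
        target = max(clusters, key=len)
        if len(target) < 2:
            break                                  # nothing left to split
        left, right = split_cluster(observations, target)
        clusters.remove(target)
        clusters.extend([left, right])
    return clusters
```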

Clustering methodology has seen extensive activity for classical data sets in recent decades; good coverage and reviews can be found in, e.g., [3], [9], [27], [30]. Much of the literature deals with agglomerative clustering, with some nice overall reviews in [38], [39], [49], [51], [59]. Other important contributions include [1], [2], [28], [47], [50], [54], [55], [56], [57], [60], [61], among others. Some papers look at questions revolving around initial seeds in k-means methods, e.g., [13], [48], [53], [58]. Indhu and Porkodi [35] conclude that the hierarchical method has greater accuracy than the k-means, density-based, or EM algorithms. Chang [14] uses principal components on a mixture of two normal distributions. These are for classical data.

Unfortunately, there are relatively few works on divisive clustering for symbolic data. Indeed, it was not until Chavent [15], Chavent [16] that the first divisive clustering method for interval data was introduced; this was a monothetic method for interval distances adapted from the original Hausdorff [33] distances between point observations. Yet, given the increasing prevalence of symbolic data sets, especially interval-valued data arising from aggregation of the massively large data sets produced in modern computing environments, it is becoming increasingly important that methods be developed to handle such data sets. Even without aggregation, examples abound; e.g., temperatures are recorded as daily minimum and maximum values, and stock market values are recorded as daily (low, high) or (opening, closing) values. Thus, a clustering algorithm applied to interval stock prices might identify which kinds of stocks are similar to each other (such as manufacturing, banking, media, and so on). It is known that analyses based on interval midpoints or means lose critical information (see, e.g., [5], [6]); therefore, answers are not necessarily correct when using classical point surrogates. While [42], [43] have introduced algorithms for histogram data (a polythetic and a double monothetic algorithm, respectively), the original [15] algorithm remains the prime algorithm for interval data, and that algorithm itself was restricted to the Hausdorff distance only. Therefore, the aim herein is to compare different distances (for which no comparisons currently exist) for this well-known and important algorithm.
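As a hypothetical illustration of how interval-valued observations arise by aggregation, the short sketch below builds (min, max) temperature intervals per station from point-valued daily readings; the data frame and column names are invented for the example.

```python
import pandas as pd

# Hypothetical example: aggregate point-valued daily temperatures into
# interval-valued (min, max) observations, one interval per station and month.
daily = pd.DataFrame({
    "station": ["A", "A", "A", "B", "B", "B"],
    "month":   [1, 1, 1, 1, 1, 1],
    "temp":    [-3.2, 0.5, 2.1, 10.4, 12.8, 9.9],
})
intervals = daily.groupby(["station", "month"])["temp"].agg(["min", "max"])
print(intervals)   # each row is an interval observation [min, max]
```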

In contrast, many methods for obtaining partitions of interval data have been developed; see, e.g., [4], [17], [18], [20], [21], [22], [23], [36], [37], among others. More recently, partitioning algorithms for histogram-type data were introduced by, e.g., [46], and [44], [45]. Some of these are also based on the Hausdorff distance, some transform the interval to its corresponding center and/or end-points, some use Mallows’ L2 distances, and so on.

In our article, Chavent’s hierarchical divisive monothetic method is applied to interval-valued data with the implementation based on six different types of Hausdorff distances: the basic Hausdorff distance (of [15]), the Euclidean Hausdorff distance, the Global Span Normalized Hausdorff distance, the Local Span Normalized Hausdorff distance, the Global Normalized Hausdorff distance, and the Local Normalized Hausdorff distance; see Section 2.1.1. To date, the global normalized distances have not been applied to symbolic data. Nor have the advantages and disadvantages between these choices been discussed in the literature. As part of this comparison, we apply Chavent’s hierarchical divisive monothetic method and the different distances to seven different types of simulated data, in Section 3. It is noted that Hausdorff distances dominate clustering methodologies for interval data to date, primarily because of their simplicity and intuitive appeal.
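For orientation, the sketch below computes the coordinate-wise Hausdorff distance between two intervals, which equals the larger of the two endpoint differences, and aggregates it over the p variables in two ways (city-block and Euclidean). The span-normalized and normalized variants rescale each coordinate’s term; their exact factors are defined in Section 2.1.1 and are only stubbed here as a generic weights argument, so this is a sketch under that assumption rather than a reproduction of the paper’s formulas.

```python
import numpy as np

def hausdorff_coord(x, y):
    """Hausdorff distance between two intervals x = [a1, b1] and y = [a2, b2] in R."""
    return max(abs(x[0] - y[0]), abs(x[1] - y[1]))

def basic_hausdorff(X, Y):
    """City-block aggregation of the coordinate-wise distances (the basic Hausdorff distance)."""
    return sum(hausdorff_coord(x, y) for x, y in zip(X, Y))

def euclidean_hausdorff(X, Y, weights=None):
    """L2 aggregation; 'weights' stands in for the (global or local) normalization
    factors of Section 2.1.1, whose exact definitions are not reproduced here."""
    w = np.ones(len(X)) if weights is None else np.asarray(weights, dtype=float)
    terms = [hausdorff_coord(x, y) / wj for (x, y), wj in zip(zip(X, Y), w)]
    return float(np.sqrt(np.sum(np.square(terms))))

# Example: two 2-dimensional interval observations
X = [(1.0, 3.0), (10.0, 14.0)]
Y = [(2.0, 5.0), (11.0, 13.0)]
print(basic_hausdorff(X, Y), euclidean_hausdorff(X, Y))   # 3.0 and sqrt(5) ~ 2.236
```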

There are other distances/dissimilarities that could be used for interval data, such as the Gowda–Diday [31], [32] dissimilarities, or the Ichino–Yaguchi [34] distances; these can also be used in a divisive clustering method. In our simulation study, these distances are compared with the Hausdorff distances. It is observed that they require longer computing times than the Hausdorff distances, primarily because of the complexity of their definitions. The Hausdorff distance, on the other hand, is easy to understand and to calculate and, more importantly, has lower computing costs than its counterparts. Hence, the primary focus of this paper is on the Hausdorff distance.
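To indicate why these measures carry higher computing costs, the following sketch shows one commonly cited form of the per-coordinate Ichino–Yaguchi dissimilarity between intervals. It is stated here only as an assumption for illustration (the authoritative definition is in [34]); note that it already requires join and meet lengths and a mixing parameter, rather than a single maximum of endpoint differences.

```python
def ichino_yaguchi_coord(x, y, gamma=0.5):
    """One commonly cited form of the per-coordinate Ichino-Yaguchi dissimilarity
    between intervals x = [a1, b1] and y = [a2, b2]; stated as an assumption for
    illustration -- see [34] for the authoritative definition."""
    a1, b1 = x
    a2, b2 = y
    join = max(b1, b2) - min(a1, a2)             # length of the smallest interval covering both
    meet = max(0.0, min(b1, b2) - max(a1, a2))   # length of the overlap (0 if disjoint)
    return join - meet + gamma * (2.0 * meet - (b1 - a1) - (b2 - a2))
```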

Our aim is to provide a comparative study of currently available distances and their extensions for interval-valued data sets, so as to obtain an intuitive sense of what distances may be more appropriate for particular settings. After the various Hausdorff distances are defined in Section 2.1.1 and the Gowda–Diday and Ichino–Yaguchi distances in Section 2.1.2, Section 2.2 gives the detailed algorithm for applying Chavent’s divisive clustering method to interval-valued data. Simulations are run in Section 3 in order to compare the different distances and to learn their respective advantages and disadvantages. While the algorithm is applicable for any dimension (p), for ease of presentation and illustration, these data sets are p=2-dimensional. Simulations are also run for p=10-dimensional observations for each of the different types of data sets studied. Further, for each of these, uncorrelated and correlated observations are considered. A real data set with p=13 variables is considered in Section 4. Some concluding remarks are in Section 5. Finally, an alternative aspect that needs attention when calculating local normalizations is discussed in the Supplementary Materials (Section S3).

Section snippets

Divisive clustering and Hausdorff distances

Suppose we have a domain Ω = Ω1 × ⋯ × Ωp ⊆ Rp and a set of n interval-valued observations, measured on p random variables, described by X(i) = (Xi1, …, Xip), i = 1, …, n, with Xij = [aij, bij], aij ≤ bij, j = 1, …, p. The goal is to divide Ω into R non-overlapping and exhaustive clusters C1, …, CR, with Cu containing nu observations. Typically, the clustering process is based on dissimilarities or distances between observations. In Section 2.1.1, the basic Hausdorff [33] distance between interval observations is
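The snippet is cut off at the definition itself. For reference, the Hausdorff distance between two real intervals reduces to the larger of the two endpoint differences; the city-block aggregation over the p variables shown below is written only as an assumed form of the basic distance of Section 2.1.1, in the notation above.

```latex
% Coordinate-wise Hausdorff distance between interval observations X(i) and X(i')
% on variable j (a standard identity for real intervals), together with an assumed
% city-block aggregation over the p variables.
\[
  d_j\bigl(X(i), X(i')\bigr) = \max\bigl(|a_{ij} - a_{i'j}|,\; |b_{ij} - b_{i'j}|\bigr),
  \qquad
  d\bigl(X(i), X(i')\bigr) = \sum_{j=1}^{p} d_j\bigl(X(i), X(i')\bigr).
\]
```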

Simulations

The different distances (of Section 2.1) are used with the Chavent divisive algorithm to cluster each of six different sets of simulated data, designed to illustrate six differing types of data sets. Since, as is seen from the relevant tables, the Gowda–Diday distances and the Ichino–Yaguchi distances are considerably slower than the Hausdorff distances, for conciseness this discussion will tend to focus on the Hausdorff distances per se. However, CER and computing time results

Application

Table 22 shows temperature data for 60 weather stations in China in 1988. [Tables 19–22 and Fig. 13 are in the Supplementary Materials, Section S2.] The data set consists of minimum and maximum temperatures for each month, with variables X1–X12 representing January through December, and X13 the elevation. Temperatures are in degrees Celsius. Each observation is of equal weight here. These data are extracted from a larger data set which contains observations for many more stations, more variables

Conclusion

Chavent [15] introduced a divisive clustering method for interval-valued data, based on the basic Hausdorff distance, universally accepted to date for interval data. While distances can be adapted in various ways, such as normalizing or not (via a variety of possible normalizations), global or local weighting options, and so forth, there were no definitive studies to compare these various options. This article addresses this deficiency by comparing six different Hausdorff distances applied to

References (61)

  • J.F. Lu et al., Hierarchical initialization approach for K-means clustering, Pattern Recognit. Lett. (2008)
  • T.S. Madhulatha, An overview of clustering methods, IOSR J. Eng. (2012)
  • P. Praveen et al., A study on monothetic divisive hierarchical clustering method, Int. J. Adv. Scientif. Technol. Eng. Manag. Sci. (2017)
  • D. Steinley, k-means clustering: a half-century synthesis, Br. J. Math. Stat. Psychol. (2006)
  • Z. Abdullah et al., Hierarchical clustering algorithms in data mining, Int. J. Comput. Inf. Eng. (2015)
  • M.R. Anderberg, Cluster Analysis for Applications (1973)
  • V. Batagelj, Generalized Ward and related clustering problems
  • L. Billard et al., Sample covariance functions for complex quantitative data, Proceedings World Congress, International Association of Statistical Computing (2008)
  • L. Billard, Brief overview of symbolic data and analytic issues, Stat. Anal. Data Min. (2011)
  • L. Billard et al., Symbolic Data Analysis: Conceptual Statistics and Data Mining (2006)
  • L. Billard et al., Principal component analysis for interval data, Wiley Interdiscip. Rev. (2012)
  • H.H. Bock, Clustering methods: a history of k-means algorithms
  • H.H. Bock et al., Analysis of Symbolic Data: Exploratory Methods for Extracting Statistical Information from Complex Data (2000)
  • M.P. Brito et al., Pyramidal representation of symbolic objects
  • V. Cariou et al., Generalization method when manipulating relational databases, Revue des Nouvelles Technologies de l’Information (2015)
  • M.E. Celebi et al., A comparative study of efficient initialization methods for the k-means clustering algorithm, Expert Syst. Appl. (2013)
  • M. Chavent, A monothetic clustering method, Pattern Recognit. Lett. (1998)
  • M. Chavent et al., New clustering methods for interval data, Comput. Stat. (2006)
  • M. Chavent et al., Dynamical clustering of interval data: optimization of an adequacy criterion based on Hausdorff distance, Classification, Clustering, and Data Analysis (2002)
  • Y. Chen, Symbolic Data Regression and Clustering (2014)