Smoothing dissimilarities to cluster binary data

doi:10.1016/j.csda.2008.03.012

Computational Statistics & Data Analysis

Volume 52, Issue 10, 15 June 2008, Pages 4699-4711

Abstract

Cluster analysis attempts to group data objects into homogeneous clusters on the basis of the pairwise dissimilarities among the objects. When the data contain noise, we might consider performing a smoothing operation, either on the data themselves or on the dissimilarities, before implementing the clustering algorithm. Possible benefits of such pre-smoothing are discussed in the context of binary data. We suggest a method for cluster analysis of binary data based on “smoothed” dissimilarities. The smoothing method presented borrows ideas from shrinkage estimation of cell probabilities. Some simulation results are given showing that smoothing improves the accuracy of the clustering result, especially when the observed data contain substantial noise. The method is illustrated with an example involving binary test item response data.

Introduction

Cluster analysis is the statistical technique of separating objects, or observations, into homogeneous groups on the basis of (typically multivariate) data for several variables. We often picture the variables as continuous, but there is a substantial literature on clustering objects based on binary data (e.g., Everitt et al., 2001; Kaufman and Rousseeuw, 1990).

When the data contain some type of noise (whether measurement error or merely unexplained variability), it is intuitive that smoothing the data, when done properly, may better recapture the underlying process generating the data. Often individual data values contain substantial noise and are thus less reliable reflections of the process we hope to understand. Smoothing methods attempt to reduce this noise by balancing the information in individual data points with information in the data set as a whole, or by shrinking data values toward some assumed structural model. Cluster analysis itself may be viewed as a type of smoothing, in the sense of being a technique to obtain a less complex structure from noisy data. However, standard clustering methods can be sensitive to outliers that could exist when we directly cluster observed data. Therefore, clustering a smoothed version of the data may be preferable to clustering the observed (unsmoothed) data.

In certain situations the idea of smoothing is natural. For example, Hitchcock et al. (2007) showed that a shrinkage method of smoothing could aid in the clustering of functional data (data arising as curves). With binary data, the idea of “smoothing” seems less natural than with functional data, but the concept of shrinkage will be an important one in the methods discussed here.

A common method for clustering binary data objects is to define pairwise dissimilarities among the objects, each of which is typically a function of the number of matches (or mismatches) among the $P$ binary variables measured on the pair of objects. A “match” occurs when, for a certain variable, both objects share the same value (both 0 or both 1). For any pair of objects a 2×2 table of matches and mismatches may be constructed. Our smoothing method will fundamentally use this table.
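The 2×2 table for a pair of objects can be illustrated with a short sketch. The code below is only an illustration under stated assumptions: the function names `match_table` and `simple_matching_dissimilarity` are hypothetical, and the simple matching dissimilarity (the proportion of mismatches among the $P$ variables) is used as one common choice, not necessarily the dissimilarity adopted in Section 2.

```python
import numpy as np


def match_table(x, y):
    """Return the 2x2 table of joint outcomes for two binary vectors.

    Entry [a, b] counts the variables where x == a and y == b, so the
    diagonal holds the matches and the off-diagonal holds the mismatches.
    """
    x = np.asarray(x, dtype=int)
    y = np.asarray(y, dtype=int)
    if x.shape != y.shape:
        raise ValueError("both objects must be measured on the same P variables")
    table = np.zeros((2, 2), dtype=int)
    for a in (0, 1):
        for b in (0, 1):
            table[a, b] = np.sum((x == a) & (y == b))
    return table


def simple_matching_dissimilarity(x, y):
    """Proportion of the P variables on which the two objects disagree.

    Assumption: simple matching is used here only as a familiar example.
    """
    table = match_table(x, y)
    mismatches = table[0, 1] + table[1, 0]
    return mismatches / table.sum()


# Two hypothetical objects measured on P = 6 binary variables.
x = [1, 0, 1, 1, 0, 0]
y = [1, 1, 1, 0, 0, 0]
table = match_table(x, y)
d = simple_matching_dissimilarity(x, y)
print(table)
print(d)
```

Here the two objects agree on four of the six variables, so the dissimilarity is 2/6.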

In Section 2 we formally define the dissimilarities for a set of binary data and introduce a clustering method based on a smoothed version of this collection of dissimilarities. Section 3 describes a simulation study to determine the effect of this smoothing method on the accuracy of the cluster analysis. In Section 4, we apply the method to a real data set involving test item responses, and Section 5 concludes.

Method

In this section we present a method of clustering binary data objects based on dissimilarities that are “smoothed” via a shrinkage technique. As a motivation for this approach, consider the following hypothetical example. A class of schoolchildren are given a series of tests, each of which entails performing some physical task (e.g., doing a pull-up, jumping over a bar, etc.). The data point observed on each child for each task is binary (0/1) according to whether the task was successfully
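The general idea of shrinking cell probabilities can be sketched as follows. This excerpt does not show the estimator actually used in the paper, so the code is only the textbook convex-combination form of a shrinkage estimator, in which observed cell proportions are pulled toward a fixed target. The function name `shrink_cell_proportions`, the parameter `weight`, and the uniform default target are all assumptions made for illustration.

```python
import numpy as np


def shrink_cell_proportions(table, weight, target=None):
    """Shrink observed cell proportions of a contingency table toward a target.

    Generic form: (1 - lam) * observed + lam * target, with
    lam = weight / (n + weight), where n is the table total.
    Assumption: `weight` and the uniform default target are hypothetical
    choices; they are not taken from the paper.
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    if n <= 0:
        raise ValueError("table must contain at least one observation")
    if weight < 0:
        raise ValueError("weight must be non-negative")
    observed = table / n
    if target is None:
        # Uniform target: every cell equally likely.
        target = np.full(table.shape, 1.0 / table.size)
    lam = weight / (n + weight)
    return (1.0 - lam) * observed + lam * target


# Hypothetical 2x2 table of matches (diagonal) and mismatches (off-diagonal).
counts = [[2, 1], [1, 2]]
smoothed = shrink_cell_proportions(counts, weight=2.0)
# A "smoothed" mismatch proportion read off the shrunken table.
smoothed_mismatch = smoothed[0, 1] + smoothed[1, 0]
print(smoothed)
print(smoothed_mismatch)
```

With `weight=0` the function returns the raw proportions, and larger weights pull every cell closer to the target, which is the behaviour one expects of a shrinkage estimator in general.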

Simulation study

In this section we describe a simulation study to measure the effect of smoothing the dissimilarities on the accuracy of the clustering of a binary data set. For a simulated data set of $n$ objects (i.e., individuals), generated from a built-in clustering structure, we will measure “accuracy” via the statistic proposed by Rand (1971). For any partitioning of the objects, the Rand statistic gives the proportion of pairs of objects that are correctly placed either together or apart (depending on how
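The Rand statistic described above can be computed directly from its definition by checking every pair of objects. The following is a minimal sketch; the function name `rand_statistic` and the example label vectors are hypothetical and are not taken from the simulation study.

```python
from itertools import combinations


def rand_statistic(true_labels, found_labels):
    """Proportion of object pairs on which two partitions agree.

    A pair counts as correctly placed when it is together in both
    partitions or apart in both partitions.
    """
    n = len(true_labels)
    if n != len(found_labels):
        raise ValueError("both partitions must label the same n objects")
    if n < 2:
        raise ValueError("at least two objects are needed to form a pair")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        together_true = true_labels[i] == true_labels[j]
        together_found = found_labels[i] == found_labels[j]
        if together_true == together_found:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Hypothetical example: six objects, true structure versus a clustering
# result that misplaces the third object.
true_labels = [0, 0, 0, 1, 1, 1]
found_labels = [0, 0, 1, 1, 1, 1]
r = rand_statistic(true_labels, found_labels)
print(r)
```

The statistic depends only on which objects share a cluster, not on the cluster labels themselves, so relabeling the clusters leaves it unchanged. The pairwise loop is quadratic in $n$, which is adequate for a sketch but slow for very large data sets.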

An application to binary test data

In this section we apply the smoothed-dissimilarity clustering method to a real data set, the ACT mathematics test results for 2115 male examinees, studied in Ramsay and Silverman (2002) and made available on Silverman’s web site http://www.stats.ox.ac.uk/~silverma/fdacasebook/testitems.html in plain text form. The data are given as a matrix of zeroes and ones having 2115 rows (representing the examinees) and 60 columns (representing the test items). An observation $y_{i j} = 0$ indicates that student $i$

Conclusion

We have introduced a novel method of smoothing the dissimilarities among binary data as a preliminary step to cluster analysis. This method, described in Section 2, borrows ideas developed for the shrinkage estimation of cell probabilities in contingency tables. The simulation study in Section 3 indicates that the smoothing method most effectively improves clustering accuracy in the most difficult situation for clustering: when the within-cluster data variability is high and when the true

Acknowledgments

The authors are grateful to the associate editor and two anonymous referees for comments which resulted in an improvement to this paper.

References

P.J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)

J.H. Albert. Empirical Bayes estimation in contingency tables. Commun. Stat. A–Theory Methods (1987)

B. Everitt et al. Cluster Analysis (2001)

S.E. Fienberg et al. Simultaneous estimation of multinomial cell probabilities. J. Amer. Statist. Assoc. (1973)

H. Finch. Comparison of distance measures in cluster analysis with dichotomous data. J. Data Science (2005)

S. Hands et al. A Monte Carlo study of the recovery of cluster structure in binary data by hierarchical clustering techniques. Multivariate Behav. Res. (1987)

D.B. Hitchcock et al. The effect of pre-smoothing functional data on cluster analysis. J. Stat. Comput. Simul. (2007)

There are more references available in the full text version of this article.
