Smoothing dissimilarities to cluster binary data

https://doi.org/10.1016/j.csda.2008.03.012

Abstract

Cluster analysis attempts to group data objects into homogeneous clusters on the basis of the pairwise dissimilarities among the objects. When the data contain noise, we might consider performing a smoothing operation, either on the data themselves or on the dissimilarities, before implementing the clustering algorithm. Possible benefits to such pre-smoothing are discussed in the context of binary data. We suggest a method for cluster analysis of binary data based on “smoothed” dissimilarities. The smoothing method presented borrows ideas from shrinkage estimation of cell probabilities. Some simulation results are given showing that improvement in the accuracy of the clustering result is obtained via smoothing, especially in the case in which the observed data contain substantial noise. The method is illustrated with an example involving binary test item response data.

Introduction

Cluster analysis is the statistical technique of separating objects, or observations, into homogeneous groups on the basis of (typically multivariate) data for several variables. We often picture the variables as continuous, but there is a substantial literature about clustering objects based on binary data (e.g., Everitt et al. (2001) and Kaufman and Rousseeuw (1990)).

When the data contain some type of noise (whether measurement error or merely unexplained variability), it is intuitive that smoothing the data, when done properly, may better recapture the underlying process generating the data. Individual data values often contain substantial noise and are therefore less reliable reflections of the process we hope to understand. Smoothing methods attempt to reduce this noise by balancing the information in individual data points with information in the data set as a whole, or by shrinking data values toward some assumed structural model. Cluster analysis itself may be viewed as a type of smoothing, in the sense of being a technique to obtain a less complex structure from noisy data. However, standard clustering methods can be sensitive to outliers that may be present when we directly cluster the observed data. Therefore, clustering a smoothed version of the data may be preferable to clustering the observed (unsmoothed) data.

In certain situations the idea of smoothing is natural. For example, Hitchcock et al. (2007) showed that a shrinkage method of smoothing could aid in the clustering of functional data (data arising as curves). With binary data, the idea of “smoothing” seems less natural than with functional data, but the concept of shrinkage will be an important one in the methods discussed here.
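The shrinkage idea can be illustrated on a small set of cell counts. The sketch below is a generic convex-combination estimator that pulls observed cell proportions toward a fixed target. It is not the specific estimator developed in this paper: the function name `shrink_proportions`, the uniform target, and the weight of 0.2 are all assumptions chosen for illustration.

```python
def shrink_proportions(counts, target, weight):
    """Shrink observed cell proportions toward a target distribution.

    counts: nonnegative cell counts (e.g., the four cells of a 2x2 table).
    target: probabilities to shrink toward; must sum to 1.
    weight: value in [0, 1]; 0 returns the raw proportions, 1 returns the target.
    All three arguments are illustrative assumptions, not the paper's notation.
    """
    if len(counts) != len(target):
        raise ValueError("counts and target must have the same length")
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    if abs(sum(target) - 1.0) > 1e-9:
        raise ValueError("target probabilities must sum to 1")
    total = sum(counts)
    if total <= 0:
        raise ValueError("at least one observation is required")
    observed = [c / total for c in counts]
    # Convex combination of the observed proportions and the target.
    return [(1.0 - weight) * p + weight * t for p, t in zip(observed, target)]


# Hypothetical counts with one empty cell; the uniform target is an assumption.
counts = [7, 0, 1, 2]
uniform = [0.25, 0.25, 0.25, 0.25]
smoothed = shrink_proportions(counts, uniform, weight=0.2)
```

With these inputs the empty cell receives a small positive estimate while the estimates still sum to one, which is the usual motivation for shrinking sparse cell proportions.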

A common method for clustering binary data objects is to define pairwise dissimilarities among the objects, each of which is typically a function of the number of matches (or mismatches) among the P binary variables measured on the pair of objects. A “match” occurs when, for a certain variable, both objects share the same value (both 0 or both 1). For any pair of objects a 2×2 table of matches and mismatches may be constructed. Our smoothing method will fundamentally use this table.
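As a minimal sketch of the table described above, the following code counts the four match and mismatch cells for one pair of objects and derives a dissimilarity from them. The simple matching dissimilarity (the proportion of mismatched variables) is used here only as one common choice; the dissimilarity the paper adopts is defined in its Section 2, which is not reproduced here. The function names and the example vectors are hypothetical.

```python
from collections import Counter


def match_table(x, y):
    """Return the 2x2 table of joint values for two equal-length 0/1 vectors.

    Keys are (value in x, value in y): (1, 1) and (0, 0) are matches,
    (1, 0) and (0, 1) are mismatches.
    """
    if len(x) != len(y):
        raise ValueError("both objects must be measured on the same P variables")
    if not x:
        raise ValueError("at least one variable is required")
    counts = Counter(zip(x, y))
    return {cell: counts.get(cell, 0) for cell in [(1, 1), (1, 0), (0, 1), (0, 0)]}


def simple_matching_dissimilarity(x, y):
    """Proportion of the P variables on which the two objects disagree."""
    table = match_table(x, y)
    mismatches = table[(1, 0)] + table[(0, 1)]
    return mismatches / len(x)


# Two hypothetical objects measured on P = 6 binary variables.
a = [1, 0, 1, 1, 0, 0]
b = [1, 1, 1, 0, 0, 0]
table = match_table(a, b)
d = simple_matching_dissimilarity(a, b)
```

Here the pair agrees on four variables and disagrees on two, so the dissimilarity is 2/6.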

In Section 2 we formally define the dissimilarities for a set of binary data and introduce a clustering method based on a smoothed version of this collection of dissimilarities. Section 3 describes a simulation study to determine the effect of this smoothing method on the accuracy of the cluster analysis. In Section 4, we apply the method to a real data set involving test item responses, and Section 5 concludes.

Method

In this section we present a method of clustering binary data objects based on dissimilarities that are “smoothed” via a shrinkage technique. As a motivation for this approach, consider the following hypothetical example. A class of schoolchildren are given a series of tests, each of which entails performing some physical task (e.g., doing a pull-up, jumping over a bar, etc.). The data point observed on each child for each task is binary (0/1) according to whether the task was successfully

Simulation study

In this section we describe a simulation study to measure the effect of smoothing the dissimilarities on the accuracy of the clustering of a binary data set. For a simulated data set of n objects (i.e. individuals), generated from a built-in clustering structure, we will measure “accuracy” via the statistic proposed by Rand (1971). For any partitioning of the objects, the Rand statistic gives the proportion of pairs of objects that are correctly placed either together or apart (depending on how
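The Rand (1971) statistic can be sketched directly from its definition as the proportion of object pairs on which two partitions agree, that is, pairs placed together in both or apart in both. The code below is an illustrative implementation under that standard definition, not the authors' simulation code; the function name and the example labels are hypothetical.

```python
from itertools import combinations


def rand_statistic(true_labels, found_labels):
    """Proportion of object pairs treated the same way by both partitions."""
    if len(true_labels) != len(found_labels):
        raise ValueError("both partitions must label the same n objects")
    n = len(true_labels)
    if n < 2:
        raise ValueError("at least two objects are needed to form a pair")
    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(n), 2):
        together_in_truth = true_labels[i] == true_labels[j]
        together_in_found = found_labels[i] == found_labels[j]
        # A pair counts as correct when both partitions agree on it.
        if together_in_truth == together_in_found:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Hypothetical example: six objects, one of which is misassigned.
truth = [0, 0, 0, 1, 1, 1]
found = [0, 0, 1, 1, 1, 1]
score = rand_statistic(truth, found)
```

Because only co-membership of pairs matters, the statistic does not depend on how the clusters happen to be numbered.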

An application to binary test data

In this section we apply the smoothed-dissimilarity clustering method to a real data set, the ACT mathematics test results for 2115 male examinees, studied in Ramsay and Silverman (2002) and made available on Silverman’s web site http://www.stats.ox.ac.uk/~silverma/fdacasebook/testitems.html in plain text form. The data are given as a matrix of zeroes and ones having 2115 rows (representing the examinees) and 60 columns (representing the test items). An observation yij=0 indicates that student i

Conclusion

We have introduced a novel method of smoothing the dissimilarities among binary data as a preliminary step to cluster analysis. This method, described in Section 2, borrows ideas developed for the shrinkage estimation of cell probabilities in contingency tables. The simulation study in Section 3 indicates that the smoothing method most effectively improves clustering accuracy in the most difficult situation for clustering: when the within-cluster data variability is high and when the true

Acknowledgments

The authors are grateful to the associate editor and two anonymous referees for comments which resulted in an improvement to this paper.

References (14)

  • P.J. Rousseeuw. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. J. Comput. Appl. Math. (1987)
  • J.H. Albert. Empirical Bayes estimation in contingency tables. Commun. Stat. A–Theory Methods (1987)
  • B. Everitt et al. Cluster Analysis (2001)
  • S.E. Fienberg et al. Simultaneous estimation of multinomial cell probabilities. J. Amer. Statist. Assoc. (1973)
  • H. Finch. Comparison of distance measures in cluster analysis with dichotomous data. J. Data Science (2005)
  • S. Hands et al. A Monte Carlo study of the recovery of cluster structure in binary data by hierarchical clustering techniques. Multivariate Behav. Res. (1987)
  • D.B. Hitchcock et al. The effect of pre-smoothing functional data on cluster analysis. J. Stat. Comput. Simul. (2007)
There are more references available in the full text version of this article.
