Iterative factor clustering of binary data

  • Original Paper
  • Computational Statistics

Abstract

Binary data are a special case in which both distance measures and co-occurrence measures can be used. Euclidean distance-based non-hierarchical methods, such as the k-means algorithm or one of its variants, can therefore be applied profitably. As the number of available attributes increases, however, overall clustering performance usually worsens. In such cases, enhancing group separability requires removing the irrelevant and redundant noisy information from the data. The present approach is an attribute transformation strategy: it combines clustering and factorial techniques to identify the attribute associations that characterize one or more homogeneous groups of statistical units. It also provides graphical representations that facilitate interpretation of the results.
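The two ideas in the abstract can be illustrated with a rough sketch. It is not the authors' algorithm, whose details are not given on this page. The first part shows why distance and co-occurrence views coincide for binary data: the squared Euclidean distance between two 0/1 vectors equals the number of mismatches, b + c, in their 2x2 co-occurrence table. The second part assumes a generic alternation between a factorial step (projecting onto the directions that best separate the current group means) and a k-means step in the reduced space. All function names, parameters, and the simulated data are hypothetical and chosen only for illustration.

```python
import numpy as np

def cooccurrence_counts(x, y):
    """2x2 counts for two binary vectors: a = both 1, b = only x, c = only y, d = both 0."""
    x, y = np.asarray(x, dtype=bool), np.asarray(y, dtype=bool)
    return (int(np.sum(x & y)), int(np.sum(x & ~y)),
            int(np.sum(~x & y)), int(np.sum(~x & ~y)))

def kmeans(X, k, n_iter=50, seed=0):
    """Plain Lloyd-style k-means with squared Euclidean distance."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old centre if a cluster happens to be empty.
        new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return labels, centers

def factor_then_cluster(Z, k, n_dims, n_rounds=10, seed=0):
    """Assumed alternation: cluster in a low-dimensional space, then re-fit the space."""
    Zc = Z - Z.mean(axis=0)                      # centre each attribute
    _, _, Vt = np.linalg.svd(Zc, full_matrices=False)
    V = Vt[:n_dims].T                            # start from the leading principal axes
    labels = None
    for _ in range(n_rounds):
        new_labels, _ = kmeans(Zc @ V, k, seed=seed)
        if labels is not None and np.array_equal(new_labels, labels):
            break                                # partition unchanged, stop early
        labels = new_labels
        G = np.eye(k)[labels]                    # n x k group indicator matrix
        sizes = G.sum(axis=0)
        means = (G.T @ Zc) / np.maximum(sizes, 1)[:, None]
        B = np.sqrt(sizes)[:, None] * means      # size-weighted group means (k x p)
        _, _, Vt = np.linalg.svd(B, full_matrices=False)
        V = Vt[:n_dims].T                        # axes of between-group variation
    return labels, V

# Distance versus co-occurrence on two binary vectors.
x = np.array([1, 0, 1, 1, 0])
y = np.array([1, 1, 0, 1, 0])
a, b, c, d = cooccurrence_counts(x, y)
sq_euclid = int(np.sum((x - y) ** 2))            # equals b + c, which is 2 here

# Simulated data: 4 informative attributes plus 6 pure-noise attributes.
rng = np.random.default_rng(1)
group = np.repeat([0, 1], 30)
probs = np.full((60, 10), 0.5)
probs[:, :4] = np.where(group[:, None] == 0, 0.9, 0.1)
Z = (rng.random((60, 10)) < probs).astype(float)
labels, V = factor_then_cluster(Z, k=2, n_dims=1)
```

In this sketch `n_dims` must not exceed `min(k, p)`; with centred data the between-group matrix has rank at most k - 1, so `n_dims = k - 1` is a natural choice. The absolute values in the columns of `V` indicate which attributes drive the separation, which is the kind of information a factorial map makes visible.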

Author information

Corresponding author

Correspondence to Francesco Palumbo.

About this article

Cite this article

Iodice D’Enza, A., Palumbo, F. Iterative factor clustering of binary data. Comput Stat 28, 789–807 (2013). https://doi.org/10.1007/s00180-012-0329-x

  • DOI: https://doi.org/10.1007/s00180-012-0329-x
