Abstract
Binary data are a special case in which both distance measures and co-occurrence measures can be adopted. Non-hierarchical methods based on Euclidean distance, such as the k-means algorithm or one of its variants, can be used profitably. As the number of available attributes increases, however, overall clustering performance usually worsens. In such cases, enhancing group separability requires removing irrelevant and redundant noisy information from the data. The present approach belongs to the category of attribute transformation strategies: it combines clustering and factorial techniques to identify attribute associations that characterize one or more homogeneous groups of statistical units. It also provides graphical representations that facilitate interpretation of the results.
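As a minimal illustration of the Euclidean k-means setting mentioned above, the sketch below clusters a small binary matrix in plain Python. It is not the iterative factor clustering procedure of the paper: the toy data, the function names (`sq_euclidean`, `hamming`, `kmeans`) and the fixed choice of starting rows are all assumptions made for this example. It also shows that, for 0/1 vectors, the squared Euclidean distance equals the number of mismatching attributes, which is one way distance and co-occurrence views coincide on binary data.

```python
def sq_euclidean(x, y):
    """Squared Euclidean distance between two equal-length numeric vectors."""
    return sum((a - b) ** 2 for a, b in zip(x, y))


def hamming(x, y):
    """Number of attributes on which two binary vectors disagree."""
    return sum(a != b for a, b in zip(x, y))


def kmeans(rows, init, n_iter=20):
    """Plain Lloyd-style k-means; `init` lists the row indices used as starting centroids."""
    centroids = [[float(v) for v in rows[i]] for i in init]
    k = len(centroids)
    labels = [0] * len(rows)
    for _ in range(n_iter):
        # Assignment step: each unit goes to its nearest centroid.
        labels = [min(range(k), key=lambda g: sq_euclidean(r, centroids[g]))
                  for r in rows]
        # Update step: each centroid becomes the per-attribute mean of its members.
        for g in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == g]
            if members:
                centroids[g] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy data: 6 statistical units described by 6 binary attributes.
rows = [
    [1, 1, 1, 0, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 1, 0, 0, 0],
    [0, 0, 0, 1, 1, 1],
    [0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1],
]

# Start from rows 0 and 3 (an arbitrary, fixed choice for reproducibility).
labels, centroids = kmeans(rows, init=(0, 3))
print(labels)
print(centroids)
```

With binary attributes, each centroid coordinate is the proportion of units in the group that possess the attribute, so the centroids can be read directly as group profiles.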
About this article
Cite this article
Iodice D’Enza, A., Palumbo, F. Iterative factor clustering of binary data. Comput Stat 28, 789–807 (2013). https://doi.org/10.1007/s00180-012-0329-x
DOI: https://doi.org/10.1007/s00180-012-0329-x