Abstract
We propose two algorithms for robust two-mode partitioning of a data matrix in the presence of outliers. First, we extend the robust k-means procedure to biclustering. We then slightly relax the definition of an outlier and propose a more flexible and parsimonious strategy, which is, however, inherently less robust. We discuss the breakdown properties of both algorithms and illustrate the methods with simulations and three real examples.
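As a rough illustration of the kind of robust k-means procedure the abstract starts from, the sketch below implements a one-mode trimmed k-means: at each step every row is assigned to its nearest center, the rows farthest from their centers are set aside as outliers, and the centers are recomputed from the retained rows only. This is an assumption about the starting point, not the paper's double clustering algorithm, which partitions rows and columns simultaneously. The function names, the trimming fraction `alpha`, and the toy data are all hypothetical.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two points."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def assign_and_trim(points, centers, n_keep):
    """Label each point with its nearest center; points outside the
    n_keep closest to their own center get the label -1 (trimmed)."""
    labels, dists = [], []
    for p in points:
        d = [sq_dist(p, c) for c in centers]
        j = min(range(len(centers)), key=d.__getitem__)
        labels.append(j)
        dists.append(d[j])
    order = sorted(range(len(points)), key=dists.__getitem__)
    kept = set(order[:n_keep])
    return [lab if i in kept else -1 for i, lab in enumerate(labels)]

def trimmed_kmeans(points, k, alpha=0.1, n_iter=100, seed=0):
    """Illustrative trimmed k-means (hypothetical helper, not the paper's code)."""
    rng = random.Random(seed)
    n_keep = len(points) - int(alpha * len(points))  # rows retained per step
    centers = [tuple(p) for p in rng.sample(points, k)]
    for _ in range(n_iter):
        labels = assign_and_trim(points, centers, n_keep)
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                # mean of the retained members, coordinate by coordinate
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # leave an empty cluster's center unchanged
        if new_centers == centers:
            break
        centers = new_centers
    # final labels are computed against the returned centers
    return centers, assign_and_trim(points, centers, n_keep)

# Toy data (made up): two tight groups and one gross outlier.
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9),
          (4.9, 5.0), (0.3, 0.2), (5.1, 5.3), (40.0, -30.0), (0.0, 0.0)]
centers, labels = trimmed_kmeans(points, k=2, alpha=0.1)
```

With `alpha=0.1` and ten rows, exactly one row is labelled `-1`. Which row that is depends on the random start, so in practice such procedures are usually run from several starting points and the best solution is kept.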
Additional information
The author is grateful to four referees for detailed suggestions that led to an improved paper, and to Professor Vichi for support and careful reading of a first draft. Acknowledgements go also to Francesca Martella for advice.
Cite this article
Farcomeni, A. Robust Double Clustering: A Method Based on Alternating Concentration Steps. J Classif 26, 77–101 (2009). https://doi.org/10.1007/s00357-009-9026-z