Abstract
Seriation is a useful statistical method for visualizing clusters in a dataset. However, when the data are noisy or imbalanced, visualizing the data structure becomes challenging. To alleviate this limitation, we introduce a novel metric based on common neighborhood to evaluate the degree of sparsity in a dataset. A set of matrices is derived for different levels of sparsity, and each matrix is permuted by a branch-and-bound algorithm. The matrix with the best block-diagonal form is then selected by a compactness criterion. The selected matrix reveals the intrinsic structure of the data by excluding noisy data and outliers. This seriation algorithm is applicable even if the number of clusters is unknown or if the clusters are imbalanced. However, if the metric introduces too much sparsity in the data, sub-sampled groups of data may be discarded. To resolve this problem, a multi-scale approach combining different levels of sparsity is proposed. The capability of the proposed seriation method is evaluated on toy problems and in the context of spike sorting.
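The idea of deriving one sparse matrix per sparsity level from a common-neighborhood measure can be illustrated with a minimal sketch. It is not the metric defined in the paper: the function name `shared_neighbor_matrix`, the mutual k-nearest-neighbor rule, and the `k // 2` threshold are assumptions chosen for illustration, with the neighborhood size `k` standing in for the sparsity level. The permutation step (branch-and-bound) and the compactness criterion are not reproduced here.

```python
import numpy as np

def shared_neighbor_matrix(X, k):
    """Binary similarity matrix built from common neighbours (illustrative).

    Two points are linked when each lies in the other's k-nearest-neighbour
    list and the two lists share at least k // 2 points. Both rules are
    assumptions of this sketch, not the paper's definition.
    """
    n = X.shape[0]
    # Pairwise squared Euclidean distances.
    d = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    np.fill_diagonal(d, np.inf)            # a point is not its own neighbour
    knn = np.argsort(d, axis=1)[:, :k]     # indices of the k nearest neighbours
    member = np.zeros((n, n), dtype=bool)
    member[np.arange(n)[:, None], knn] = True   # member[i, j]: j is in i's list
    counts = member.astype(int)
    shared = counts @ counts.T             # number of common neighbours
    mutual = member & member.T             # i and j list each other
    return (mutual & (shared >= k // 2)).astype(int)

# One matrix per sparsity level: smaller k gives a sparser matrix.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(0.0, 0.1, size=(6, 2)),     # cluster A
    rng.normal(10.0, 0.1, size=(6, 2)),    # cluster B
    [[100.0, -100.0]],                     # isolated outlier
])
levels = {k: shared_neighbor_matrix(X, k) for k in (3, 5)}
```

In this toy setting the outlier's row is empty at both levels, since no cluster point lists it among its nearest neighbours, which is one simple way a sparse neighbourhood matrix can exclude outliers before any reordering is attempted.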
Acknowledgments
This project was supported in part by funding from the Hubert Curien program of the French Ministry of Foreign Affairs and from the Taiwan NSC. The neural activity recordings were kindly provided by the Neuroengineering Lab of the National Chiao-Tung University.
Cite this article
Vigneron, V., Chen, H. A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting. Pattern Anal Applic 19, 885–903 (2016). https://doi.org/10.1007/s10044-015-0458-2
DOI: https://doi.org/10.1007/s10044-015-0458-2