A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting

Vigneron, V.; Chen, H.

doi:10.1007/s10044-015-0458-2

Theoretical Advances
Published: 04 March 2015

Pattern Analysis and Applications, volume 19, pages 885–903 (2016)

V. Vigneron¹ & H. Chen²

Abstract

Seriation is a useful statistical method for visualizing clusters in a dataset. However, when the data are noisy or imbalanced, visualizing the data structure becomes challenging. To alleviate this limitation, we introduce a novel metric based on common neighborhood to evaluate the degree of sparsity in a dataset. A stack of matrices is derived for different levels of sparsity, and the matrices are permuted by a branch-and-bound algorithm. The matrix with the best block-diagonal form is then selected by a compactness criterion. The selected matrix reveals the intrinsic structure of the data by excluding noisy data and outliers. This seriation algorithm is applicable even if the number of clusters is unknown or if the clusters are imbalanced. However, if the metric introduces too much sparsity in the data, under-sampled groups of data could be discarded. To resolve this problem, a multi-scale approach combining different levels of sparsity is proposed. The capability of the proposed seriation method is examined both on toy problems and in the context of spike sorting.
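The idea of deriving one Boolean matrix per sparsity level from a common-neighborhood measure can be illustrated with a minimal Python sketch. The abstract does not give the metric itself, so the code below substitutes a plain shared-nearest-neighbor count as a stand-in. The function names (`knn_sets`, `common_neighbour_stack`, `density`), the neighborhood size `k`, the thresholds, and the toy points are all assumptions made for illustration, and the branch-and-bound permutation step is not shown.

```python
from math import dist

def knn_sets(points, k):
    """For each point, return the index set of its k nearest neighbours."""
    sets = []
    for i, p in enumerate(points):
        others = (j for j in range(len(points)) if j != i)
        ranked = sorted(others, key=lambda j: dist(p, points[j]))
        sets.append(set(ranked[:k]))
    return sets

def common_neighbour_stack(points, k, thresholds):
    """Build one Boolean matrix per threshold t (assumed stand-in metric).

    Entry (i, j) is 1 when points i and j share at least t of their
    k nearest neighbours; a larger t yields a sparser matrix.
    """
    nn = knn_sets(points, k)
    n = len(points)
    shared = [[len(nn[i] & nn[j]) for j in range(n)] for i in range(n)]
    return {t: [[1 if shared[i][j] >= t else 0 for j in range(n)]
                for i in range(n)]
            for t in thresholds}

def density(matrix):
    """Fraction of entries equal to 1."""
    return sum(map(sum, matrix)) / (len(matrix) ** 2)

# Two well-separated toy groups of four points each (illustrative data only).
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
stack = common_neighbour_stack(points, k=3, thresholds=[1, 2, 3])
densities = [density(stack[t]) for t in (1, 2, 3)]
```

In this sketch the density of the matrices can only decrease as the threshold grows, which mirrors the notion of increasing sparsity levels. A plain shared-neighbor count is known to link isolated points to the edge of a nearby group, so it should not be read as the metric proposed in the article.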

Notes

This criterion derives from the concept of a run used in data compression problems [33, 37]; a run characterizes the longest sequence of 1s on a row of a Boolean matrix.
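As a hedged illustration of the run concept, the Python sketch below computes the longest sequence of consecutive 1s on each row of a Boolean matrix. The aggregate `run_score` (sum of the longest runs divided by the number of 1s) is a hypothetical score chosen for this example and is not the compactness criterion defined in the article. The toy matrix and the permutation `order` are likewise assumptions.

```python
from itertools import groupby

def longest_run(row):
    """Length of the longest consecutive sequence of 1s in a Boolean row."""
    return max((sum(1 for _ in group)
                for value, group in groupby(row) if value == 1),
               default=0)

def run_score(matrix):
    """Hypothetical score: sum of longest row runs over the number of 1s.

    Equals 1.0 when the 1s of every row are contiguous.
    """
    ones = sum(map(sum, matrix))
    if ones == 0:
        return 0.0
    return sum(longest_run(row) for row in matrix) / ones

def permute(matrix, order):
    """Apply the same ordering to the rows and columns of a square matrix."""
    return [[matrix[i][j] for j in order] for i in order]

# Toy matrix: items 0 and 2 form one group, items 1 and 3 another.
scattered = [[1, 0, 1, 0],
             [0, 1, 0, 1],
             [1, 0, 1, 0],
             [0, 1, 0, 1]]
blocked = permute(scattered, [0, 2, 1, 3])

score_before = run_score(scattered)  # 0.5
score_after = run_score(blocked)     # 1.0
```

Reordering the items so that members of the same group are adjacent turns the scattered pattern into two diagonal blocks, and the run-based score rises accordingly.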

References

1. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inf Syst 3(2):7
2. Tome A, Schachtner S, Vigneron V, Puntonet C, Lang K (2013) A non-linear exploratory matrix factorization approach to binary data sets. Multidimens Syst Signal Process
3. Vigneron V, Kodewitz A, Lelandais S, Lang K (2015) Brain maps for Alzheimer’s disease early detection. In: Statistical signal processing in the analysis, characterization and detection of Alzheimer Disease. Bentham Science
4. Titterington DM, Smith A, Makov U (1985) Statistical analysis of finite mixture distributions. Wiley, Chichester
5. Arabie P, Hubert LJ, Soete GD (1996) An overview of combinatorial data analysis. In: Clustering and classification. World Scientific, River Edge, pp 5–63
6. Kohonen T (1995) Self organizing maps, Heidelberg edn. Springer, Berlin
7. Bishop C (2002) Pattern recognition and machine learning. Information Science and Statistics. Springer, New York
8. Marcotorchino F (1987) Block seriation problems: a unified approach. Appl Stoch Models Data Anal 3:73–91
9. Mechelen IV, Bock HH, Boeck PD (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13:363–394
10. Carroll D, Arabie P (1980) Multidimensional scaling. Ann Rev Psychol 31:607–649
11. Hubert L, Arabie P, Meulman J (2001) Combinatorial data analysis: optimization by dynamic programming. Society for Industrial and Applied Mathematics
12. Brusco M, Stahl S (2005) Branch and bound applications in combinatorial data analysis. Springer, New York
13. Doreian P, Batagelj V, Ferligoj A (2004) Generalized blockmodeling of two-mode network data. Soc Netw 26:29–53
14. Arabie P, Hubert L (1990) The bond energy algorithm revisited. IEEE Trans Syst Man Cybern 20(1):268–274
15. Climer S, Zhang W (2006) Rearrangement clustering: pitfalls, remedies and applications. J Mach Learn Res 7:919–943
16. Hahsler M, Hornik K, Buchta C (2009) Getting things in order: an introduction to the R package seriation. Tech Rep 58
17. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. KDD 96:226–231
18. Duda R, Hart P, Stork D (2001) Pattern classification. Wiley, New York
19. Cha S, Yoon S, Tappert C (2005) Enhancing binary feature vector similarity measures. Tech Rep Pace Univ 210
20. Ertoz L, Steinbach M, Kumar V (2002) Finding clusters of different sizes, shapes and densities in noise. In: Second SIAM international conference on data mining. Arlington
21. Vathy-Fogarassy A, Kiss A, Abonyi J (2007) Hybrid minimal spanning tree and mixture of Gaussians based clustering algorithm. In: Lecture Notes in Computer Science, Foundations of Information and Knowledge Systems. pp 313–330
22. Guha S, Rastogi R, Shim K (1999) ROCK: a robust clustering algorithm for categorical attributes. In: ICDE, vol. 15
23. Robinson W (1951) A method for chronologically ordering archaeological deposits. Am Antiq 16(4):293–301
24. Brusco M, Kohn HF, Stahl S (2008) Heuristic implementation of dynamic programming for matrix permutation problems in combinatorial data analysis. Psychometrika
25. McCormick W, Deutsch S, Martin J, Schweitzer P (1969) Identification of data structures and relationships by matrix reordering techniques. TR 512, Institute for Defense Analyses, Arlington
26. McCormick W, Schweitzer P, White T (1972) Problem decomposition and data reorganization by a clustering technique. Oper Res 20:993–1009
27. Long B, Zhang Z, Yu P (2005) Co-clustering by block value decomposition. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. pp 635–640
28. Govaert G, Nadif N (2008) Algorithms for model-based block Gaussian clustering. In: The 4th international conference on data mining. pp 536–272
29. Govaert G, Nadif N (2010) Latent block model for contingency table. Commun Stat Theory Methods 39:416–425
30. Caraux G, Pinloche S (2005) Permutmatrix: a graphical environment to arrange gene expression profiles in optimal linear order. Bioinformatics 21(7)
31. Brusco M, Steinley D (2006) Inducing a blockmodel structure of two-mode binary data using seriation procedures. J Math Psychol 50:468–477
32. Chen C (2002) Generalized association plots: information visualization via iteratively generated correlation matrices. Stat Sinica 12:7–29
33. Johnson D, Krishnan S, Chhugani J (2004) Compressing large Boolean matrices using reordering techniques. Proc Thirtieth Int Conf Very Large Data Bases 30:13–23
34. Niermann S (2005) Optimizing the ordering of tables with evolutionary computations. Am Stat 59(1):41–46
35. Batagelj V (1997) Notes on blockmodeling. Soc Netw 7:143–155
36. Kirkpatrick S, Gelatt C, Vecchi M (1983) Optimization by simulated annealing. Science 220:681–690
37. Apaydin T, Tosun A, Ferhatosmanoglu H (2008) Analysis of basic data reordering techniques. SSDBM:517–524
38. Vigneron V, Chen Y, Chen Y, Chen Y (2009) Dictionary-based classification models. Applications for multichannel neural activity analysis. In: 11th International Conference on Engineering Applications of Neural Networks, vol. LNCS 7899. Springer, London, pp 27–29
39. Brunet C, Willman T, Vigneron V (2011) Une famille de matrices sparses pour une modélisation multi-échelle par blocs. In: Revue des Nouvelles Technologies de l’Information. Hermann
40. Chen H, Murray A (2003) Continuous restricted Boltzmann machine with an implementable training algorithm. Vision Image Signal Process IEE Proc 150(3):153–158
41. Hastie T, Tibshirani R (1994) Discriminant analysis by Gaussian mixtures. AT&T Bell Laboratories, Murray Hill (technical report)
42. Bishop C (2006) Pattern recognition and machine learning. Springer, New York
43. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth International Group, Belmont
44. McLachlan G, Basford K (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York
45. Hartigan J (1975) Clustering algorithms. Wiley, New York
46. Nolan D (1991) The excess-mass. J Multivar Anal 39:348–371

Acknowledgments

This project was supported in part by funding from the Hubert Curien program of the French Ministry of Foreign Affairs and from the Taiwan NSC. The neural activity recordings were kindly provided by the Neuroengineering Lab of the National Chiao-Tung University.

Author information

Authors and Affiliations

¹ IBISC, EA 4526, Université d’Évry Val d’Essonne, 40 rue du Pelvoux, CE1455, 91020 Courcouronnes, France (V. Vigneron)
² Department of Electrical Engineering, National Tsing Hua University, No. 101, Sec. 2, Kuang-Fu Road, Hsin-Chu 30013, Taiwan (H. Chen)

Corresponding author

Correspondence to V. Vigneron.

About this article

Cite this article

Vigneron, V., Chen, H. A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting. Pattern Anal Applic 19, 885–903 (2016). https://doi.org/10.1007/s10044-015-0458-2

Received: 06 January 2013
Accepted: 06 February 2015
Published: 04 March 2015
Issue Date: November 2016
DOI: https://doi.org/10.1007/s10044-015-0458-2