A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting

  • Theoretical Advances

Abstract

Seriation is a useful statistical method for visualizing clusters in a dataset. However, when the data are noisy or imbalanced, visualizing the data structure becomes challenging. To alleviate this limitation, we introduce a novel metric based on common neighborhood to evaluate the degree of sparsity in a dataset. A stack of matrices is derived for different levels of sparsity, and each matrix is permuted by a branch-and-bound algorithm. The matrix with the best block-diagonal form is then selected by a compactness criterion. The selected matrix reveals the intrinsic structure of the data by excluding noisy data and outliers. This seriation algorithm is applicable even if the number of clusters is unknown or the clusters are imbalanced. However, if the metric introduces too much sparsity into the data, under-sampled groups of data may be discarded. To resolve this problem, a multi-scale approach combining different levels of sparsity is proposed. The capability of the proposed seriation method is examined both on toy problems and in the context of spike sorting.
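Because only the abstract is available here, the following Python sketch illustrates the general idea rather than the authors' method: the common-neighborhood metric is assumed to count shared k-nearest neighbours, the sparsity levels are obtained by thresholding that count, a greedy nearest-neighbour ordering stands in for the paper's branch-and-bound permutation, and the compactness criterion is taken as the mean distance of nonzero entries from the diagonal. The function names (common_neighborhood, seriate, compactness, multiscale_seriation) and the parameters k and thresholds are hypothetical.

import numpy as np

def common_neighborhood(X, k=5):
    # Similarity = number of shared k-nearest neighbours between two points
    # (an assumed reading of the paper's common-neighborhood metric).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    nn = [set(row) for row in np.argsort(d, axis=1)[:, 1:k + 1]]
    n = len(X)
    return np.array([[len(nn[i] & nn[j]) for j in range(n)] for i in range(n)])

def seriate(B):
    # Greedy nearest-neighbour ordering of a similarity matrix; a cheap
    # stand-in for the branch-and-bound permutation used in the paper.
    order = [0]
    remaining = set(range(1, B.shape[0]))
    while remaining:
        nxt = max(remaining, key=lambda j: B[order[-1], j])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def compactness(B, order):
    # One possible compactness criterion: nonzero entries of the permuted
    # matrix should lie close to the diagonal (higher score = more compact).
    P = B[np.ix_(order, order)]
    i, j = np.nonzero(P)
    return -np.inf if i.size == 0 else -np.abs(i - j).mean()

def multiscale_seriation(X, k=5, thresholds=(1, 2, 3)):
    # Binarize the common-neighborhood matrix at several sparsity levels,
    # seriate each one, and keep the level whose reordered matrix is most compact.
    S = common_neighborhood(X, k)
    best = None
    for t in thresholds:
        B = (S >= t).astype(int)
        order = seriate(B)
        score = compactness(B, order)
        if best is None or score > best[0]:
            best = (score, t, order)
    return best  # (score, selected sparsity level, permutation)

# Toy imbalanced data: one large and one small Gaussian cluster.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (20, 2)), rng.normal(3, 0.3, (5, 2))])
score, level, order = multiscale_seriation(X)
print(level, order)

On such toy data the small cluster remains visible in the final ordering because the selection is made across several sparsity levels rather than committing to a single one; the exact branch-and-bound search and compactness criterion are defined in the full article.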

Notes

  1. This criterion derives from the concept of run used in data-compression problems [33, 37], a run being the longest consecutive sequence of 1s in a row of a Boolean matrix; a short illustration follows.
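As a concrete illustration of that compression-related notion (not of the paper's criterion itself), the run length of each row of a Boolean matrix can be computed as follows; the function name longest_run_per_row is hypothetical.

import numpy as np

def longest_run_per_row(B):
    # Length of the longest consecutive run of 1s in each row of a Boolean matrix.
    runs = []
    for row in np.asarray(B, dtype=int):
        best = cur = 0
        for v in row:
            cur = cur + 1 if v else 0
            best = max(best, cur)
        runs.append(best)
    return runs

# After a good seriation, the 1s gather into long runs along each row.
B = np.array([[1, 1, 1, 0, 0],
              [0, 1, 1, 1, 0],
              [0, 0, 0, 1, 1]])
print(longest_run_per_row(B))  # [3, 3, 2]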

References

  1. van der Aalst W (2012) Process mining: overview and opportunities. ACM Trans Manag Inf Syst 3(2):7

  2. Tome A, Schachtner S, Vigneron V, Puntonet C, Lang K (2013) A non-linear exploratory matrix factorization approach to binary data sets. Multidimens Syst Signal Process

  3. Vigneron V, Kodewitz A, Lelandais S, Lang K (2015) Brain maps for Alzheimer's disease early detection. In: Statistical signal processing in the analysis, characterization and detection of Alzheimer disease. Bentham Science

  4. Titterington DM, Smith A, Makov U (1985) Statistical analysis of finite mixture distributions. Wiley, Chichester

  5. Arabie P, Hubert LJ, Soete GD (1996) An overview of combinatorial data analysis. In: Clustering and classification. World Scientific, River Edge, pp 5–63

  6. Kohonen T (1995) Self-organizing maps. Springer, Berlin, Heidelberg

  7. Bishop C (2002) Pattern recognition and machine learning. Information Science and Statistics. Springer, New York

  8. Marcotorchino F (1987) Block seriation problems: a unified approach. Appl Stoch Models Data Anal 3:73–91

  9. Mechelen IV, Bock HH, Boeck PD (2004) Two-mode clustering methods: a structured overview. Stat Methods Med Res 13:363–394

  10. Carroll D, Arabie P (1980) Multidimensional scaling. Ann Rev Psychol 31:607–649

  11. Hubert L, Arabie P, Meulman J (2001) Combinatorial data analysis: optimization by dynamic programming. Society for Industrial and Applied Mathematics

  12. Brusco M, Stahl S (2005) Branch and Bound applications in combinatorial data analysis. Springer, New York

  13. Doreian P, Batagelj V, Ferligoj A (2004) Generalized blockmodeling of two-mode network data. Soc Netw 26:29–53

  14. Arabie P, Hubert L (1990) The bond energy algorithm revisited. IEEE Trans Syst Man Cybern 20(1):268–274

  15. Climer S, Zhang W (2006) Rearrangement clustering: pitfalls, remedies and applications. J Mach Learn Res 7:919–943

  16. Hahsler M, Hornik K, Buchta C (2009) Getting things in order: an introduction to the R package seriation. Tech Rep 58

  17. Ester M, Kriegel HP, Sander J, Xu X (1996) A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of KDD-96, pp 226–231

  18. Duda R, Hart P, Stork D (2001) Pattern classification. Wiley, New York

  19. Cha S, Yoon S, Tappert C (2005) Enhancing binary feature vector similarity measures. Tech Rep Pace Univ 210

  20. Ertoz L, Steinbach M, Kumar V (2002) Finding clusters of different sizes, shapes and densities in noise. In: Second SIAM international conference on data mining. Arlington

  21. Vathy-Fogarassy A, Kiss A, Abonyi J (2007) Hybrid minimal spanning tree and mixture of gaussians based clustering algorithm. In: Lecture Notes in Computer Science, Foundations of Information and Knowledge Systems. pp 313–330

  22. Guha S, Rastogi R, Shim K (1999) Rock: a robust clustering algorithm for categorical attributes. In: ICDE, vol. 15

  23. Robinson W (1951) A method for chronologically ordering archaeological deposits. Am Antiq 16(4):293–301

  24. Brusco M, Kohn HF, Stahl S (2008) Heuristic implementation of dynamic programming for matrix permutation problems in combinatorial data analysis. Psychometrika

  25. McCormick W, Deutsch S, Martin J, Schweitzer P (1969) Identification of data structures and relationships by matrix reordering techniques. TR 512, Institute for defense analyses, Arlington

  26. McCormick W, Schweitzer P, White T (1972) Problem decomposition and data reorganization by a clustering technique. Oper Res 20:993–1009

  27. Long B, Zhang Z, Yu P (2005) Co-clustering by block value decomposition. In: Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining. pp 635–640

  28. Govaert G, Nadif N (2008) Algorithms for model-based block Gaussian clustering. In: The 4th International Conference on Data Mining, pp 536–272

  29. Govaert G, Nadif N (2010) Latent block model for contingency table. Commun Stat Theory Methods 39:416–425

  30. Caraux G, Pinloche S (2005) Permutmatrix: a graphical environment to arrange gene expression profiles in optimal linear order. Bioinformatics 21(7)

  31. Brusco M, Steinley D (2006) Inducing a blockmodel structure of two-mode binary data using seriation procedures. J Math Psychol 50:468–477

  32. Chen C (2002) Generalized association plots: information visualization via iteratively generated correlation matrices. Stat Sinica 12:7–29

  33. Johnson D, Krishnan S, Chhugani J (2004) Compressing large boolean matrices using reordering techniques. Proc Thirtieth Int Conf Very Large Data Bases 30:13–23

  34. Niermann S (2005) Optimizing the ordering of tables with evolutionary computations. Am Stat 59(1):41–46

  35. Batagelj V (1997) Notes on blockmodeling. Soc Netw 7:143–155

  36. Kirkpatrick S, Gelatt C, Vecchi M (1983) Optimization by simulated annealing. Science 220:671–680

  37. Apaydin T, Tosun A, Ferhatosmanoglu H (2008) Analysis of basic data reordering techniques. SSDBM:517–524

  38. Vigneron V, Chen Y, Chen Y, Chen Y (2009) Dictionary-based classification models: applications for multichannel neural activity analysis. In: 11th International Conference on Engineering Applications of Neural Networks, vol LNCS 7899. Springer, London, pp 27–29

  39. Brunet C, Willman T, Vigneron V (2011) Une famille de matrices sparses pour une modélisation multi-échelle par blocs. In: Revue des Nouvelles Technologies de l’Information. Hermann

  40. Chen H, Murray A (2003) Continuous restricted boltzmann machine with an implementable training algorithm. Vision Image Signal Process IEE Proc 150(3):153–158

  41. Hastie T, Tibshirani R (1994) Discriminant analysis by gaussian mixtures. AT & T Bell laboratories, Murray Hill (technical report)

  42. Bishop C (2006) Pattern recognition and machine learning. Springer, New York

  43. Breiman L, Friedman J, Olshen R, Stone C (1984) Classification and regression trees. Wadsworth International Group, Belmont

  44. McLachlan G, Basford K (1988) Mixture models: inference and applications to clustering. Marcel Dekker, New York

  45. Hartigan J (1975) Clustering Algorithms. Wiley, New-York

  46. Nolan D (1991) The excess-mass ellipsoid. J Multivar Anal 39:348–371

Acknowledgments

This project was supported in part by funding from the Hubert Curien program of the French Ministry of Foreign Affairs and from the Taiwan NSC. The neural activity recordings were kindly provided by the Neuroengineering Laboratory of National Chiao-Tung University.

Author information

Corresponding author

Correspondence to V. Vigneron.

About this article

Cite this article

Vigneron, V., Chen, H. A multi-scale seriation algorithm for clustering sparse imbalanced data: application to spike sorting. Pattern Anal Applic 19, 885–903 (2016). https://doi.org/10.1007/s10044-015-0458-2
