UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning

  • Conference paper
Machine Learning and Knowledge Discovery in Databases. Research Track (ECML PKDD 2021)

Abstract

Subtype discovery consists of finding interpretable and consistent sub-parts of a dataset that are also relevant to a given supervised task. Mathematically, it can be defined as a clustering task driven by supervised learning, whose goal is to uncover subgroups in line with the supervised prediction. In this paper, we propose a general Expectation-Maximization ensemble framework entitled UCSL (Unsupervised Clustering driven by Supervised Learning). Our method is generic: it can integrate any clustering method and can be driven by either binary classification or regression. We propose to construct a non-linear model by merging multiple linear estimators, one per cluster. Each hyperplane is estimated so that it correctly discriminates, or predicts, only one cluster. We use SVC or Logistic Regression for classification and SVR for regression. Furthermore, to perform cluster analysis in a more suitable space, we also propose a dimension-reduction algorithm that projects the data onto an orthonormal space relevant to the supervised task. We analyze the robustness and generalization capability of our algorithm using synthetic and experimental datasets. In particular, we validate its ability to identify suitable, consistent subtypes by conducting a cluster analysis of psychiatric diseases with known ground-truth labels. The gain of the proposed method over previous state-of-the-art techniques is about +1.9 points in terms of balanced accuracy. Finally, we make code and examples available in a scikit-learn-compatible Python package: https://github.com/neurospin-projects/2021_rlouiset_ucsl/
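
To make the alternation described in the abstract more concrete, the following is a minimal sketch of one possible reading of it for the binary classification case. It is not the authors' implementation and does not use the API of the UCSL package: the helper names `orthonormal_basis` and `em_subtype_sketch`, the use of hard K-Means labels instead of soft cluster responsibilities, the QR-based orthonormalization, and the synthetic data are all assumptions made for illustration. Each iteration fits one linear classifier per cluster (that cluster's positives against all negatives), then projects the positives onto an orthonormal basis of the hyperplane normals and re-clusters them there.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import adjusted_rand_score


def orthonormal_basis(normals):
    """Orthonormalize hyperplane normals (one per row) with a reduced QR."""
    q, _ = np.linalg.qr(normals.T)
    return q  # shape: (n_features, n_clusters)


def em_subtype_sketch(X, y, n_clusters=2, n_iter=10, seed=0):
    """Hypothetical helper: alternate linear classifiers and re-clustering."""
    pos = y == 1
    # Initialization (assumption): hard K-Means labels on the positive class.
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
    labels = km.fit_predict(X[pos])
    basis = None
    for _ in range(n_iter):
        # Supervised step: one linear classifier per cluster, trained on
        # that cluster's positives versus all negatives.
        normals = []
        for k in range(n_clusters):
            keep = ~pos                # start from all negatives
            keep[pos] = labels == k    # add the positives of cluster k
            clf = LogisticRegression(max_iter=1000).fit(X[keep], y[keep])
            normals.append(clf.coef_.ravel())
        # Clustering step: project positives onto the orthonormalized
        # normals, then re-cluster them in that low-dimensional space.
        basis = orthonormal_basis(np.vstack(normals))
        km = KMeans(n_clusters=n_clusters, n_init=10, random_state=seed)
        new_labels = km.fit_predict(X[pos] @ basis)
        # Stop when the partition no longer changes (up to relabeling).
        converged = adjusted_rand_score(labels, new_labels) == 1.0
        labels = new_labels
        if converged:
            break
    return labels, basis


# Synthetic data (assumption): one control group and two positive subtypes.
rng = np.random.default_rng(0)
shift_a = np.array([4.0, 0.0, 0.0, 0.0, 0.0])
shift_b = np.array([0.0, 4.0, 0.0, 0.0, 0.0])
X = np.vstack([
    rng.normal(size=(100, 5)),            # negatives (controls)
    rng.normal(size=(50, 5)) + shift_a,   # positive subtype A
    rng.normal(size=(50, 5)) + shift_b,   # positive subtype B
])
y = np.r_[np.zeros(100, dtype=int), np.ones(100, dtype=int)]
true_subtypes = np.r_[np.zeros(50, dtype=int), np.ones(50, dtype=int)]

labels, basis = em_subtype_sketch(X, y)
print("ARI against the simulated subtypes:",
      adjusted_rand_score(true_subtypes, labels))
```

The sketch clusters only the positive class, on the assumption that subtypes are sought among cases rather than controls. The regression setting mentioned in the abstract (SVR) is not covered here.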

Author information

Corresponding authors

Correspondence to Pietro Gori or Antoine Grigis.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Louiset, R., Gori, P., Dufumier, B., Houenou, J., Grigis, A., Duchesnay, E. (2021). UCSL : A Machine Learning Expectation-Maximization Framework for Unsupervised Clustering Driven by Supervised Learning. In: Oliver, N., Pérez-Cruz, F., Kramer, S., Read, J., Lozano, J.A. (eds) Machine Learning and Knowledge Discovery in Databases. Research Track. ECML PKDD 2021. Lecture Notes in Computer Science, vol 12975. Springer, Cham. https://doi.org/10.1007/978-3-030-86486-6_46

  • DOI: https://doi.org/10.1007/978-3-030-86486-6_46

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-86485-9

  • Online ISBN: 978-3-030-86486-6

  • eBook Packages: Computer Science, Computer Science (R0)
