Skip to main content

A Systems Biology Approach for Unsupervised Clustering of High-Dimensional Data

  • Conference paper
  • First Online:

Part of the book series: Lecture Notes in Computer Science ((LNISA,volume 10122))

Abstract

One main challenge in modern medicine is the discovery of molecular disease subtypes characterized by relevant clinical differences, such as survival. However, clustering high-dimensional expression data is challenging due to noise and the curse of high-dimensionality. This article describes a disease subtyping pipeline that is able to exploit the important information available in pathway databases and clinical variables. The pipeline consists of a new feature selection procedure and existing clustering methods. Our procedure partitions a set of patients using the set of genes in each pathway as clustering features. To select the best features, this procedure estimates the relevance of each pathway and fuses relevant pathways. We show that our pipeline finds subtypes of patients with more distinctive survival profiles than traditional subtyping methods by analyzing a TCGA colon cancer gene expression dataset. Here we demonstrate that our pipeline improves three different clustering methods: k-means, SNF, and hierarchical clustering.

This is a preview of subscription content, log in via an institution.

Buying options

Chapter
USD   29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD   39.99
Price excludes VAT (USA)
  • Available as EPUB and PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD   54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Learn about institutional subscriptions

References

  1. Saria, S., Goldenberg, A.: Subtyping: what it is and its role in precision medicine. IEEE Intell. Syst. 30(4), 70–75 (2015)

    Article  Google Scholar 

  2. Eisen, M.B., Spellman, P.T., Brown, P.O., Botstein, D.: Cluster analysis and display of genome-wide expression patterns. Proc. Natl. Acad. Sci. 95(25), 14863–14868 (1998)

    Article  Google Scholar 

  3. Kim, E.Y., Kim, S.Y., Ashlock, D., Nam, D.: MULTI-K: accurate classification of microarray subtypes using ensemble k-means clustering. BMC Bioinform. 10, 260 (2009)

    Article  Google Scholar 

  4. Wang, B., Mezlini, A.M., Demir, F., Fiume, M., Tu, Z., Brudno, M., Haibe-Kains, B., Goldenberg, A.: Similarity network fusion for aggregating data types on a genomic scale. Nat. Methods 11(3), 333–337 (2014)

    Article  Google Scholar 

  5. Hsu, J.J., Finkelstein, D.M., Schoenfeld, D.A.: Outcome-driven cluster analysis with application to microarray data. PLoS ONE 10(11), e0141874 (2015)

    Article  Google Scholar 

  6. Shai, R., Shi, T., Kremen, T.J., Horvath, S., Liau, L.M., Cloughesy, T.F., Mischel, P.S., Nelson, S.F.: Gene expression profiling identifies molecular subtypes of gliomas. Oncogene 22(31), 4918–4923 (2003)

    Article  Google Scholar 

  7. Hira, Z.M., Gillies, D.F., Hira, Z.M., Gillies, D.F.: A review of feature selection and feature extraction methods applied on microarray data. Adv. Bioinform. 2015, e198363 (2015)

    Google Scholar 

  8. Huang, G.T., Cunningham, K.I., Benos, P.V., Chennubhotla, C.S.: Spectral clustering strategies for heterogeneous disease expression data. In: Pacific Symposium on Biocomputing, pp. 212–223 (2013)

    Google Scholar 

  9. Pyatnitskiy, M., Mazo, I., Shkrob, M., Schwartz, E., Kotelnikova, E.: Clustering gene expression regulators: new approach to disease subtyping. PLoS ONE 9(1), e84955 (2014)

    Article  Google Scholar 

  10. Li, T., Zhang, C., Ogihara, M.: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics 20(15), 2429–2437 (2004)

    Article  Google Scholar 

  11. Hernández-Torruco, J., Canul-Reich, J., Frausto-Solís, J., Méndez-Castillo, J.J.: Feature selection for better identification of subtypes of Guillain-Barré. Comput. Math. Methods Med. 2014, e432109 (2014)

    Article  Google Scholar 

  12. Sebastiani, F.: Machine learning in automated text categorization. ACM Comput. Surv. 34(1), 1–47 (2002)

    Article  Google Scholar 

  13. Liu, Y., Schumann, M.: Data mining feature selection for credit scoring models. J. Oper. Res. Soc. 56(9), 1099–1108 (2005)

    Article  MATH  Google Scholar 

  14. Zheng, Z., Wu, X., Srihari, R.: Feature selection for text categorization on imbalanced data. SIGKDD Explor. Newsl. 6(1), 80–89 (2004)

    Article  Google Scholar 

  15. Hall, M.A.: Correlation-based feature selection for machine learning. Ph.D. thesis, The University of Waikato (1999)

    Google Scholar 

  16. Diaz-Uriarte, R., de Andres, S.A.: Gene selection and classification of microarray data using random forest. BMC Bioinform. 7, 3 (2006)

    Article  Google Scholar 

  17. Sharma, A., Imoto, S., Miyano, S., Sharma, V.: Null space based feature selection method for gene expression data. Int. J. Mach. Learn. Cybern. 3(4), 269–276 (2011)

    Article  Google Scholar 

  18. Bair, E., Tibshirani, R.: Semi-supervised methods to predict patient survival from gene expression data. PLOS Biol. 2(4), e108 (2004)

    Article  Google Scholar 

  19. Paoli, S., Jurman, G., Albanese, D., Merler, S., Furlanello, C.: Integrating gene expression profiling and clinical data. Int. J. Approx. Reason. 47(1), 58–69 (2008)

    Article  Google Scholar 

  20. Bushel, P.R., Wolfinger, R.D., Gibson, G.: Simultaneous clustering of gene expression data with clinical chemistry and pathological evaluations reveals phenotypic prototypes. BMC Syst. Biol. 1, 15 (2007)

    Article  Google Scholar 

  21. Chalise, P., Koestler, D.C., Bimali, M., Yu, Q., Fridley, B.L.: Integrative clustering methods for high-dimensional molecular data. Transl. Cancer Res. 3(3), 202–216 (2014)

    Google Scholar 

  22. Kanehisa, M., Goto, S.: KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28(1), 27–30 (2000)

    Article  Google Scholar 

  23. Croft, D., Mundo, A.F., Haw, R., Milacic, M., Weiser, J., Wu, G., Caudy, M., Garapati, P., Gillespie, M., Kamdar, M.R., Jassal, B., Jupe, S., Matthews, L., May, B., Palatnik, S., Rothfels, K., Shamovsky, V., Song, H., Williams, M., Birney, E., Hermjakob, H., Stein, L., D’Eustachio, P.: The Reactome pathway knowledgebase. Nucleic Acids Res. 42(D1), D472–D477 (2014)

    Article  Google Scholar 

  24. Hanisch, D., Zien, A., Zimmer, R., Lengauer, T.: Co-clustering of biological networks and gene expression data. Bioinformatics 18(suppl. 1), S145–S154 (2002)

    Article  Google Scholar 

  25. Huang, D., Pan, W.: Incorporating biological knowledge into distance-based clustering analysis of microarray gene expression data. Bioinformatics 22(10), 1259–1268 (2006)

    Article  Google Scholar 

  26. Rapaport, F., Zinovyev, A., Dutreix, M., Barillot, E., Vert, J.P.: Classification of microarray data using gene networks. BMC Bioinform. 8, 35 (2007)

    Article  Google Scholar 

  27. Pok, G., Liu, J.C.S., Ryu, K.H.: Effective feature selection framework for cluster analysis of microarray data. Bioinformation 4(8), 385–389 (2010)

    Article  Google Scholar 

  28. Prlić, A., Procter, J.B.: Ten Simple rules for the open development of scientific software. PLOS Comput. Biol. 8(12), e1002802 (2012)

    Article  Google Scholar 

  29. Carey, V.J., Stodden, V.: Reproducible research concepts and tools for cancer bioinformatics. In: Ochs, M.F., Casagrande, J.T., Davuluri, R.V. (eds.) Biomedical Informatics for Cancer Research, pp. 149–175. Springer, New York (2010). doi:10.1007/978-1-4419-5714-6_8

    Chapter  Google Scholar 

  30. Diaz, D., Draghici, S.: mirIntegrator: Integrating miRNAs into signaling pathways. R package (2015)

    Google Scholar 

Download references

Acknowledgments

This study used data generated by the TCGA Research Network; we thank donors and research groups for sharing these valuable data. This research was supported in part by the following grants: NIH R01 DK089167, R42 GM087013 and NSF DBI-0965741, and by the Robert J. Sokol Endowment in Systems Biology. Any opinions, findings, conclusions, or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of any of the funding agencies.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Diana Diaz .

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2016 Springer International Publishing AG

About this paper

Cite this paper

Diaz, D., Nguyen, T., Draghici, S. (2016). A Systems Biology Approach for Unsupervised Clustering of High-Dimensional Data. In: Pardalos, P., Conca, P., Giuffrida, G., Nicosia, G. (eds) Machine Learning, Optimization, and Big Data. MOD 2016. Lecture Notes in Computer Science(), vol 10122. Springer, Cham. https://doi.org/10.1007/978-3-319-51469-7_16

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-51469-7_16

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-51468-0

  • Online ISBN: 978-3-319-51469-7

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics