Abstract
We developed a methodology that facilitates and strengthens the search for homogeneous subtypes in data. We applied it to medical research on osteoarthritis and Parkinson’s disease, and to chemoinformatics research on the chemical structure of molecule profiles. We release the methodology as the R package SubtypeDiscovery so that our analyses can be reproduced. In this paper, we present the package implementation and illustrate its output on molecular data from chemoinformatics. The methodology combines techniques to process the data, a computational approach that repeats data modelling in order to select the number of subtypes or the type of model, and additional methods to characterize, compare and evaluate the top-ranking models. It therefore does more than cluster data: it produces a complete set of results for conducting a subtype discovery analysis.
Copyright information
© 2008 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Colas, F. et al. (2008). A Scenario Implementation in R for SubtypeDiscovery Examplified on Chemoinformatics Data. In: Margaria, T., Steffen, B. (eds) Leveraging Applications of Formal Methods, Verification and Validation. ISoLA 2008. Communications in Computer and Information Science, vol 17. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-540-88479-8_48
DOI: https://doi.org/10.1007/978-3-540-88479-8_48
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-88478-1
Online ISBN: 978-3-540-88479-8
eBook Packages: Computer Science, Computer Science (R0)