ABSTRACT
Analyzing gene expression data from microarray devices has many important applications in medicine and biology, but presents significant challenges to data mining. Microarray data typically has many attributes (genes) and few examples (samples), making the process of correctly analyzing such data difficult to formulate and prone to common mistakes. For this reason, it is unusually important to capture and record good practices for this form of data mining. This paper presents a process for analyzing microarray data, including pre-processing, gene selection, randomization testing, classification and clustering; this process is captured with "Clementine Application Templates". The paper describes the process in detail and includes three case studies, showing how the process is applied to 2-class classification, multi-class classification and clustering analyses for publicly available microarray datasets.
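To make the "many genes, few samples" problem concrete, the following is a minimal Python sketch of gene selection followed by randomization testing. It is not the paper's Clementine template: the signal-to-noise score, the max-over-genes null distribution, the function names, and the synthetic data are all assumptions chosen for illustration.

```python
import numpy as np


def signal_to_noise(X, y):
    """Per-gene score: difference of class means over the sum of class std devs.

    X has shape (samples, genes); y holds 0/1 class labels.
    This particular statistic is an illustrative assumption.
    """
    a, b = X[y == 0], X[y == 1]
    # Small constant avoids division by zero for constant genes.
    return (a.mean(axis=0) - b.mean(axis=0)) / (a.std(axis=0) + b.std(axis=0) + 1e-12)


def randomization_test(X, y, n_perm=200, seed=0):
    """Compare each gene's observed |score| with the best |score| under shuffled labels."""
    rng = np.random.default_rng(seed)
    observed = np.abs(signal_to_noise(X, y))
    max_null = np.empty(n_perm)
    for i in range(n_perm):
        # Shuffling labels breaks any real gene/class association.
        max_null[i] = np.abs(signal_to_noise(X, rng.permutation(y))).max()
    # For each gene: fraction of shuffles whose best gene scored at least as high.
    exceed = (max_null[None, :] >= observed[:, None]).sum(axis=1)
    p_values = (1 + exceed) / (n_perm + 1)
    return observed, p_values


# Synthetic stand-in for a microarray dataset: 24 samples, 300 genes,
# of which only the first 5 genes differ between the two classes.
rng = np.random.default_rng(42)
y = np.array([0] * 12 + [1] * 12)
X = rng.normal(size=(24, 300))
X[y == 1, :5] += 6.0

scores, p_values = randomization_test(X, y)
top_genes = np.argsort(scores)[::-1][:10]  # ten highest-scoring genes
print("top genes:", top_genes)
print("their p-values:", np.round(p_values[top_genes], 3))
```

Taking the maximum score over all genes in each shuffle is one simple way to account for the fact that, with hundreds or thousands of genes, some will separate the classes well purely by chance.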