DOI: 10.1145/956750.956797
Article

Capturing best practice for microarray gene expression data analysis

Published: 24 August 2003

ABSTRACT

Analyzing gene expression data from microarray devices has many important applications in medicine and biology, but presents significant challenges to data mining. Microarray data typically has many attributes (genes) and few examples (samples), making the process of correctly analyzing such data difficult to formulate and prone to common mistakes. For this reason it is unusually important to capture and record good practices for this form of data mining. This paper presents a process for analyzing microarray data, including pre-processing, gene selection, randomization testing, classification and clustering; this process is captured with "Clementine Application Templates". The paper describes the process in detail and includes three case studies showing how the process is applied to 2-class classification, multi-class classification and clustering analyses of publicly available microarray datasets.
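
The abstract names gene selection and randomization testing as steps in the process but does not spell them out on this page. The following is only a minimal sketch of the general idea, not the authors' Clementine templates: it scores each gene with a Welch-style t statistic and then shuffles the class labels to estimate how often chance alone produces an equally strong best gene. The function names, the toy data generator, and the choice of statistic are all assumptions made for illustration.

```python
import random
import statistics


def t_statistic(a, b):
    """Welch-style t statistic between two groups of expression values."""
    denom = (statistics.variance(a) / len(a) + statistics.variance(b) / len(b)) ** 0.5
    if denom == 0:
        return 0.0
    return (statistics.mean(a) - statistics.mean(b)) / denom


def gene_scores(samples, labels):
    """Absolute t statistic per gene; `samples` holds one list of gene values per sample."""
    scores = []
    for g in range(len(samples[0])):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        scores.append(abs(t_statistic(a, b)))
    return scores


def randomization_p_value(samples, labels, n_perm=100, seed=0):
    """Share of label shuffles whose best gene score matches or beats the observed best."""
    rng = random.Random(seed)
    observed = max(gene_scores(samples, labels))
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if max(gene_scores(samples, shuffled)) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_perm + 1)


def make_toy_data(n_per_class=10, n_genes=50, seed=1):
    """Hypothetical data: many genes, few samples, one planted informative gene."""
    rng = random.Random(seed)
    labels = [0] * n_per_class + [1] * n_per_class
    samples = []
    for y in labels:
        row = [rng.gauss(0.0, 1.0) for _ in range(n_genes)]
        row[0] += 6.0 * y  # gene 0 is shifted in class 1 (assumption for the demo)
        samples.append(row)
    return samples, labels


if __name__ == "__main__":
    samples, labels = make_toy_data()
    scores = gene_scores(samples, labels)
    print("top gene index:", scores.index(max(scores)))
    print("randomization p-value:", randomization_p_value(samples, labels))
```

Taking the maximum score over all genes within each shuffle is one way to account for the many-genes, few-samples setting the abstract describes: with enough attributes, some gene will look predictive by chance, and the shuffled labels show how strong that chance effect can be.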


Published in

KDD '03: Proceedings of the ninth ACM SIGKDD international conference on Knowledge discovery and data mining
August 2003
736 pages
ISBN: 1581137370
DOI: 10.1145/956750

Copyright © 2003 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Acceptance Rates

KDD '03 paper acceptance rate: 46 of 298 submissions (15%). Overall acceptance rate: 1,133 of 8,635 submissions (13%).
