Abstract
Cancer class prediction and discovery can improve on imperfect, non-automated cancer diagnoses, which affect patients' treatment. Serial Analysis of Gene Expression (SAGE) is a relatively new method for monitoring gene expression levels and is expected to contribute significantly to progress in cancer treatment by enabling automatic, precise and early diagnosis. A promising application of SAGE gene expression data is the classification of cancers. In this paper, we build three event models (the multivariate Bernoulli model, the multinomial model and the normalized multinomial model) for SAGE gene expression profiles. The event-model-based methods are compared with the standard Naïve Bayes method. Both binary and multicategory classification are investigated. Experimental results on several SAGE datasets show that the event models generally outperform standard Naïve Bayes. Normalized Information Gain (NIG), an extension of Information Gain (IG), is proposed for gene selection. The impact of gene correlation on classification performance is also investigated.
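To make the distinction between two of the event models concrete, the following is a minimal sketch, not the implementation used in the paper. It assumes each SAGE library is a vector of tag counts and applies the multivariate Bernoulli and multinomial event models as they are usually formulated for text classification, with Laplace smoothing. The function names, the toy libraries and the class labels are hypothetical. The normalized multinomial model and NIG are not shown because the abstract does not define them.

```python
import math

def train(libraries, labels, event_model="multinomial"):
    """Fit class priors and per-tag parameters with Laplace smoothing."""
    n_tags = len(libraries[0])
    model = {"event_model": event_model, "log_prior": {}, "params": {}}
    for c in sorted(set(labels)):
        rows = [x for x, y in zip(libraries, labels) if y == c]
        model["log_prior"][c] = math.log(len(rows) / len(libraries))
        if event_model == "multinomial":
            # Multinomial: a library is a bag of tag occurrences; counts matter.
            totals = [sum(r[t] for r in rows) for t in range(n_tags)]
            denom = sum(totals) + n_tags
            model["params"][c] = [(totals[t] + 1) / denom for t in range(n_tags)]
        else:
            # Multivariate Bernoulli: only presence or absence of a tag matters.
            present = [sum(1 for r in rows if r[t] > 0) for t in range(n_tags)]
            model["params"][c] = [(present[t] + 1) / (len(rows) + 2)
                                  for t in range(n_tags)]
    return model

def predict(model, x):
    """Return the class with the highest log-posterior for tag-count vector x."""
    best_class, best_score = None, -math.inf
    for c, probs in model["params"].items():
        score = model["log_prior"][c]
        for count, p in zip(x, probs):
            if model["event_model"] == "multinomial":
                score += count * math.log(p)
            else:
                score += math.log(p) if count > 0 else math.log(1.0 - p)
        if score > best_score:
            best_class, best_score = c, score
    return best_class

# Hypothetical toy data: four SAGE libraries over four tags.
libraries = [[30, 2, 0, 5], [25, 0, 1, 4], [1, 20, 15, 3], [0, 18, 22, 6]]
labels = ["cancer", "cancer", "normal", "normal"]

multinomial_model = train(libraries, labels, "multinomial")
bernoulli_model = train(libraries, labels, "bernoulli")
print(predict(multinomial_model, [28, 1, 0, 4]))
print(predict(bernoulli_model, [2, 25, 10, 3]))
```

The multinomial variant uses the tag counts themselves, whereas the Bernoulli variant discards them and keeps only whether each tag was observed, which is the main design difference between the two models.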
Cite this article
Jin, X., Xu, A. & Bie, R. Cancer classification from serial analysis of gene expression with event models. Appl Intell 29, 35–46 (2008). https://doi.org/10.1007/s10489-007-0079-6
DOI: https://doi.org/10.1007/s10489-007-0079-6