Cancer classification from serial analysis of gene expression with event models

Applied Intelligence

Abstract

Cancer class prediction and discovery can improve on imperfect, non-automated cancer diagnoses, which in turn affect patient treatment. Serial Analysis of Gene Expression (SAGE) is a relatively new method for monitoring gene expression levels and is expected to contribute significantly to progress in cancer treatment by enabling automatic, precise and early diagnosis. A promising application of SAGE gene expression data is the classification of cancers. In this paper, we build three event models (the multivariate Bernoulli model, the multinomial model and the normalized multinomial model) for SAGE gene expression profiles. The event-model-based methods are compared with the standard Naïve Bayes method. Both binary and multicategory classification are investigated. Experimental results on several SAGE datasets show that the event models generally outperform standard Naïve Bayes. Normalized Information Gain (NIG), an extension of Information Gain (IG), is proposed for gene selection. The impact of gene correlation on classification performance is also investigated.
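To make the distinction between two of the event models concrete, the following is a minimal sketch, not the authors' implementation. It assumes each SAGE library is a row of tag counts, that the multivariate Bernoulli model uses only tag presence or absence, and that the multinomial model uses the counts themselves. The function names, the Laplace smoothing parameter `alpha`, and the toy data are illustrative assumptions. The normalized multinomial model and NIG are not defined in the abstract, so they are not sketched here.

```python
import numpy as np

def fit_event_model(X, y, model="multinomial", alpha=1.0):
    """Fit a naive Bayes event model on a (libraries x tags) count matrix.

    `model` is "bernoulli" or "multinomial"; `alpha` is an assumed
    Laplace smoothing constant, not a value taken from the paper.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    log_prior = np.log(np.array([(y == c).mean() for c in classes]))
    if model == "bernoulli":
        B = (X > 0).astype(float)  # keep only presence/absence of each tag
        theta = np.array([
            (B[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in classes
        ])  # smoothed P(tag present | class)
    else:
        counts = np.array([X[y == c].sum(axis=0) for c in classes])
        theta = (counts + alpha) / (
            counts.sum(axis=1, keepdims=True) + alpha * X.shape[1]
        )  # smoothed P(tag | class); each row sums to 1
    return {"model": model, "classes": classes,
            "log_prior": log_prior, "theta": theta}

def predict(params, X):
    """Return the most probable class label for each row of X."""
    X = np.asarray(X, dtype=float)
    theta = params["theta"]
    if params["model"] == "bernoulli":
        B = (X > 0).astype(float)
        # Absent tags contribute log(1 - theta) in the Bernoulli model.
        scores = B @ np.log(theta).T + (1 - B) @ np.log(1 - theta).T
    else:
        # Each observed tag occurrence contributes log(theta) once per count.
        scores = X @ np.log(theta).T
    return params["classes"][np.argmax(scores + params["log_prior"], axis=1)]

# Toy data: 4 libraries, 3 tags, two classes (purely illustrative).
X_train = [[10, 0, 1], [8, 1, 0], [0, 9, 2], [1, 12, 0]]
y_train = [0, 0, 1, 1]
X_test = [[7, 0, 0], [0, 5, 1]]

for name in ("bernoulli", "multinomial"):
    params = fit_event_model(X_train, y_train, model=name)
    print(name, predict(params, X_test))
```

The practical difference is that the Bernoulli variant discards how often a tag was observed, while the multinomial variant weights each tag by its count, which matters for SAGE data where expression level is encoded in tag frequency.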

Author information

Corresponding author

Correspondence to Rongfang Bie.

About this article

Cite this article

Jin, X., Xu, A. & Bie, R. Cancer classification from serial analysis of gene expression with event models. Appl Intell 29, 35–46 (2008). https://doi.org/10.1007/s10489-007-0079-6

  • DOI: https://doi.org/10.1007/s10489-007-0079-6
