Regularized Gaussian Mixture Model based discretization for gene expression data association mining

Applied Intelligence

Abstract

Association rules are useful for disease diagnosis from gene expression data because they are easy to interpret. A main obstacle to applying them is the large number of rules generated from high-dimensional gene expression data. In this work, we show that discretization preprocessing is one of the causes of this rule explosion. To alleviate the problem, we propose a Regularized Gaussian Mixture Model (RGMM) for discretizing continuous gene expression data. Under the Minimum Description Length framework, RGMM takes into account both the complexity of the discretization model and the information loss incurred by discretization. Extensive experiments on real-life gene expression data sets show the effectiveness of RGMM.
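The general idea of mixture-based discretization with a complexity penalty can be illustrated with a minimal sketch. This is not the RGMM algorithm from the paper, whose regularizer is not given in the abstract. As a stand-in it assumes a plain one-dimensional Gaussian mixture fitted by EM and a BIC-style score, a common approximation to a two-part description length. All function names (`fit_gmm_1d`, `bic_score`, `choose_k`, `discretize`) and the synthetic data are hypothetical and chosen for illustration only.

```python
import math
import random


def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def fit_gmm_1d(xs, k, iters=100):
    """Fit a k-component 1-D Gaussian mixture with plain EM.

    Returns (weights, means, variances, log_likelihood).
    """
    n = len(xs)
    ordered = sorted(xs)
    grand_mean = sum(xs) / n
    grand_var = sum((x - grand_mean) ** 2 for x in xs) / n
    var_floor = 1e-6 * grand_var + 1e-12  # keeps components from collapsing
    # Deterministic start: means at evenly spaced quantiles, shared variance.
    means = [ordered[int((j + 0.5) * n / k)] for j in range(k)]
    variances = [max(grand_var, var_floor)] * k
    weights = [1.0 / k] * k

    def mixture_terms(x):
        return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

    for _ in range(iters):
        # E-step: responsibility of each component for each value.
        resp = []
        for x in xs:
            terms = mixture_terms(x)
            total = sum(terms) or 1e-300
            resp.append([t / total for t in terms])
        # M-step: re-estimate weights, means, and variances.
        for j in range(k):
            nj = sum(r[j] for r in resp) + 1e-12
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            spread = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(spread, var_floor)

    log_lik = sum(math.log(sum(mixture_terms(x)) or 1e-300) for x in xs)
    return weights, means, variances, log_lik


def bic_score(log_lik, k, n):
    """BIC-style score: fit cost plus a penalty per free parameter (assumed stand-in)."""
    n_params = 3 * k - 1  # k means, k variances, k - 1 free weights
    return -2.0 * log_lik + n_params * math.log(n)


def choose_k(xs, max_k=4):
    """Pick the number of components (bins) with the lowest penalized score."""
    scores = {}
    for k in range(1, max_k + 1):
        log_lik = fit_gmm_1d(xs, k)[3]
        scores[k] = bic_score(log_lik, k, len(xs))
    return min(scores, key=scores.get)


def discretize(xs, k):
    """Map each value to the rank (by mean) of its most responsible component."""
    weights, means, variances, _ = fit_gmm_1d(xs, k)
    order = sorted(range(k), key=lambda j: means[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = []
    for x in xs:
        terms = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
        labels.append(rank[max(range(k), key=lambda j: terms[j])])
    return labels


# Synthetic stand-in for one gene's expression values: two well-separated modes.
rng = random.Random(0)
values = [rng.gauss(2.0, 0.5) for _ in range(100)] + [rng.gauss(10.0, 0.5) for _ in range(100)]
best_k = choose_k(values)
labels = discretize(values, best_k)
```

Each distinct label becomes one item for the rule miner, so a score that penalizes extra components keeps the number of bins per gene small. That is one plausible way a discretizer can limit the number of generated rules, under the assumptions stated above.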

Acknowledgements

This work is financially supported by the Natural Science Foundation of China (61070033, 61100148, 61202269), the Natural Science Foundation of Guangdong Province (S2011040004804), the Key Technology Research and Development Programs of Guangdong Province (2010B050400011), the Opening Project of the State Key Laboratory for Novel Software Technology (KFKT2011B19), the Foundation for Distinguished Young Talents in Higher Education of Guangdong, China (LYM11060), the Science and Technology Plan Project of Guangzhou City (12C42111607, 201200000031), and the Science and Technology Plan Project of Panyu District, Guangzhou (2012-Z-03-67).

Author information

Corresponding author

Correspondence to Ruichu Cai.

About this article

Cite this article

Cai, R., Hao, Z., Wen, W. et al. Regularized Gaussian Mixture Model based discretization for gene expression data association mining. Appl Intell 39, 607–613 (2013). https://doi.org/10.1007/s10489-013-0435-7

  • DOI: https://doi.org/10.1007/s10489-013-0435-7
