Abstract
Association rule mining has proven useful for disease diagnosis based on gene expression data because its rules are easy to interpret. One of the main obstacles to its application is the large number of rules generated from high-dimensional gene expression data. In this work, we show that discretization preprocessing is one of the causes of this rule explosion. To alleviate the problem, we propose a Regularized Gaussian Mixture Model (RGMM) for discretizing continuous gene expression data. Under the Minimum Description Length framework, RGMM accounts for both the complexity of the discretization model and the information loss of the discretization procedure. Extensive experiments on real-life gene expression data sets demonstrate the effectiveness of RGMM.
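The general idea of mixture-based discretization can be illustrated with a minimal sketch. It is not the RGMM algorithm itself, whose regularization term is not given in the abstract. The sketch assumes a plain one-dimensional Gaussian mixture fitted by EM, and uses the Bayesian Information Criterion (BIC) as a stand-in for a description-length trade-off between model complexity and fit. All function names (`fit_gmm_1d`, `bic`, `select_k`, `discretize`) are hypothetical.

```python
import math

def normal_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)

def mixture_terms(x, weights, means, variances):
    """Weighted component densities w_j * N(x | mean_j, var_j)."""
    return [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]

def fit_gmm_1d(xs, k, iters=200, var_floor=1e-6):
    """Fit a k-component 1-D Gaussian mixture with plain EM (illustrative only)."""
    n = len(xs)
    srt = sorted(xs)
    # Initialise means at evenly spaced quantiles, variances at the global variance.
    means = [srt[int((j + 0.5) * n / k)] for j in range(k)]
    mu = sum(xs) / n
    total_var = max(sum((x - mu) ** 2 for x in xs) / n, var_floor)
    variances = [total_var] * k
    weights = [1.0 / k] * k
    for _ in range(iters):
        # E-step: posterior responsibility of each component for each value.
        resp = []
        for x in xs:
            p = mixture_terms(x, weights, means, variances)
            s = max(sum(p), 1e-300)
            resp.append([pj / s for pj in p])
        # M-step: re-estimate weights, means and (floored) variances.
        for j in range(k):
            nj = max(sum(r[j] for r in resp), 1e-12)
            weights[j] = nj / n
            means[j] = sum(r[j] * x for r, x in zip(resp, xs)) / nj
            var_j = sum(r[j] * (x - means[j]) ** 2 for r, x in zip(resp, xs)) / nj
            variances[j] = max(var_j, var_floor)
    loglik = sum(
        math.log(max(sum(mixture_terms(x, weights, means, variances)), 1e-300))
        for x in xs
    )
    return weights, means, variances, loglik

def bic(xs, k):
    """BIC = -2 log L + (number of free parameters) * ln n; lower is better."""
    *_, loglik = fit_gmm_1d(xs, k)
    return -2.0 * loglik + (3 * k - 1) * math.log(len(xs))

def select_k(xs, max_k=3):
    """Pick the number of intervals by minimising BIC (assumed stand-in for MDL)."""
    return min(range(1, max_k + 1), key=lambda k: bic(xs, k))

def discretize(xs, k):
    """Map each value to the index of its most probable component, ordered by mean."""
    weights, means, variances, _ = fit_gmm_1d(xs, k)
    order = sorted(range(k), key=lambda j: means[j])
    rank = {j: r for r, j in enumerate(order)}
    labels = []
    for x in xs:
        p = mixture_terms(x, weights, means, variances)
        labels.append(rank[max(range(k), key=lambda j: p[j])])
    return labels

# Toy expression values for one gene: a low group and a high group.
values = [i * 0.1 for i in range(20)] + [10.0 + i * 0.1 for i in range(20)]
labels = discretize(values, 2)
```

Fewer mixture components mean fewer discrete items per gene, which in turn limits the number of candidate association rules. The penalty term in `bic` plays the role that a model-complexity term would play in a description-length criterion.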
Acknowledgements
This work is financially supported by the Natural Science Foundation of China (61070033, 61100148, 61202269), the Natural Science Foundation of Guangdong Province (S2011040004804), the Key Technology Research and Development Programs of Guangdong Province (2010B050400011), the Opening Project of the State Key Laboratory for Novel Software Technology (KFKT2011B19), the Foundation for Distinguished Young Talents in Higher Education of Guangdong, China (LYM11060), the Science and Technology Plan Project of Guangzhou City (12C42111607, 201200000031), and the Science and Technology Plan Project of Panyu District, Guangzhou (2012-Z-03-67).
Cite this article
Cai, R., Hao, Z., Wen, W. et al. Regularized Gaussian Mixture Model based discretization for gene expression data association mining. Appl Intell 39, 607–613 (2013). https://doi.org/10.1007/s10489-013-0435-7
DOI: https://doi.org/10.1007/s10489-013-0435-7