Abstract
The rapid growth of databases in almost every aspect of life has created a strong demand for powerful new tools that turn data into useful information, which has encouraged researchers to explore and develop new machine learning ideas and methods. Mixture models are one family of machine learning techniques that has received considerable attention because of its ability to handle multidimensional data efficiently and effectively. In this paper, we present a solution to two challenging problems: modeling non-Gaussian data and determining the set of relevant features in the data. Non-Gaussian data, which are common in computer vision, image processing, medical, and bioinformatics applications, are modeled by a generative infinite Gamma mixture model. The Gamma distribution is chosen for its ability to handle long-tailed distributions, which allows it to approximate data with outliers well. The proposed model, which can be viewed as a Dirichlet process mixture of Gamma distributions, addresses the feature selection problem by determining a set of relevant features for each data cluster, which provides better interpretability and generalization capabilities. We then propose an efficient algorithm that learns the parameters of this infinite model by estimating all its posterior quantities of interest using Markov chain Monte Carlo (MCMC) simulations. Our algorithm is thus able to perform model selection, parameter learning, and feature selection simultaneously, in a single step, for the Gamma mixture model. Furthermore, we show how the model can be used in two challenging applications, namely medical image classification and gene expression classification, and compare it with other popular models in the literature.
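To make the phrase "Dirichlet process mixture of Gamma distributions" more concrete, the following is a minimal sketch of the generative side only, not the authors' MCMC learning algorithm and without the feature selection component. It assumes a truncated stick-breaking construction, independent features within a cluster, a shape/scale parameterization of the Gamma, and arbitrary illustrative hyperparameters. The function names `stick_breaking_weights` and `sample_gamma_mixture` are hypothetical and do not come from the paper.

```python
import random


def stick_breaking_weights(concentration, truncation, rng):
    """Truncated stick-breaking construction of Dirichlet process weights."""
    weights = []
    remaining = 1.0
    for _ in range(truncation - 1):
        fraction = rng.betavariate(1.0, concentration)
        weights.append(remaining * fraction)
        remaining *= 1.0 - fraction
    weights.append(remaining)  # last component absorbs the leftover mass
    return weights


def sample_gamma_mixture(n_points, n_features, concentration=1.0,
                         truncation=20, seed=0):
    """Draw points from a truncated DP mixture of per-feature Gamma densities.

    Assumptions (not taken from the paper): the truncation level, the
    priors on shape and scale, and feature independence within a cluster.
    """
    rng = random.Random(seed)
    weights = stick_breaking_weights(concentration, truncation, rng)
    # Illustrative priors: one (shape, scale) pair per cluster and feature.
    shapes = [[0.5 + rng.gammavariate(2.0, 2.0) for _ in range(n_features)]
              for _ in range(truncation)]
    scales = [[0.5 + rng.gammavariate(2.0, 1.0) for _ in range(n_features)]
              for _ in range(truncation)]
    data, labels = [], []
    for _ in range(n_points):
        # Pick a cluster according to the stick-breaking weights.
        k = rng.choices(range(truncation), weights=weights)[0]
        # random.gammavariate takes (shape, scale).
        row = [rng.gammavariate(shapes[k][d], scales[k][d])
               for d in range(n_features)]
        data.append(row)
        labels.append(k)
    return data, labels, weights


if __name__ == "__main__":
    data, labels, weights = sample_gamma_mixture(200, 3)
    print("clusters actually used:", len(set(labels)))
```

Although the truncation allows up to 20 components here, a draw typically uses only a few of them, which is the sense in which a Dirichlet process prior lets the data determine the number of clusters.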
Acknowledgments
The completion of this research was made possible thanks to the Natural Sciences and Engineering Research Council of Canada (NSERC).
Additional information
Communicated by Y. Jin.
Cite this article
Elguebaly, T., Bouguila, N. A hierarchical nonparametric Bayesian approach for medical images and gene expressions classification. Soft Comput 19, 189–204 (2015). https://doi.org/10.1007/s00500-014-1242-8
DOI: https://doi.org/10.1007/s00500-014-1242-8