Kernel-based mixture models for classification

  • Original Paper
  • Journal: Computational Statistics

Abstract

A generative model for classification based on kernels and mixtures of univariate Gamma distributions is introduced. It models the distances from points to cluster centroids in the Hilbert space associated with the inner product induced by the kernel. These distances are readily computed using the kernel trick. Nested within this kernel-based Gamma mixture model (KMM) are two special cases: the kernel-based mixture of exponentials and the kernel-based mixture of spherical Gaussians. The Akaike information criterion is used to select a parsimonious type of mixture model for the data at hand. From this model, a classification rule is developed that uses the distances from each point to every class centroid. The flexibility in the choice of kernel and the probabilistic nature of a mixture distribution make KMM appealing for modeling and inference. A comparison with other popular classification methods shows that the model is very efficient when handling high-dimensional data.
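The kernel-trick computation mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each class centroid is the mean of the mapped training points of that class, so that the squared feature-space distance expands into kernel evaluations only. The function names (`rbf_kernel`, `linear_kernel`, `centroid_sq_distances`) and the `gamma` parameter are illustrative choices, not taken from the paper, and the sketch stops at the distances: it does not fit the Gamma mixtures or apply the paper's classification rule.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gaussian kernel matrix k(a, b) = exp(-gamma * ||a - b||^2) (illustrative choice)."""
    sq = (A**2).sum(axis=1)[:, None] + (B**2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def linear_kernel(A, B):
    """Plain inner product; the feature space is then the input space itself."""
    return A @ B.T

def centroid_sq_distances(X_new, X_train, y_train, kernel):
    """Squared feature-space distance from each new point to each class centroid.

    Assumes the centroid of class c is the mean of phi(x_i) over its n_c
    training points, so that
      ||phi(x) - mu_c||^2 = k(x, x) - (2/n_c) sum_i k(x, x_i)
                            + (1/n_c^2) sum_{i,j} k(x_i, x_j).
    """
    classes = np.unique(y_train)
    # k(x, x) for every new point
    k_xx = np.array([kernel(x[None, :], x[None, :])[0, 0] for x in X_new])
    D = np.empty((X_new.shape[0], classes.size))
    for j, c in enumerate(classes):
        Xc = X_train[y_train == c]
        cross = kernel(X_new, Xc).mean(axis=1)   # (1/n_c) sum_i k(x, x_i)
        within = kernel(Xc, Xc).mean()           # (1/n_c^2) sum_{i,j} k(x_i, x_j)
        D[:, j] = k_xx - 2.0 * cross + within
    return classes, D
```

With `linear_kernel`, the result coincides with the ordinary squared Euclidean distance to each class mean, which gives a quick sanity check. Each row of `D` holds the distances from one point to every class centroid, which is the kind of input the abstract describes for the classification rule.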

References

  • Aizerman M, Braverman E, Rozonoer L (1964) Theoretical foundations of the potential function method in pattern recognition learning. Autom Remote Control 25:821–837

  • Abramson IS (1982) On bandwidth variation in kernel estimates—a square root law. Ann Stat 10:1217–1223

  • Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723

  • Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ (1999) Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA 96(12):6745–6750

  • Anderson E (1935) The irises of the Gaspé Peninsula. Bull Am Iris Soc 59:2–5

  • Armstrong SA, Staunton JE, Silverman LB, Pieters R, den Boer ML, Minden MD, Sallan SE, Lander ES, Golub TR, Korsmeyer SJ (2002) MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nat Genet 30:41–47

  • Asuncion A, Newman DJ (2007) UCI Machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine

  • Banfield JD, Raftery AE (1993) Model-based Gaussian and non-Gaussian clustering. Biometrics 49:803–821

  • Bohanec M, Rajkovic V (1988) Knowledge acquisition and explanation for multi-attribute decision making. In: 8th international workshop on expert systems and their applications, pp 59–78

  • Breiman L, Friedman JH, Olshen RA, Stone CJ (1984) Classification and regression trees. Wadsworth International Group, Belmont

  • Cortez P, Cerdeira A, Almeida F, Matos T, Reis J (2009) Modeling wine preferences by data mining from physicochemical properties. Decis Supp Syst 47(4):547–553

  • Forina M, Armanino C (1982) Eigenvector projection and simplified non-linear mapping of fatty acid content of Italian olive oils. Ann Chim 72:127–141

  • Girolami M, Rogers S (2006) Variational Bayesian multinomial probit regression with Gaussian process priors. Neural Comput 18:1790–1817

  • Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286(5439):531–537

  • Gordon GJ, Jensen RV, Hsiao LL, Gullans SR, Blumenstock JE, Ramaswamy S, Richards WG, Sugarbaker DJ, Bueno R (2002) Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res 62(17):4963–4967

  • Jing XS, Li XS, Zhang D, Lan C, Yang J (2012) Optimal subset-division based discrimination and its kernelization for face and palmprint recognition. Pattern Recogn 45(10):3590–3602

  • Joachims T (1998) Text categorization with support vector machines: learning with many relevant features. In: European conference on machine learning

  • Kaiser HF (1960) The application of electronic computers to factor analysis. Educ Psychol Meas 20(1):141–151

  • Kashima H, Inokuchi A (2002) Kernels for graph classification. In: IEEE ICDM workshop on active mining

  • Kurgan LA, Cios KJ, Tadeusiewicz R, Ogiela M, Goodenday LS (2001) Knowledge discovery approach to automated cardiac SPECT diagnosis. Artif Intell Med 23(2):149–169

  • Lauer F, Guermeur Y (2011) MSVMpack: a multi-class support vector machine package. J Mach Learn Res 12:2293–2296

  • McLachlan G, Basford K (1988) Mixture models: inference and applications to clustering. Marcel Dekker, NY

  • Mangasarian OL, Street WN, Wolberg WH (1995) Breast cancer diagnosis and prognosis via linear programming. Oper Res 43(4):570–577

  • Murua A, Stanberry L, Stuetzle W (2008) On Potts model clustering, kernel K-means and density estimation. J Comput Graph Stat 17(3):629–658

  • Nakai K, Kanehisa M (1991) Expert system for predicting protein localization sites in Gram-negative bacteria. Proteins Struct Funct Genet 11:95–110

  • Nakai K, Kanehisa M (1992) A knowledge base for predicting protein localization sites in eukaryotic cells. Genomics 14:897–911

  • Neal RM (1998) Regression and classification using Gaussian process priors. In: Dawid P, Bernardo JM, Berger JO, Smith AFM (eds) Bayesian statistics 6. Oxford University Press, Oxford, pp 475–501

  • Petricoin EF, Ardekani AM, Hitt BA, Levine PJ, Fusaro VA, Steinberg SM, Mills GB, Simone C, Fishman DA, Kohn EC, Liotta LA (2002) Use of proteomic patterns in serum to identify ovarian cancer. Lancet 359:572–577

  • Schliep A, Costa IG, Steinhoff C, Schonhuth A (2005) Analyzing gene expression time-courses. IEEE/ACM Trans Comput Biol Bioinform 2(3):179–193

  • Schölkopf B, Smola AJ (2002) Learning with kernels. MIT Press, Cambridge

  • Silverman BW (1986) Density estimation for statistics and data analysis. Chapman and Hall, London

  • Smith JW, Everhart JE, Dickson WC, Knowler WC, Johannes RS (1988) Using the ADAP learning algorithm to forecast the onset of diabetes mellitus. Proc Symp Comput Appl Med Care 261–265

  • Song Q, Wang G, Wang C (2012) Automatic recommendation of classification algorithms based on data set characteristics. Pattern Recogn 45(7):2672–2689

  • Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B (1998) Comprehensive identification of cell cycle-regulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization. Mol Biol Cell 9(12):3273–3297

  • Thung K-H, Paramesran R, Lim C-L (2012) Content-based image quality metric using similarity measure of moment vectors. Pattern Recogn 45(6):2193–2204

  • Tsuda K (1999) Support vector classification with asymmetric kernel function. In: Proceedings of the Seventh European Symposium on Artificial Neural Networks, pp 183–188

  • Weston J, Watkins C (1998) Multi-class support vector machines. Technical report CSD-TR-98-04, University of London, Royal Holloway

  • Wicker N, Perrin GR, Thierry JC, Poch O (2001) Secator: a program for inferring protein subfamilies from phylogenetic trees. Mol Biol Evol 18(8):1435–1441

  • Yeoh EJ, Ross ME, Shurtleff SA, Williams WK, Patel D, Mahfouz R, Behm FG, Raimondi SC, Relling MV, Patel A, Cheng C, Campana D, Wilkins D, Zhou X, Li J, Liu H, Pui CH, Evans WE, Naeve C, Wong L, Downing JR (2002) Classification, subtype discovery, and prediction of outcome in pediatric acute lymphoblastic leukemia by gene expression profiling. Cancer Cell 1(2):133–143

  • Yousri NA, Kamel MS, Ismail MA (2009) A distance-relatedness dynamic model for clustering high dimensional data of arbitrary shapes and densities. Pattern Recogn 42(7):1193–1209

Acknowledgments

This research was partially supported by NSERC Grant 327689-06.

Author information

Corresponding author

Correspondence to Nicolas Wicker.

Appendix

The data sets used in Sect. 5 were:

  • The well-known Iris data set (Anderson 1935). It consists of 150 four-dimensional instances, each belonging to one of the three Iris types Setosa, Versicolor and Virginica.

  • The Yeast data set (Nakai and Kanehisa 1991, 1992). It contains 1,459 proteins described by five variables: mcg, gvh, alm, mit and vac. They correspond to different prediction scores. The eight different classes are possible protein localizations: cytosolic or cytoskeletal, nuclear, mitochondrial, membrane protein with no N-terminal signal, membrane protein with uncleaved signal, membrane protein with cleaved signal, extracellular and vacuolar. This is a modification of the original data set, which has dimension eight and ten different classes.

  • The Bupa Liver Disorder data set (Asuncion and Newman 2007). It consists of 345 male individuals described by six variables. The first five variables are blood test results thought to be sensitive to liver disorders. The last variable measures the alcoholic beverage intake per day. The two classes indicate whether or not the patient presents a liver disorder.

  • The Car Evaluation data set (Bohanec and Rajkovic 1988). In this data set, 1,728 cars are described by six variables: buying price, maintenance price, number of doors, person capacity, luggage boot size, and safety of the car. The cars are classified into four groups: unacceptable, acceptable, good, and very good.

  • The Olive Oil data set (Forina and Armanino 1982). It contains 572 samples of olive oil described by their content in eight fatty acids. Nine different types of oil origins are represented, namely South-Apulia, North-Apulia, Calabria, Sicily, Inland-Sardinia, Coast-Sardinia, Umbria, East-Liguria and West-Liguria.

  • The Pima Indians Diabetes data set (Smith et al. 1988). In this data set, 768 persons are described by \(8\) variables (number of times pregnant, plasma glucose concentration, blood pressure, triceps skin fold thickness, 2-h serum insulin, body mass index, diabetes pedigree function and age). The two classes indicate whether or not a person tested positive for diabetes.

  • The Glass data set (Asuncion and Newman 2007). In this data set, 214 glasses are described by nine variables (refractive index, sodium, magnesium, aluminium, silicon, potassium, calcium, barium and iron). There are seven classes, each corresponding to a type of glass.

  • The two Wine Quality data sets (Cortez et al. 2009). The two data sets are related to red and white variants of the Portuguese “Vinho Verde” wine. The 1,599 red and 4,898 white wines are described by eleven physico-chemical variables (fixed acidity, volatile acidity, citric acid, residual sugar, chlorides, free sulfur dioxide, total sulfur dioxide, density, pH, sulphates, alcohol). The classes correspond to a quality score between 0 and 10, resulting in six classes for the red wine and seven for the white wine.

  • The Yeast cell cycle data set (Schliep et al. 2005) describes the expression of \(386\) genes along \(17\) time points. This data set is an excerpt of a larger data set (Spellman et al. 1998). Each gene has its peak expression in one of five different phases, which yields five subgroups.

  • The image segmentation data set SEG (Asuncion and Newman 2007). In this data set, 2,310 regions of \(3\times 3\) pixels are taken from seven outdoor images. These regions are described by \(19\) variables and are classified into seven classes: brickface, sky, foliage, cement, window, path and grass.

  • The waveform data set WAV (Breiman et al. 1984). In this data set, 5,000 instances of waves are obtained by combining \(2\) of \(3\) base waves, leading to three classes, and by adding noise. The number of variables is equal to \(21\).

  • The SPECT Heart data set (Kurgan et al. 2001). In this data set, \(267\) patients are described by \(22\) binary feature variables synthesizing information on collected images for each patient. There are two classes corresponding to either normal or abnormal patients.

  • The Wisconsin Diagnostic Breast Cancer (Wdbc) data set (Mangasarian et al. 1995). The features are computed from digitized images of a fine needle aspirate (FNA) of a breast mass taken from \(569\) patients with or without cancer. The features describe characteristics of the cell nuclei present in the image. Ten real-valued features are computed for each cell nucleus. The mean, the standard error, and the “worst” value (the mean of the three largest values) of each of these ten features were computed for each image, giving \(30\) variables per image.

  • The Satellite data set (Asuncion and Newman 2007). The database consists of the multi-spectral values of pixels in \(3\times 3\)-neighborhoods in a satellite image. There are four different wavelengths, so that the total dimension is \(4\times 9 = 36\). There are 4,435 and 2,000 pixels in the training and the testing set, respectively. The six classes correspond to six area types: red soil, cotton crop, grey soil, damp grey soil, soil with vegetation stubble and very damp grey soil.

  • The colon tumor data set (Alon et al. 1999) contains \(62\) samples coming from biopsies from tumors and normal tissues. The samples are described by 2,000 genes whose expression is observed through oligonucleotide arrays.

  • The ALL/AML data set (Golub et al. 1999) contains \(72\) samples coming either from acute lymphoblastic leukemia (ALL) or acute myelogenous leukemia (AML). The samples are described by the expression of 7,130 genes monitored by DNA microarrays.

  • The lung cancer data set (Gordon et al. 2002) contains \(181\) samples from malignant pleural mesothelioma (MPM) and adenocarcinoma (ADCA) from the lung. Each sample is described by the expression level of 12,533 genes.

  • The all seven cancer data set (Yeoh et al. 2002) contains seven subgroups corresponding to six gene rearrangements, namely T-ALL, hyperdiploid with \(>\)50 chromosomes, BCR-ABL, E2A-PBX1, TEL-AML1 and MLL. The last subgroup was identified in the same study as being distinct from the first known six. In total, there are \(215\) samples described by the expression of 12,559 genes.

  • The MLL data set (Armstrong et al. 2002) contains \(72\) samples described by the expression of 12,583 genes. The samples are divided into three classes: ALL (acute lymphoblastic leukemia), MLL (MLL translocation leukemia) and AML (acute myelogenous leukemia).

  • The ovarian data set (Petricoin et al. 2002) contains \(253\) proteomic spectra from \(91\) normal (control) patients and \(162\) patients with ovarian cancer. All of them are described by 15,154 molecular mass per charge intensities.

  • The sm data set contains a protein alignment description of \(102\) so-called Sm proteins (Wicker et al. 2001). They are divided among eight subgroups found by the Secator method (Wicker et al. 2001). Although this clustering was found by an automatic method, it has been validated biologically (Wicker et al. 2001).

  • The nucrecept data set was kindly provided by Jean-Marie Wurtz and Jérome Fagart for the aforementioned study (Wicker et al. 2001). Its sequences were divided into \(16\) biologically meaningful subgroups.

Cite this article

Murua, A., Wicker, N. Kernel-based mixture models for classification. Comput Stat 30, 317–344 (2015). https://doi.org/10.1007/s00180-014-0535-9
