Bayesian variable selection in multinomial probit model for classifying high-dimensional data

Yang, Aijun; Li, Yunxian; Tang, Niansheng; Lin, Jinguan

doi:10.1007/s00180-014-0540-z

Original Paper
Published: 04 December 2014

Volume 30, pages 399–418, (2015)
Computational Statistics

Abstract

Selecting a small number of relevant genes for classification has received a great deal of attention in microarray data analysis. Methods for microarray data with only two classes remain relevant, but efficient algorithms that handle any number of classes are also needed. In this paper, we propose a Bayesian stochastic search variable selection approach for multi-class classification, which identifies relevant genes by assessing sets of genes jointly. We consider a multinomial probit model with a generalized \(g\)-prior for the regression coefficients. An efficient simulation-based MCMC algorithm is developed for sampling the parameters from the posterior distribution. The algorithm is robust to the choice of initial value and produces posterior probabilities of relevant genes for biological interpretation. We demonstrate the performance of the approach on four well-known gene expression profiling data sets: the leukemia, lymphoma, SRBCTs and NCI60 data. Compared with other classification approaches, our approach selects fewer relevant genes and obtains competitive classification accuracy.
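
The general idea of stochastic search variable selection can be illustrated with a much simpler model than the one in this article. The sketch below is not the authors' algorithm: it swaps the multinomial probit likelihood for an ordinary linear regression and the generalized \(g\)-prior for Zellner's standard \(g\)-prior, so that the marginal likelihood of each variable subset has a closed form. The function names (`log_marginal`, `stochastic_search`), the simulated data, and all tuning values (`g`, `prior_incl`, chain length) are assumptions made for illustration only.

```python
import numpy as np

def log_marginal(X, y, gamma, g):
    """Log marginal likelihood (up to a constant) of the subset `gamma`
    for centred linear regression under Zellner's g-prior."""
    n = y.shape[0]
    idx = np.flatnonzero(gamma)
    yty = float(y @ y)
    if idx.size == 0:
        return -0.5 * n * np.log(yty)
    Xg = X[:, idx]
    coef, *_ = np.linalg.lstsq(Xg, y, rcond=None)
    fit = float(y @ (Xg @ coef))            # y' P_gamma y
    shrunk = yty - g / (1.0 + g) * fit      # shrunken residual sum of squares
    return -0.5 * idx.size * np.log1p(g) - 0.5 * n * np.log(shrunk)

def stochastic_search(X, y, g, prior_incl, n_iter, burn_in, rng):
    """Metropolis search over inclusion indicators; returns the estimated
    posterior inclusion probability of each variable."""
    p = X.shape[1]
    gamma = np.zeros(p, dtype=bool)         # start from the empty model
    log_prior_odds = np.log(prior_incl) - np.log1p(-prior_incl)
    current = log_marginal(X, y, gamma, g)
    counts = np.zeros(p)
    for it in range(n_iter):
        j = rng.integers(p)                 # propose flipping one indicator
        proposal = gamma.copy()
        proposal[j] = not proposal[j]
        cand = log_marginal(X, y, proposal, g)
        prior_term = log_prior_odds if proposal[j] else -log_prior_odds
        if np.log(rng.random()) < cand - current + prior_term:
            gamma, current = proposal, cand
        if it >= burn_in:
            counts += gamma
    return counts / (n_iter - burn_in)

# Simulated data (hypothetical): only the first 3 of 20 variables matter.
rng = np.random.default_rng(0)
n, p = 80, 20
X = rng.standard_normal((n, p))
X -= X.mean(axis=0)
beta = np.zeros(p)
beta[:3] = [3.0, -3.0, 3.0]
y = X @ beta + rng.standard_normal(n)
y -= y.mean()

incl = stochastic_search(X, y, g=float(n), prior_incl=0.1,
                         n_iter=4000, burn_in=1000, rng=rng)
print(np.round(incl, 2))
```

The returned vector plays the role of the posterior probabilities of relevant genes mentioned in the abstract: variables whose indicator is switched on in most retained draws are the ones the search considers relevant. Extending this to a multinomial probit model would additionally require sampling latent utilities, which this sketch does not attempt.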

Acknowledgments

The authors would like to thank two referees and the editor for their constructive comments, which have substantially improved this article.

Author information

Authors and Affiliations

College of Economics and Management, Nanjing Forestry University, Nanjing, Jiangsu, China
Aijun Yang
School of Economics and Management, Southeast University, Nanjing, Jiangsu, China
Aijun Yang
School of Finance, Yunnan University of Economics and Finance, Kunming, Yunnan, China
Yunxian Li
Department of Statistics, Yunnan University, Kunming, Yunnan, China
Yunxian Li & Niansheng Tang
Department of Mathematics, Southeast University, Nanjing, Jiangsu, China
Jinguan Lin

Corresponding author

Correspondence to Aijun Yang.

Additional information

Natural Science Foundation of China (11171065, 11225103) and Natural Science Foundation of Jiangsu (BK20141326).

Electronic supplementary material

Below is the link to the electronic supplementary material.

Supplementary material 1 (pdf 96 KB)

About this article

Cite this article

Yang, A., Li, Y., Tang, N. et al. Bayesian variable selection in multinomial probit model for classifying high-dimensional data. Comput Stat 30, 399–418 (2015). https://doi.org/10.1007/s00180-014-0540-z

Received: 30 May 2013
Accepted: 03 November 2014
Published: 04 December 2014
Issue Date: June 2015
DOI: https://doi.org/10.1007/s00180-014-0540-z
