Bayesian variable selection with sparse and correlation priors for high-dimensional data analysis

  • Original Paper
  • Published in: Computational Statistics

Abstract

The main challenge in working with gene expression microarrays is that the sample size is small compared with the large number of variables (genes). In many studies, the main focus is on finding a small subset of genes that are the most important for differentiating between types of cancer, which allows simpler and cheaper diagnostic arrays. In this paper, a sparse Bayesian variable selection method for the probit model is proposed for gene selection and classification. We assign a sparse prior to the regression parameters and perform variable selection by indexing the covariates of the model with a binary vector. The correlation prior assigned to the binary vector in this paper is able to distinguish between models of the same size. The performance of the proposed method is demonstrated on one simulated data set and two well-known real data sets, and the results show that our method is comparable with existing methods in variable selection and classification.
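As a reading aid, the setup described in the abstract can be sketched in the usual latent-variable form of a probit model whose covariates are indexed by a binary vector. The notation below (y, z, x, β, γ, n, p) is assumed for illustration and is not taken from the paper; the specific sparse prior on the regression parameters and the correlation prior on the binary vector are not stated in the abstract, so they are left out.

```latex
% Generic probit model with a binary inclusion vector.
% All symbols are illustrative notation, not the paper's own.
\begin{align*}
  y_i &= \mathbf{1}\{z_i > 0\}, \qquad i = 1, \dots, n, \\
  z_i &= \mathbf{x}_{i,\gamma}^{\top} \boldsymbol{\beta}_{\gamma} + \varepsilon_i,
        \qquad \varepsilon_i \sim N(0, 1), \\
  \gamma &= (\gamma_1, \dots, \gamma_p) \in \{0, 1\}^p,
        \qquad \gamma_j = 1 \iff \text{gene } j \text{ is included}, \\
  \Pr(y_i = 1 \mid \mathbf{x}_i, \gamma, \boldsymbol{\beta}_{\gamma})
      &= \Phi\!\left(\mathbf{x}_{i,\gamma}^{\top} \boldsymbol{\beta}_{\gamma}\right).
\end{align*}
```

Here the subscript γ denotes the subvector of covariates or coefficients with γ_j = 1, and Φ is the standard normal distribution function. Two models of the same size correspond to two vectors γ with the same number of ones, which is the case the abstract says the correlation prior can distinguish.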

Acknowledgments

The authors gratefully acknowledge the financial support of the Natural Science Foundation of China (11501294, 11101432, 11571073), the China Postdoctoral Science Foundation (2015M580374), and the Natural Science Foundation of Jiangsu (BK20141326).

Author information

Corresponding author

Correspondence to Xuejun Jiang.

Electronic supplementary material

Supplementary material 1 (rar 9 KB)

About this article

Cite this article

Yang, A., Jiang, X., Shu, L. et al. Bayesian variable selection with sparse and correlation priors for high-dimensional data analysis. Comput Stat 32, 127–143 (2017). https://doi.org/10.1007/s00180-016-0665-3

  • DOI: https://doi.org/10.1007/s00180-016-0665-3
