Abstract
When two groups of individuals are to be compared with respect to gene expression, there will often be some potentially confounding variables that differ between the groups. Matching is an established approach for obtaining comparable groups and enabling subsequent univariate tests for each gene. Alternatively, the confounders might be incorporated directly into a multivariable regression model for adjustment. In contrast to univariate tests, such models can consider all genes simultaneously. Aiming to combine the advantages of both approaches, matching and multivariable modeling, we consider a matching-based boosting procedure for fitting risk prediction models in two-group settings. This may make it possible to identify and automatically remove problematic observations that might negatively affect the regression model. We therefore compare, in a simulation study, the ability of this combination of matching and boosting to identify important covariates with that of boosting alone, under different covariate correlation structures. Furthermore, we analyze the prediction performance of these approaches on two gene expression microarray studies. The first study comprises patients with B-cell and T-cell type acute lymphoblastic leukemia, and the second comprises patients with acute megakaryoblastic leukemia. While the matching component can in principle guard against problematic observations, the combined approach is seen to improve neither the identification of important covariates nor prediction performance. Therefore, a combination of the two approaches cannot be recommended. Adjustment for potential confounders is seen to provide the best performance, i.e., a pure multivariable regression modeling strategy seems to be promising even in the presence of considerable heterogeneity.
Cite this article
Reiser, V., Porzelius, C., Stampf, S. et al. Can matching improve the performance of boosting for identifying important genes in observational studies? Comput Stat 28, 37–49 (2013). https://doi.org/10.1007/s00180-012-0306-4