
High–Dimensional Sparse Matched Case–Control and Case–Crossover Data: A Review of Recent Works, Description of an R Tool and an Illustration of the Use in Epidemiological Studies

Conference paper in: Computational Intelligence Methods for Bioinformatics and Biostatistics (CIBB 2013)

Abstract

The conditional logistic regression model is the standard tool for analyzing epidemiological studies in which one or more cases (showing the event of interest) are matched with one or more controls (not showing the event). Such designs arise, for example, in matched case–control and case–crossover studies. In sparse, high-dimensional settings, penalized methods such as the Lasso have emerged as alternatives to conventional estimation and variable selection procedures. We describe the R package clogitLasso, which brings together algorithms for estimating the parameters of conditional logistic models with sparsity-inducing penalties. Most individually matched designs are covered and, besides the Lasso, the Elastic Net, the adaptive Lasso and bootstrapped versions are available. Several criteria for choosing the regularization parameter are implemented, accounting for the dependence structure of the data, and stability of the selection is assessed by resampling. We first review recent work related to clogitLasso, and then illustrate its use in the exploratory analysis of a large pharmacoepidemiological study.
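This preview does not include the package's usage examples, so the sketch below only illustrates the workflow described above on simulated 1:1 matched data. The clogitLasso() call and its positional arguments (design matrix, case indicator, stratum identifier) are assumptions based on the abstract, whereas survival::clogit() is shown with its documented formula interface.

```r
## Minimal sketch (not from the paper): simulate 1:1 matched case-control data
## and fit both a conventional and a Lasso-penalized conditional logistic model.
## The clogitLasso() call is an assumption about the package interface; consult
## the package documentation for the exact argument names.

library(survival)      # clogit() for the conventional conditional logistic fit
library(clogitLasso)   # sparse (penalized) conditional logistic regression

set.seed(1)
n.sets <- 100                                  # number of matched sets (strata)
p      <- 20                                   # number of candidate exposures
X      <- matrix(rnorm(2 * n.sets * p), ncol = p)
y      <- rep(c(1L, 0L), times = n.sets)       # 1 = case, 0 = matched control
set.id <- rep(seq_len(n.sets), each = 2)       # stratum (matched set) identifier

## Conventional conditional logistic regression (feasible only when p is small)
fit.std <- clogit(y ~ X + strata(set.id))

## Lasso-penalized conditional logistic regression (assumed interface)
fit.lasso <- clogitLasso(X, y, set.id)
```

On real data, one would explore the regularization path over a grid of penalties, choose the penalty with one of the stratum-aware criteria mentioned in the abstract, and check the stability of the selected exposures by resampling.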



Author information


Correspondence to Marta Avalos.



Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Avalos, M., Grandvalet, Y., Pouyes, H., Orriols, L., Lagarde, E. (2014). High–Dimensional Sparse Matched Case–Control and Case–Crossover Data: A Review of Recent Works, Description of an R Tool and an Illustration of the Use in Epidemiological Studies. In: Formenti, E., Tagliaferri, R., Wit, E. (eds) Computational Intelligence Methods for Bioinformatics and Biostatistics. CIBB 2013. Lecture Notes in Computer Science, vol 8452. Springer, Cham. https://doi.org/10.1007/978-3-319-09042-9_8


  • DOI: https://doi.org/10.1007/978-3-319-09042-9_8

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-09041-2

  • Online ISBN: 978-3-319-09042-9

  • eBook Packages: Computer Science (R0)
