Abstract
The problem of fitting logistic regression to a binary model while allowing for misspecification of the response function is reconsidered. We introduce a two-stage procedure which consists of first ordering predictors with respect to the deviances of the models with the predictor in question omitted, and then choosing the minimizer of a Generalized Information Criterion in the resulting nested family of models. This allows a large number of potential predictors to be considered, in contrast to an exhaustive method. We prove that the procedure consistently chooses the model \(t^{*}\) which is closest, in the averaged Kullback-Leibler sense, to the true binary model \(t\). We then consider the interplay between \(t\) and \(t^{*}\) and prove that for a monotone response function, when there is genuine dependence of the response on the predictors, \(t^{*}\) is necessarily nonempty. This implies consistency of a deviance test of significance under misspecification. For a class of distributions of predictors, including the normal family, Ruud’s result asserts that \(t^{*}=t\). Numerical experiments reveal that for normally distributed predictors the probability of correct selection and the power of the deviance test depend monotonically on Ruud’s proportionality constant \(\eta \).
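To fix ideas, the two-stage procedure summarized above can be sketched in code. This is a minimal illustration, not the authors' implementation: the choice of penalty \(a_n=\log n\), the use of statsmodels, and the function names loglik and two_stage_gic are assumptions made only for the example.

```python
import numpy as np
import statsmodels.api as sm

def loglik(y, X):
    """Maximized log-likelihood of a logistic fit with an intercept."""
    return sm.Logit(y, sm.add_constant(X)).fit(disp=0).llf

def two_stage_gic(y, X, a_n=None):
    """Order predictors by the deviance of the model with that predictor
    omitted, then minimize GIC over the resulting nested family."""
    n, p = X.shape
    if a_n is None:
        a_n = np.log(n)                  # BIC-type penalty; one possible choice
    full_ll = loglik(y, X)
    # Stage 1: deviance increase when the j-th predictor is omitted.
    drop = np.array([2 * (full_ll - loglik(y, np.delete(X, j, axis=1)))
                     for j in range(p)])
    order = np.argsort(-drop)            # most influential predictors first
    # Stage 2: GIC = -2 * loglik + a_n * (number of parameters) over the nested family.
    pbar = y.mean()                      # intercept-only model handled directly
    gic = [-2 * n * (pbar * np.log(pbar) + (1 - pbar) * np.log(1 - pbar)) + a_n]
    for k in range(1, p + 1):
        gic.append(-2 * loglik(y, X[:, order[:k]]) + a_n * (k + 1))
    k_best = int(np.argmin(gic))
    return sorted(order[:k_best].tolist())   # indices of the selected predictors
```

Because the nested family has only \(p+1\) members, the search requires \(O(p)\) model fits rather than the \(2^{p}\) fits of an exhaustive search.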
References
Bache K, Lichman M (2013) UCI machine learning repository. University of California, Irvine
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Bogdan M, Doerge R, Ghosh J (2004) Modifying the Schwarz Bayesian Information Criterion to locate multiple interacting quantitative trait loci. Genetics 167:989–999
Bozdogan H (1987) Model selection and Akaike’s information criterion (AIC): the general theory and its analytical extensions. Psychometrika 52:345–370
Burnham K, Anderson D (2002) Model selection and multimodel inference. A practical information-theoretic approach. Springer, New York
Carroll R, Pederson S (1993) On robustness in the logistic regression model. J R Stat Soc B 55:693–706
Casella G, Giron J, Martinez M, Moreno E (2009) Consistency of Bayes procedures for variable selection. Ann Stat 37:1207–1228
Chen J, Chen Z (2008) Extended Bayesian Information Criteria for model selection with large model spaces. Biometrika 95:759–771
Chen J, Chen Z (2012) Extended BIC for small-n-large-p sparse GLM. Stat Sin 22:555–574
Claeskens G, Hjort N (2008) Model selection and model averaging. Cambridge University Press, Cambridge
Czado C, Santner T (1992) The effect of link misspecification on binary regression inference. J Stat Plann Infer 33:213–231
Fahrmeir L (1987) Asymptotic testing theory for generalized linear models. Statistics 1:65–76
Fahrmeir L (1990) Maximum likelihood estimation in misspecified generalized linear models. Statistics 4:487–502
Fan J, Li R (2001) Variable selection via nonconcave penalized likelihood and its oracle properties. J Am Stat Assoc 96:1348–1360
Foster D, George E (1994) The risk inflation criterion for multiple regression. Ann Stat 22:1947–1975
Hjort N, Pollard D (1993) Asymptotics for minimisers of convex processes. Unpublished manuscript
Konishi S, Kitagawa G (2008) Information criteria and statistical modeling. Springer, New York
Lehmann E (1959) Testing statistical hypotheses. Wiley, New York
Li K, Duan N (1991) Slicing regression: a link-free regression method. Ann Stat 19(2):505–530
Qian G, Field C (2002) Law of iterated logarithm and consistent model selection criterion in logistic regression. Stat Probab Lett 56:101–112
Ruud P (1983) Sufficient conditions for the consistency of maximum likelihood estimation despite misspecification of distribution in multinomial discrete choice models. Econometrica 51(1):225–228
Sin C, White H (1996) Information criteria for selecting possibly misspecified parametric models. J Econometrics 71:207–225
Zak-Szatkowska M, Bogdan M (2011) Modified versions of the Bayesian Information Criterion for sparse generalized linear models. Comput Stat Data Anal 55:2908–2924
Zou H, Hastie T (2005) Regularization and variable selection via the elastic net. J R Stat Soc B 67(2):301–320
Appendix A: Auxiliary Lemmas
This section contains some auxiliary facts used in the proofs. The following theorem states the asymptotic normality of the maximum likelihood estimator.
Theorem 6
Assume (A1) and (A2). Then
where J and K are defined in (5) and (9), respectively.
The above Theorem is stated in [11] (Theorem 3.1) and in [16] ((2.10) and Sect. 5B).
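Assuming, as is standard for results of this type, that \(J\) is the limit of the averaged negative Hessian and \(K\) the limit of the averaged outer product of scores, so that the limiting covariance in Theorem 6 has the sandwich form \(J^{-1}KJ^{-1}\), a plug-in estimate of this covariance can be sketched as below. The function name and the exact correspondence with (5) and (9) are assumptions made for the illustration.

```python
import numpy as np

def sandwich_cov(X, y, beta_hat):
    """Plug-in estimate of J^{-1} K J^{-1} / n for a logistic fit;
    X is assumed to already contain the intercept column."""
    n = len(y)
    p_hat = 1.0 / (1.0 + np.exp(-X @ beta_hat))   # fitted probabilities
    W = p_hat * (1.0 - p_hat)
    J_hat = (X * W[:, None]).T @ X / n            # averaged negative Hessian
    S = X * (y - p_hat)[:, None]                  # per-observation scores
    K_hat = S.T @ S / n                           # averaged outer product of scores
    J_inv = np.linalg.inv(J_hat)
    return J_inv @ K_hat @ J_inv / n              # estimated covariance of beta_hat
```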
Lemma 4
Assume that \(\max _{1\le i\le n}|\mathbf {x}_i'(\varvec{\upgamma }-\varvec{\beta })|\le C\) for some \(C>0\) and some \(\varvec{\upgamma }\in R^{p+1}\). Then for any \(\mathbf {c}\in R^{p+1}\)
Proof
It suffices to show that for \(i=1,\ldots ,n\)
Observe that for \(\varvec{\upgamma }\) such that \(\max _{i\le n}|\mathbf {x}_i'(\varvec{\upgamma }-\varvec{\beta })|\le C\) we have
By interchanging \(\varvec{\beta }\) and \(\varvec{\upgamma }\) in (15) we obtain the upper bound for \(\mathbf {c}'J_{n}(\varvec{\upgamma })\mathbf {c}\).
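For completeness, here is one way the required pointwise comparison can be obtained for the logistic response \(p(t)=e^{t}/(1+e^{t})\), assuming the standard form \(J_{n}(\varvec{\upgamma })=\sum _{i=1}^{n}p(\mathbf {x}_i'\varvec{\upgamma })(1-p(\mathbf {x}_i'\varvec{\upgamma }))\mathbf {x}_i\mathbf {x}_i'\); this is a sketch rather than a restatement of (15). For any \(s,t\in R\),
\[
\frac{p(s)(1-p(s))}{p(t)(1-p(t))}=e^{s-t}\left( \frac{1+e^{t}}{1+e^{s}}\right) ^{2}\ge e^{-|s-t|}\cdot e^{-2|s-t|}=e^{-3|s-t|},
\]
since \(1+e^{t}\ge e^{-|s-t|}(1+e^{s})\). Taking \(s=\mathbf {x}_i'\varvec{\upgamma }\), \(t=\mathbf {x}_i'\varvec{\beta }\) and using \(|s-t|\le C\) yields \(\mathbf {c}'J_{n}(\varvec{\upgamma })\mathbf {c}\ge e^{-3C}\,\mathbf {c}'J_{n}(\varvec{\beta })\mathbf {c}\).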
Lemma 5
Assume (A1) and (A2). Then \(l(\hat{\varvec{\beta }},\mathbf {Y}|\mathbf {X})-l(\varvec{\beta }^{*},\mathbf {Y}|\mathbf {X})=O_{P}(1)\).
Proof
Using a Taylor expansion we have, for some \(\bar{\varvec{\beta }}\) belonging to the line segment joining \(\hat{\varvec{\beta }}\) and \(\varvec{\beta }^{*}\),
Define the set \(A_{n}=\{\varvec{\upgamma }: ||\varvec{\upgamma }-\varvec{\beta }^{*}||\le s_n\}\), where \(s_n\) is an arbitrary sequence such that \(ns_n^2\rightarrow 0\). Using the Schwarz and Markov inequalities we have for any \(C>0\)
Thus, using Lemma 4, the quadratic form in (16) is bounded from above, with probability tending to 1, by
which is \(O_{P}(1)\) as \(\sqrt{n}(\hat{\varvec{\beta }}-\varvec{\beta }^{*})=O_{P}(1)\) in view of Theorem 6 and \(n^{-1}J_{n}(\varvec{\beta }^*)\xrightarrow {P}J(\varvec{\beta }^*)\).
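For concreteness, a worked version of the bound, under the assumption that the score vanishes at \(\hat{\varvec{\beta }}\) so that only the quadratic term survives in the expansion:
\[
l(\hat{\varvec{\beta }},\mathbf {Y}|\mathbf {X})-l(\varvec{\beta }^{*},\mathbf {Y}|\mathbf {X})=\tfrac{1}{2}(\hat{\varvec{\beta }}-\varvec{\beta }^{*})'J_{n}(\bar{\varvec{\beta }})(\hat{\varvec{\beta }}-\varvec{\beta }^{*})=\tfrac{1}{2}\big [\sqrt{n}(\hat{\varvec{\beta }}-\varvec{\beta }^{*})\big ]'\,n^{-1}J_{n}(\bar{\varvec{\beta }})\,\big [\sqrt{n}(\hat{\varvec{\beta }}-\varvec{\beta }^{*})\big ],
\]
which is \(O_{P}(1)\) once \(n^{-1}J_{n}(\bar{\varvec{\beta }})\) is bounded in probability (Lemma 4 together with \(n^{-1}J_{n}(\varvec{\beta }^*)\xrightarrow {P}J(\varvec{\beta }^*)\)) and \(\sqrt{n}(\hat{\varvec{\beta }}-\varvec{\beta }^{*})=O_{P}(1)\) (Theorem 6).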
1.1 A.1 Proof of Lemma 2
As \(\varvec{\beta }^*_{m}=\varvec{\beta }^*_c\) we have for \(c\supseteq m \supseteq t^*\)
which is \(O_{P}(1)\) in view of Remark 1 and Lemma 5.
1.2 A.2 Proof of Lemma 3
The difference \(l(\hat{\varvec{\beta }}_c,\mathbf {Y}|\mathbf {X})-l(\hat{\varvec{\beta }}_w,\mathbf {Y}|\mathbf {X})\) can be written as
It follows from Lemma 5 and Remark 1 that the first term in (17) is \(O_{P}(1)\). We will show that the probability that the second term in (17) is greater than or equal to \(\alpha _1nd_n^2\), for some \(\alpha _1>0\), tends to 1. Define the set \(A_n=\{\varvec{\upgamma }:||\varvec{\upgamma }-\varvec{\beta }^*||\le d_n\}\). Using the Schwarz inequality we have
with probability one. Define \(H_n(\varvec{\upgamma })=l(\varvec{\beta }^*,\mathbf {Y}|\mathbf {X})-l(\varvec{\upgamma },\mathbf {Y}|\mathbf {X})\). Note that \(H_n(\varvec{\upgamma })\) is convex and \(H_n(\varvec{\beta }^*)=0\). For any incorrect model w, in view of definition (11) of \(d_n\), we have \(\hat{\varvec{\beta }}_w\notin A_n\) for sufficiently large n. Since \(H_n\) is convex and vanishes at \(\varvec{\beta }^*\), a lower bound on \(\partial A_n\) extends to the complement of \(A_n\), so it suffices to show that \(P(\inf _{\varvec{\upgamma }\in \partial A_n}H_n(\varvec{\upgamma })> \alpha _1 nd_n^{2})\rightarrow 1\), as \(n\rightarrow \infty \), for some \(\alpha _1>0\). Using a Taylor expansion, for some \(\bar{\varvec{\upgamma }}\) belonging to the line segment joining \(\varvec{\upgamma }\) and \(\varvec{\beta }^*\)
and the last convergence is implied by
It follows from Lemma 4 and (18) that for \(\varvec{\upgamma }\in A_n\)
Let \(\tau =\exp (-3)/2\). Using (20), the probability in (19) can be bounded from above by
Let \(\lambda _{1}^{-}=\lambda _{\min }(J(\varvec{\beta }))/2\). Assuming \(\alpha _1<\lambda _{1}^{-}\tau \), the first probability in (21) can be bounded by
Consider the first probability in (22). Note that \(s_n(\varvec{\beta }^*)\) is a random vector with zero mean and covariance matrix \(K_n(\varvec{\beta }^*)\). Using Markov’s inequality, the fact that \(\text {cov}[s_{n}(\varvec{\beta }^{*})]=nK(\varvec{\beta }^{*})\) and taking \(\alpha _1<\lambda ^{-}\tau \), it can be bounded from above by
where the last convergence follows from \(a_n\rightarrow \infty \).
The convergence to zero of the second probability in (22) follows from \(nd_n^2/a_n\xrightarrow {P}\infty \). As the eigenvalues of a matrix are continuous functions of its entries, we have \(\lambda _{\min }(n^{-1}J_{n}(\varvec{\beta }^*))\xrightarrow {P}\lambda _{\min }(J(\varvec{\beta }^*))\). Thus the convergence to zero of the third probability in (22) follows from the fact that, in view of (A1), the matrix \(J(\varvec{\beta }^*)\) is positive definite. The second term in (21) is bounded from above by
where the last convergence follows from Lemma 4 and (18).
Lemma 6
Assume (A2) and (A3). Then we have \(\max _{i\le n}||\mathbf {x}_i||^2a_n/n\xrightarrow {P}0\).
Proof
Using Markov’s inequality, (A2) and (A3), we have that \(||\mathbf {x}_n||^{2}a_n/n\xrightarrow {P}0\). We show that this implies the conclusion. Denote \(g_n:=\max _{1\le i\le n}||\mathbf {x}_i||^{2}a_n/n\) and \(h_n:=||\mathbf {x}_n||^{2}a_n/n\). Define a sequence \(n_k\) such that \(n_1=1\) and \(n_{k+1}=\min \{n>n_k:\max _{i\le n}||\mathbf {x}_i||^{2}>\max _{i\le n_k}||\mathbf {x}_i||^{2}\}\) (if such \(n_{k+1}\) does not exist, put \(n_{k+1}=n_k\)). Without loss of generality we assume that \(P(A)=1\) for \(A=\{n_k\rightarrow \infty \}\), as on \(A^c\) the conclusion is trivially satisfied. Observe that \(g_{n_k}=h_{n_k}\) and that \(h_{n_k}\xrightarrow {P}0\), being a subsequence of \(h_n\xrightarrow {P}0\); thus also \(g_{n_k}\xrightarrow {P}0\). This implies that for any \(\epsilon >0\) there exists \(n_0\in \mathbf {N}\) such that for \(n_k>n_0\) we have \(P[|g_{n_k}|\le \epsilon ]\ge 1-\epsilon \). Since for \(n\in (n_k,n_{k+1})\) we have \(g_n\le g_{n_k}\), because \(a_n/n\) is nonincreasing, it follows that \(P[|g_n|\le \epsilon ]\ge 1-\epsilon \) for \(n\ge n_0\), i.e. \(g_n\xrightarrow {P}0\).
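The first step can be made explicit; assuming that (A2) yields \(\mathbf {E}||\mathbf {x}_n||^{2}<\infty \) and that (A3) implies \(a_n/n\rightarrow 0\) (a reading of the assumptions used only for this sketch), Markov’s inequality gives, for any \(\epsilon >0\),
\[
P\big (||\mathbf {x}_n||^{2}a_n/n>\epsilon \big )\le \frac{a_n\,\mathbf {E}||\mathbf {x}_n||^{2}}{n\,\epsilon }\rightarrow 0 .
\]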
1.3 A.3 Proof of Proposition 1
Assume first that \(\tilde{\varvec{\beta }}^{*}=0\) and note that this implies \(p(\beta _{0}+\tilde{\mathbf {x}}'\tilde{\varvec{\beta }}^{*})=p(\beta _{0})=C\in (0,1)\). From (8) we have
From (24) we also have
Comparing the last equation with the right-hand side of (25) we obtain \(\mathbf {E}(\tilde{\mathbf {x}}|y=1)=\mathbf {E}\tilde{\mathbf {x}}=\mathbf {E}(\tilde{\mathbf {x}}|y=0)\). Assume now \(\mathbf {E}(\tilde{\mathbf {x}}|y=1)=\mathbf {E}(\tilde{\mathbf {x}}|y=0)\), which implies as before that \(\mathbf {E}(\tilde{\mathbf {x}}|y=1)=\mathbf {E}(\tilde{\mathbf {x}})\). Thus
Since \((\beta _{0}^{*},\tilde{\varvec{\beta }}^{*})\) is unique it suffices to show that (7) and (8) are satisfied for \(\tilde{\varvec{\beta }}^{*}=0\) and \(\beta _{0}^*\) such that \(Ep(\beta _{0}^*)=P(Y=1)\). This easily follows from (26).