Summary
We present a new Monte Carlo method for finding the solution of an estimating equation that can be expressed as the expected value of a 'full-data' estimating equation, where the expectation is taken with respect to the distribution of the missing data given the observed data. Equations of this kind arise whenever the EM algorithm can be used. The algorithm alternates between two steps: an S-step, in which the missing data are simulated, either from the conditional distribution described above or from a more convenient importance sampling distribution, and a U-step, in which the parameters are updated using a closed-form expression that requires no numerical maximization. We present two numerical examples to illustrate the method, and we obtain theoretical results establishing the consistency and asymptotic normality of the approximate solution produced by our method.
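To fix ideas, here is a minimal sketch of the two-step alternation in Python, on a toy censored-normal problem of our own devising; the rejection-sampling S-step and the one-line U-step are illustrative simplifications (the actual U-step also accommodates importance sampling weights and averages across iterations), not the algorithm exactly as developed in the paper.

import numpy as np

rng = np.random.default_rng(0)

# Toy setup (our own illustration, not one of the paper's examples):
# full data x_i ~ N(theta, 1); x_i is observed only if x_i <= c, and
# otherwise we learn just that x_i exceeded c.  The full-data estimating
# equation is S(x; theta) = sum_i (x_i - theta).
c, theta_true, n = 1.0, 0.7, 400
x_full = rng.normal(theta_true, 1.0, size=n)
y = np.where(x_full <= c, x_full, np.nan)      # NaN marks a censored value

def s_step(theta, m=20):
    # S-step: simulate each missing x_i from its conditional distribution
    # given the observed data; here that is N(theta, 1) truncated to
    # (c, infinity), drawn by simple rejection sampling.
    sims = []
    for _ in range(m):
        x = y.copy()
        for i in np.flatnonzero(np.isnan(x)):
            z = rng.normal(theta, 1.0)
            while z <= c:
                z = rng.normal(theta, 1.0)
            x[i] = z
        sims.append(x)
    return sims

theta = 0.0                                     # starting value
for j in range(50):                             # S-U iterations
    sims = s_step(theta)
    # U-step: the averaged full-data equation sum_k sum_i (x_ik - theta) = 0
    # is solved in closed form; no numerical maximization is needed.
    theta = np.mean([x.mean() for x in sims])

print(theta)                                    # settles near theta_true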




References
Andrews, D. W. K. (1992) Generic uniform convergence. Econometric Theory 8, 241–257.
Celeux, G. and Diebolt, J. (1985) The SEM algorithm: a probabilistic teacher algorithm derived from the EM algorithm for the mixture problem. Comp. Stat. Quarterly, 2, 73–82.
Crouch, E. A. C. and Spiegelman, D. (1990) The evaluation of integrals of the form \(\int_{-\infty}^{+\infty} f(t)\exp(-t^2)\,dt\): application to logistic-normal models. J. Amer. Statist. Assoc., 85, 464–469.
Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion), J. R. Statist. Soc. B, 39, 1–37.
Diebolt, J. and Ip, E. H. S. (1996) Stochastic EM: method and application. In Markov Chain Monte Carlo in Practice (eds W. R. Gilks, S. Richardson and D. J. Spiegelhalter), pp. 259–273. London: Chapman & Hall.
Diggle, P. J., Liang, K.-Y., and Zeger, S. L. (1994) Analysis of Longitudinal Data, Oxford, UK: Oxford University Press.
Gelfand, A. E. and Carlin, B. P. (1993) Maximum likelihood estimation for constrained or missing data models, Canadian J. Statist. 21, 303–311.
Geyer, C. J. (1991) Markov chain Monte Carlo maximum likelihood. In Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface (ed. E. M. Keramidas), pp. 156–163. Fairfax: Interface Foundation.
Geyer, C. J. and Thompson, E. A. (1992) Constrained Monte Carlo maximum likelihood for dependent data (with discussion). J. R. Statist. Soc. B, 54, 657–699.
Geyer, C. J. (1995) Estimation and Optimization of Functions. In Markov Chain Monte Carlo in Practice (eds. W. R. Gilks, S. Richardson, and D. J. Spiegelhalter), pp. 241–258. London: Chapman & Hall.
Hall, P. and Heyde, C. C. (1980) Martingale Limit Theory and Its Application, New York, NY: Academic Press.
Jones, B., and Kenward, M. G. (1989) Design and Analysis of Cross-Over Studies, London: Chapman & Hall.
Press, W. H., Teukolsky, S. A., Vetterling, W. T., and Flannery, B. P. (1992) Numerical Recipes in Fortran: The Art of Scientific Computing (2nd edition), Cambridge, UK: Cambridge University Press.
Ruppert, D., Reish, R. L., Deriso, R. B., and Carroll, R. J. (1984) Optimization using Stochastic Approximation and Monte Carlo Simulation (with Application to Harvesting of Atlantic Menhaden). Biometrics, 40, 535–545.
Ruppert, D. (1991) Stochastic approximation. In Handbook of Sequential Analysis (eds B. K. Ghosh and P. K. Sen), pp. 503–529. New York: Marcel Dekker.
Satten, G. A. (1996) Rank-based inference in the proportional hazards model for interval censored data. Biometrika, 83, 355–370.
Satten, G. A., Datta, S., and Williamson, J. (1998) Inference based on imputed failure times for the proportional hazards model with interval censored data. J. Amer. Statist. Assoc., 93, 318–327.
Sternberg, M. R. and Satten, G. A. (1999). Discrete-time nonparametric estimation for semi-Markov models of chain-of-events data. Biometrics 55, 514–522.
Tanner, M. A. (1993) Tools for Statistical Inference: Methods for the Exploration of Posterior Distributions and Likelihood Functions. New York: Springer-Verlag.
Wei, G. C. G. and Tanner, M. A. (1990). A Monte Carlo implementation of the EM algorithm and the poor man’s data augmentation algorithm. J. Amer. Statist. Assoc. 85, 699–704.
Wu, C. F. J. (1985) Efficient sequential designs with binary data. J. Amer. Statist. Assoc. 80, 974–984.
Wu, C. F. J. (1986) Maximum likelihood recursion and stochastic approximation in sequential designs. In Adaptive Statistical Procedures and Related Topics (ed. J. Van Ryzin), pp. 298–313. Hayward, CA: Institute of Mathematical Statistics.
Younes, L. (1988) Estimation and annealing for Gibbsian fields. Ann. de l'Inst. Henri Poincaré, Sect. B, Prob. et Statist., 24, 269–294.
Zeger, S. L. and Karim, M. R. (1991) Generalized linear models with random effects: a Gibbs sampling approach. J. Amer. Statist. Assoc., 86, 79–86.
Appendix
Regularity assumptions for Theorem 1.
Let \(\Omega = [\hat\theta - \eta, \hat\theta + \eta]\), \(\eta > 0\), be a compact neighborhood of \(\hat\theta\), let \(\log^+(x) = \max[0, \log(x)]\) for \(x > 0\), let \(\Vert\cdot\Vert\) denote the Euclidean norm, and let \(G_\theta^{-1,i}(u \mid y_i) = \inf\{x : G_\theta^i(x \mid y_i) \ge u\}\). Then the following conditions are assumed to hold for each \(1 \le i \le N\).
C1. There exist non-negative functions \(A_i\) and \(h_i\) with \(\int_0^1 A_i(u)\log^+[A_i(u)]\,du < \infty\) and \(h_i(x) \downarrow 0\) as \(x \downarrow 0\), such that
∀ θ, θ′ ∈ Ω, u ∈ (0,1), and
C2a. There exist non-negative functions \(B_i\) and \(k_i\) with \(\int_0^1 B_i(u)\log^+[B_i(u)]\,du < \infty\) and \(k_i(x) \downarrow 0\) as \(x \downarrow 0\), such that
∀θ, θ′ ∈ Ω, u ∈ (0,1), and
C2b. There exist non-negative functions \(\widetilde{B}_i\) and \(\widetilde{k}_i\) with \(\int_0^1 \widetilde{B}_i(u)\log^+[\widetilde{B}_i(u)]\,du < \infty\) and \(\widetilde{k}_i(x) \downarrow 0\) as \(x \downarrow 0\), such that
C3. There exist non-negative functions \(C_i\) and \(l_i\) with \(\int_0^1 C_i(u)\log^+[C_i(u)]\,du < \infty\) and \(l_i(x) \downarrow 0\) as \(x \downarrow 0\), such that
∀θ,θ′ ∈ Ω, u ∈ (0,1), and
C4. \(f_\theta^i(y_i)\), \(\mathbb{S}_i(y_i \mid \theta)\) and \(\mathbb{H}_i(y_i \mid \theta)\) are twice continuously differentiable in \(\theta\); \(f_\theta^i(y_i)\) is positive; and \(\mathbb{H}_T(\{y_i\} \mid \theta)\) is non-singular on \(\Omega\).
Conditions (C1)–(C3) are needed for the uniform SLLN corresponding to the summands \(w_{ijk}\), \(S_{ijk} w_{ijk}\), \(\widetilde{S}_{ijk} w_{ijk}\) and \(\mathcal{H}_{ijk} w_{ijk}\), respectively. If \(G_\theta^i(\cdot \mid y)\) does not depend on \(\theta\), these conditions take a simpler form.
Condition (C2b) is required only if \(\widetilde{S}_i(x, y \mid \theta) \ne S_i(x, y \mid \theta)\).
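As a small numerical illustration of the generalized inverse \(G_\theta^{-1,i}\) defined above (a toy discrete distribution of our own, coded in Python): on a distribution with jumps, \(G^{-1}(u)\) returns the smallest support point whose cumulative probability reaches \(u\), which is why the infimum form is used.

import numpy as np

# Generalized inverse G^{-1}(u) = inf{x : G(x) >= u} of a discrete CDF
# placing probabilities p on support points xs (an illustrative example).
xs = np.array([0.0, 1.0, 3.0])
p = np.array([0.2, 0.5, 0.3])
cdf = np.cumsum(p)

def g_inverse(u):
    # index of the first support point with cumulative probability >= u
    return xs[np.searchsorted(cdf, u)]

rng = np.random.default_rng(1)
print(g_inverse(rng.uniform(size=8)))   # draws from (xs, p) via G^{-1}(U)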
Proof of Theorem 1. For simplicity of presentation, we assume that \(\theta\) is a real parameter and \(N = 1\). In that case, we can omit the subscript i and denote \(\mathbb{S}_T = \mathbb{S}_1\) by \(\mathbb{S}\), etc., throughout the proof. K will stand for a generic constant.
Note that we can write \(w_{jk} = w_{\theta_j}[G_{\theta_j}^{-1}(U_{jk} \mid y),\, y]\), where the \(U_{jk}\) are i.i.d. U[0,1]. It is easy to see that, by (C1), the conditions BD, P-SLLN and S-LIP in the uniform SLLN of Andrews (1992) are satisfied; here, uniformity refers to uniformity in \(\theta \in \Omega\). Hence, by Theorem 3 of the same paper,
uniformly in \(j \ge 1\), as \(M \to \infty\), provided the \(\theta_j\)'s lie in \(\Omega\); \(\phi_\theta^i(\cdot)\) is defined in (12). In the same way, by (C2) and (C3),
and
uniformly in \(j \ge 1\), as \(M \to \infty\), provided the \(\theta_j\)'s lie in \(\Omega\). Therefore
uniformly in \(j \ge 1\), as \(M \to \infty\), provided the \(\theta_j\)'s lie in \(\Omega\).
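The inverse-CDF representation used at the start of the proof treats the summands as a \(\theta\)-indexed family of functions of one fixed i.i.d. uniform sequence, which is what brings them into the scope of Andrews' (1992) generic uniform convergence results. A brief Python illustration with an exponential \(G_\theta\) (our own example, unrelated to the paper's models):

import numpy as np

rng = np.random.default_rng(2)

# One uniform sample drives the draws for every theta: x(theta) = G_theta^{-1}(U).
# For G_theta the Exponential(theta) CDF, G_theta^{-1}(u) = -log(1 - u)/theta,
# so the Monte Carlo averages below use common random numbers across theta.
u = rng.uniform(size=200_000)
thetas = np.linspace(0.5, 2.0, 16)
mc_means = np.array([(-np.log1p(-u) / th).mean() for th in thetas])

print(np.abs(mc_means - 1.0 / thetas).max())   # small, uniformly over theta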
Next, note that by Taylor expanding both the numerator and the denominator of the second ratio in \(r_j\) we get
Similarly,
Combining (A1.1), (A1.2) and (A1.3) we get
provided the \(\theta_j\)'s lie in \(\Omega\).
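To orient the next step, here is the expansion in generic form (our notation, a sketch of the first-order pattern): since \(\hat\theta\) solves the estimating equation, \(\mathbb{S}(y \mid \hat\theta) = 0\), so

\[
0 \;=\; \mathbb{S}(y \mid \hat\theta) \;=\; \mathbb{S}(y \mid \theta_j) + (\hat\theta - \theta_j)\,\mathbb{H}(y \mid \theta_j^{\ast}),
\qquad\text{whence}\qquad
\hat\theta - \theta_j \;=\; -\,\frac{\mathbb{S}(y \mid \theta_j)}{\mathbb{H}(y \mid \theta_j^{\ast})},
\]

which is, schematically, how the update \(\theta_{j+1} = \bar\theta_j / \hat{\mathbb{H}}_j\) tracks \(\hat\theta\) up to the Monte Carlo error \(r_j\).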
By a Taylor expansion of \(\mathbb{S}(y \mid \hat\theta)\) around \(\theta_j\) we get
where \(\theta_j^{\ast}\) lies between \(\theta_j\) and \(\hat\theta\). Dividing (A1.5) by \(\mathbb{H}(y \mid \theta_j)\) and averaging, we obtain
Using \(\theta_{j+1} = \bar\theta_j / \hat{\mathbb{H}}_j\), (A1.4) and (A1.6), we obtain
provided the \(\theta_j\)'s lie in \(\Omega\).
Let \(P > \max(2K, \eta^{-1})\). Then rewrite the above inequality as
where \(a_j = P\,\vert\hat\theta - \theta_j\vert\) and \(\epsilon_j = r_j P\), \(j \ge 1\). By (A1.1), on a set of probability one, select \(M\) large enough so that \(\epsilon_j < \frac{1}{2}\), \(\forall\, j \ge 1\), provided the \(\theta_j\)'s lie in \(\Omega\). Suppose that the starting value \(\theta_1 \in \Omega = [\hat\theta - \eta, \hat\theta + \eta]\). Then \(a_1 < 1\) and, using (A1.7), we find that \(a_2 \le (\frac{1}{2} + \frac{K}{P}) < 1\), which implies that \(\theta_2 \in \Omega\). Inductively using (A1.7), we find that \(\theta_j \in \Omega\) and \(0 \le a_j \le (\frac{1}{2} + \frac{K}{P})\), \(\forall\, j \ge 1\). Therefore, applying "limsup" to both sides of (A1.7) and using the fact that, for \(a_j \ge 0\),
and that \(r_j \xrightarrow{\ \mathrm{a.s.}\ } 0\) as \(j \to \infty\) (for any given replication size \(M\)), we conclude that, on a set of probability one,
which immediately implies \(\limsup_{j \to \infty} a_j = 0\), since \(\limsup_{j \to \infty} a_j \in [0, 1)\). Hence, the convergence of \(\theta_j\) to \(\hat\theta\) is established. For the case \(N > 1\), note that while equations (A1.1)–(A1.3) become considerably more complicated, equation (A1.4) remains unchanged. □
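The limsup step rests on an elementary fact about non-negative sequences, sketched here under the assumption that (A1.7) has the generic contraction form \(a_{j+1} \le (\epsilon_j + K/P)\,a_j + b_j\) with \(b_j \to 0\): taking limsups on both sides gives

\[
\limsup_{j \to \infty} a_j \;\le\; \Bigl(\frac{1}{2} + \frac{K}{P}\Bigr) \limsup_{j \to \infty} a_j ,
\]

and since \(\frac{1}{2} + \frac{K}{P} < 1\) (because \(P > 2K\)) and \(\limsup_j a_j\) is finite, this forces \(\limsup_j a_j = 0\).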
Regularity assumptions for Theorem 2. Assume the conditions for Theorem 1 plus
N1. The matrix \(\sum_i V^i\) is positive definite.
Proof of Theorem 2. Write
Since \(\hat{\mathbb{H}}_j \xrightarrow{\ P\ } \mathbb{H}_T(\{y_i\} \mid \hat\theta)\) and \(\sqrt{j}\,(\bar\theta_j - \hat\theta)\) is \(O_p(1)\), the first term of (A2.1) converges to zero in probability.
Taylor expanding \(\mathbb{S}_T(\{y_i\} \mid \theta_j)\) around \(\hat\theta\) and averaging, we obtain
therefore, the second term on the right-hand side of (A2.1) is \(o_p(1)\), since \(j^{-1/2} \sum_{j'=1}^{j} \vert\theta_{j'} - \hat\theta\vert = O_p(1)\).
From (5), (6) and (8), we have
It is easy to verify the conditional Lindeberg condition for linear combinations of \((N_i, D_i)\). By the martingale central limit theorem (Corollary 3.1 of Hall and Heyde, 1980), the asymptotic distribution of \(\sqrt{j}\,\sum_{i=1}^{N} (N_i, D_i)\) is the multivariate normal distribution with mean vector
and variance-covariance matrix \(M^{-1} \sum_{i=1}^{N} V^i\), where the \(V^i\) are defined in (15)–(18). By the delta method, we have
where \(\mu_{\mathrm{j}}=\sum_{\mathrm{i}=1}^{\mathrm{N}} \frac{\mathrm{j}^{-1} \sum\limits_{\mathrm{j}^{\prime}=1}^{\mathrm{j}} \mathbb{S}_{\mathrm{i}}(\mathrm{y}_{\mathrm{i}} \vert \theta_{\mathrm{j}^{\prime}}) \phi_{\theta_{\mathrm{j}^{\prime}}}^{\mathrm{i}}(\mathrm{y}_{\mathrm{i}})}{\mathrm{j}^{-1} \sum\limits_{\mathrm{j}^{\prime}=1}^{\mathrm{j}} \phi_{\theta_{\mathrm{j}^{\prime}}}^{\mathrm{i}}(\mathrm{y}_{\mathrm{i}})}\) and where \(\mathbb{V}\) is given in equation (19). Using a Taylor series argument as in the proof of Theorem 1, we get for each i
hence,
Combining (A2.1), (A2.2) and (A2.3), we find that \(\sqrt{j}\,(\theta_{j+1} - \hat\theta) \xrightarrow{\ d\ } N(0, \Sigma)\), where
□
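For reference, the delta-method step used above can be stated generically (our notation, with \(\bar{N}\) and \(\bar{D}\) standing for an averaged numerator and denominator): if \(\sqrt{j}\,(\bar{N} - \nu, \bar{D} - \delta)^{\mathsf T} \xrightarrow{\ d\ } N(0, V)\) with \(\delta \ne 0\), then

\[
\sqrt{j}\left(\frac{\bar{N}}{\bar{D}} - \frac{\nu}{\delta}\right) \xrightarrow{\ d\ } N\bigl(0,\ g^{\mathsf T} V g\bigr),
\qquad g = \Bigl(\frac{1}{\delta},\ -\frac{\nu}{\delta^{2}}\Bigr)^{\mathsf T},
\]

which is the form applied above to each component of the estimating equation.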