Abstract
This paper describes a threshold classification with \(K\) populations whose membership category is determined by the thresholding of a latent variable. It is seen that the optimal (Bayes) procedure for the classification involves a nonlinear classification rule; hence its analytic properties and an efficient estimation cannot be explored due to its complex distribution. As an alternative, this paper proposes the best linear procedure and verifies its effectiveness. To this end, the paper provides the theory needed to derive the linear rule and its properties, an efficient inference, and a simulation study that sheds light on the performance of the best linear procedure. It also provides three real data examples to demonstrate the applicability of the best linear procedure.
References
Anderson TW (2003) An introduction to multivariate statistical analysis, 3rd edn. Wiley, New York
Arellano-Valle RB, Branco MD, Genton MG (2006) A unified view on skewed distributions arising from selection. Can J Stat 34:581–601
Arnold BC, Beaver RJ, Groeneveld RA, Meeker WQ (1993) The nontruncated marginal of a truncated bivariate normal distribution. Psychometrika 58:471–478
Azzalini A, Capitanio A (1999) Statistical applications of the multivariate skew-normal distribution. J R Stat Soc B61:579–602
Berrendero JR, Cárcamo J (2012) The tangent classifier. Am Stat 66:185–194
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Choi SC (1972) Classification of multiply observed data. Biometr J 14:8–11
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via EM algorithm (with discussion). J R Stat Soc B39:1–38
Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Fang K-T, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Chapman and Hall, New York
Gnanadesikan R (1989) Discriminant analysis and clustering. Stat Sci 4:34–69
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York
Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Prentice Hall, London
Johnson NL, Kotz S, Balakrishnan N (1994) Continuous univariate distributions, vol 1. Wiley, New York
Kim HJ (2008) A class of weighted multivariate normal distributions and its properties. J Multivar Anal 99:1758–1771
Kim HJ (2009) Classification of observations into one of two artificially dichotomized classes by using a normal screening variable. Commun Stat Theory Methods 38:607–620
Kim HJ (2013) An optimal classification rule for multiple interval-screened scale mixture of normal populations. J Korean Stat Soc 42:191–203
Krzanowski WJ (1977) The performance of Fisher’s linear discriminant function under non-optimal conditions. Technometrics 19:191–200
Lee JC (1982) Classification of growth curves. In: Krishnaiah PR, Kanal LN (eds) Handbook of statistics, vol 2. North Holland, Amsterdam, pp 121–137
Lin TC, Lin TI (2010) Supervised learning of multivariate skew normal mixture models with missing information. Comput Stat 25:183–201
Lin TI, Ho HJ, Chen CL (2009) Analysis of multivariate skew normal models with incomplete data. J Multivar Anal 100:2337–2351
McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, 2nd edn. Wiley, New York
Meng X, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80:267–278
Pardoe I, Yin X, Cook RD (2007) Graphical tools for quadratic discriminant analysis. Technometrics 49:172–183
Reza-Zadkarami M, Rowhani M (2010) Application of skew-normal in classification of satellite image. J Data Sci 8:597–606
Sen A, Srivastava M (1990) Regression analysis: theory, methods, and application. Springer, New York
Shumway RH (1982) Classification of growth curves. In: Krishnaiah PR, Kanal LN (eds) Handbook of statistics, vol 2. North Holland, Amsterdam, pp 1–46
Srivastava MS (1984) A measure of skewness and kurtosis and a graphical method for assessing multivariate normality. Stat Probab Lett 2:263–267
Sutradhar BC (1990) Discrimination of observations into one of two \(t\) populations. Biometrics 46:827–835
Wilhelm S, Manjunath BG (2010) tmvtnorm: A package for the truncated multivariate normal and Student t distributions. http://CRAN.R-project.org/package=tmvtnorm, R package version 1.1-5
Wilks SS (1962) Mathematical statistics. Wiley, New York
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (2013R1A2A2A01004790).
Appendix
1.1 Appendix: Proof of Theorem 3.1 and derivation of the EM algorithm
Here we provide the proof of Theorem 3.1 and the details of the derivation leading to the EM algorithm in Sect. 3.
1.1.1 Proof of Theorem 3.1
In order to minimize the TPM of the linear THC regions (3.6), we wish to make \(P(g|i)\pi _{i}\) and \(P(i |g)\pi _{g}\) small for all \(i, g =1, 2, \ldots , K.\) Note also that the distribution functions \(F_{(v_{\ell -1}, v_{\ell })}(\cdot ; \tau ), \; \ell =i, g,\) in (3.4) and (3.5) include \(\pi _{\ell }\) in their denominators, where \(\pi _{\ell }=\Phi (v_{\ell })-\Phi (v_{\ell -1})\) is the prior probability of \(\Pi _{\ell }.\) Therefore, making \(P(g|i)\pi _{i}\) and \(P(i |g)\pi _{g}\) small is equivalent to making the arguments
large for all \(i, g =1, 2, \ldots , K.\) Eliminating \(c_{gi}\) from (7.1), we have
where \(\varvec{\gamma }_{gi}=\varvec{\mu }^{(g)}-\varvec{\mu }^{(i)}.\)
To maximize \(y_{g},\) for given \(y_{i},\) we differentiate \(y_{g}\) with respect to \(\varvec{\theta }_{gi}\) to obtain
If we let
then (7.3) set equal to \(\mathbf{0}\) is
From (7.4) and (7.5), we see that \(t_{gi}(g)=1-t_{gi}(i).\) If there exist a scalar \(t_{gi}(i)\) (with \(0 \le t_{gi}(i) \le 1\)) and a vector \(\varvec{\theta }_{gi}\) satisfying (7.5), then \(c_{gi}\) is obtained from (7.1) and (7.4) as \(c_{gi}=t_{gi}(i)(\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}) + \varvec{\theta }_{gi}^{\top } \varvec{\mu }^{(i)}.\) We can now show that the set \(\{y_{i}, y_{g}\}\) of (7.1) defined in this way corresponds to an admissible linear procedure by an argument analogous to that of Anderson (2003, p. 246).
1.1.2 Derivation of the EM algorithm
At the \((m+1)\)-th iteration of the E-step, we need to calculate the \(Q\)-function, defined as
where
which is the conditional expectation of the joint distribution of \(\mathbf{X}_{gj}\) and \(X_{0gj}\) given the observed data \(\varvec{x}_{gj},\) the current estimate \(\hat{\Theta }^{(m)},\) and the threshold interval condition \(X_{0gj}\in I_{g},\) \( g=1,\ldots , K.\) From the hierarchical representation with \(\kappa (\eta )=1,\) we see that the conditional distribution of \(X_{0gj}\) given \(\hat{\Theta }^{(m)},\) \(\varvec{x}_{gj},\) and \(X_{0gj}\in I_{g}\) is
where \(\hat{\zeta }^{(m)}_{gj}=\mu _{0}+\sigma _{0}\hat{\varvec{\delta }}^{(m)\top } \hat{\Lambda }^{(m)}(\varvec{x}_{gj} - \hat{\varvec{\mu }}^{(m)})\) and \(\hat{\tau }^{(m)}= \sigma _{0}^{2}(1-\hat{\varvec{\delta }}^{(m)\top }\hat{\Lambda }^{(m)} \hat{\varvec{\delta }}^{(m)}),\) with \(\Lambda =\Sigma ^{-1}.\) Hence the two conditional expectations of \(Z_{gj}\) involved in \(\hat{H}_{gj}^{(m)}\) of (4.1) can be easily evaluated by using formulas (13.134) and (13.135) in Johnson et al. (1994), as well as the \(R\) package "tmvtnorm" provided by Wilhelm and Manjunath (2010). Denoting \(\hat{\eta }^{(m)}_{gj}=E[X_{0gj} |(\hat{\Theta }^{(m)}, \varvec{x}_{gj})]\) and \(\hat{\gamma }^{(m)}_{gj}=E[X_{0gj}^{2} |(\hat{\Theta }^{(m)}, \varvec{x}_{gj})],\) they are estimated by
and
where \(A^{(m)}_{gj}=(a_{g-1}-\hat{\zeta }^{(m)}_{gj})/\sqrt{\hat{\tau }^{(m)}},\) \(B^{(m)}_{gj}=(a_{g}-\hat{\zeta }^{(m)}_{gj})/\sqrt{\hat{\tau }^{(m)}},\) and \(C_{gj}^{(m)}= \Big (\phi (A^{(m)}_{gj})-\phi (B^{(m)}_{gj})\Big )/ \Big (\Phi (B^{(m)}_{gj})-\Phi (A^{(m)}_{gj})\Big ).\)
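For concreteness, the quantities above are the first two moments of a normal variable with mean \(\hat{\zeta }^{(m)}_{gj}\) and variance \(\hat{\tau }^{(m)}\) truncated to \((a_{g-1}, a_{g})\). The following sketch evaluates them with the standard truncated-normal moment formulas (consistent with the definitions of \(A^{(m)}_{gj}\), \(B^{(m)}_{gj}\), and \(C^{(m)}_{gj}\) above) and cross-checks against `scipy.stats.truncnorm`; the function name and use of SciPy are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def trunc_moments(zeta, tau, a_lo, a_hi):
    """First two moments of X0 ~ N(zeta, tau) truncated to (a_lo, a_hi).

    A, B, and C follow the appendix's notation:
    A = (a_lo - zeta)/sqrt(tau), B = (a_hi - zeta)/sqrt(tau),
    C = (phi(A) - phi(B)) / (Phi(B) - Phi(A)).
    """
    s = np.sqrt(tau)
    A = (a_lo - zeta) / s
    B = (a_hi - zeta) / s
    Z = norm.cdf(B) - norm.cdf(A)          # normalizing probability
    C = (norm.pdf(A) - norm.pdf(B)) / Z
    eta = zeta + s * C                      # E[X0 | a_lo < X0 < a_hi]
    gamma = (zeta**2 + tau + 2 * zeta * s * C
             + tau * (A * norm.pdf(A) - B * norm.pdf(B)) / Z)  # E[X0^2 | ...]
    return eta, gamma

# Cross-check against scipy's truncnorm (which takes standardized bounds)
zeta, tau, a_lo, a_hi = 0.5, 2.0, -1.0, 2.0
s = np.sqrt(tau)
rv = truncnorm((a_lo - zeta) / s, (a_hi - zeta) / s, loc=zeta, scale=s)
eta, gamma = trunc_moments(zeta, tau, a_lo, a_hi)
assert abs(eta - rv.mean()) < 1e-10
assert abs(gamma - (rv.var() + rv.mean()**2)) < 1e-10
```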
Therefore, the \((m+1)\)-th iteration of the EM algorithm can be implemented as follows:
E-step: Given the parameter vector \(\Theta =\hat{\Theta }^{(m)},\) compute \(\hat{H}_{gj}^{(m)}\) (the conditional expectation of \(H_{gj}\)) for \(g=1,\ldots , K; j=1, \ldots , n_{g},\) by using (7.7) and (7.8).
M-step:
1. Update \(\hat{\varvec{\mu }}^{(m)}\) by
$$\begin{aligned} \hat{\varvec{\mu }}^{(m+1)}=\frac{1}{\nu } \left( \sum _{g=1}^{K}\sum _{j=1}^{n_{g}}\varvec{x}_{gj} - \hat{\varvec{\delta }}^{(m)}\sum _{g=1}^{K} \sum _{j=1}^{n_{g}}\hat{Z}_{gj}^{(m)}\right) , \end{aligned}$$
where \(\hat{Z}_{gj}^{(m)}=\left( \hat{\eta }_{gj}^{(m)}-\mu _{0}\right) /\sigma _{0}.\)
2. Update \(\hat{\Psi }^{(m)}\) by
$$\begin{aligned} \hat{\Psi }^{(m+1)}&=\frac{1}{\nu }\Big [\sum _{g=1}^{K}\sum _{j=1}^{n_{g}}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})^{\top }\\ &\quad -2\sum _{g=1}^{K}\sum _{j=1}^{n_{g}}\hat{Z}_{gj}^{(m)}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})\hat{\varvec{\delta }}^{(m)\top }\\ &\quad +\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{U}^{(m)}_{gj}\hat{\varvec{\delta }}^{(m)}\hat{\varvec{\delta }}^{(m)\top }\Big ], \end{aligned}$$
where \(\hat{U}^{(m)}_{gj}=\left( \hat{\gamma }^{(m)}_{gj}-2\mu _{0}\hat{\eta }^{(m)}_{gj}+\mu _{0}^{2}\right) / \sigma _{0}^{2}.\)
3. Update \(\hat{\varvec{\delta }}^{(m)}\) by
$$\begin{aligned} \hat{\varvec{\delta }}^{(m+1)}=\frac{\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{Z}_{gj}^{(m)}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})}{\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{U}^{(m)}_{gj}}. \end{aligned}$$
Since the EM algorithm retains its stability and monotone convergence, the iterations are repeated until a suitable stopping rule is satisfied, e.g., until \(\parallel \hat{\Theta }^{(m+1)}-\hat{\Theta }^{(m)}\parallel \) is sufficiently small.
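The three M-step updates above can be sketched in code as follows. This is a minimal illustration, not the paper's implementation: it assumes \(\nu \) equals the total sample size \(n=\sum _{g} n_{g}\), stacks all \(\varvec{x}_{gj}\) into one \(n\times p\) array, and stacks the E-step quantities \(\hat{Z}_{gj}^{(m)}\) and \(\hat{U}_{gj}^{(m)}\) into length-\(n\) vectors; the function and variable names are our own.

```python
import numpy as np

def m_step(X, Z, U, delta_old):
    """One M-step, following the three update formulas in the appendix.

    X: (n, p) array stacking all observations x_gj over g and j.
    Z, U: length-n arrays of Z-hat_gj and U-hat_gj from the E-step.
    delta_old: current estimate delta^(m).
    Assumes nu = n (an assumption of this sketch).
    """
    n, p = X.shape
    nu = n
    # mu^(m+1) = (sum_gj x_gj - delta^(m) * sum_gj Z_gj) / nu
    mu_new = (X.sum(axis=0) - delta_old * Z.sum()) / nu
    R = X - mu_new  # observations centered at the updated mean
    # Psi^(m+1): sum of outer products, cross term, and delta-delta^T term
    Psi_new = (R.T @ R
               - 2.0 * np.outer(Z @ R, delta_old)
               + U.sum() * np.outer(delta_old, delta_old)) / nu
    # delta^(m+1) = sum_gj Z_gj (x_gj - mu^(m+1)) / sum_gj U_gj
    delta_new = (Z @ R) / U.sum()
    return mu_new, Psi_new, delta_new

# Tiny usage example with hand-checkable numbers
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Z = np.array([1.0, 0.0])
U = np.array([2.0, 2.0])
mu, Psi, delta = m_step(X, Z, U, delta_old=np.zeros(2))
# mu -> [2, 3]; Psi -> [[1, 1], [1, 1]]; delta -> [-0.25, -0.25]
```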
Kim, HJ. A best linear threshold classification with scale mixture of skew normal populations. Comput Stat 30, 1–28 (2015). https://doi.org/10.1007/s00180-014-0517-y