A best linear threshold classification with scale mixture of skew normal populations

Abstract

This paper describes a threshold classification with \(K\) populations whose membership category is associated with the threshold process of a latent variable. The optimal (Bayes) procedure for the classification involves a nonlinear classification rule, and hence its analytic properties and efficient estimation cannot be explored due to its complex distribution. As an alternative, this paper proposes the best linear procedure and verifies its effectiveness. To this end, the paper provides the theory needed to derive the linear rule and its properties, an efficient inference, and a simulation study that sheds light on the performance of the best linear procedure. It also provides three real data examples to demonstrate the applicability of the procedure.


References

  • Anderson TW (2003) An introduction to multivariate statistical analysis, 3rd edn. Wiley, New York

  • Arellano-Valle RB, Branco MD, Genton MG (2006) A unified view on skewed distributions arising from selection. Can J Stat 34:581–601

  • Arnold BC, Beaver RJ, Groeneveld RA, Meeker WQ (1993) The nontruncated marginal of a truncated bivariate normal distribution. Psychometrika 58:471–478

  • Azzalini A, Capitanio A (1999) Statistical applications of the multivariate skew-normal distribution. J R Stat Soc B 61:579–602

  • Berrendero JR, Cárcamo J (2012) The tangent classifier. Am Stat 66:185–194

  • Bishop CM (2006) Pattern recognition and machine learning. Springer, New York

  • Choi SC (1972) Classification of multiply observed data. Biometr J 14:8–11

  • Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via the EM algorithm (with discussion). J R Stat Soc B 39:1–38

  • Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York

  • Fang K-T, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Chapman and Hall, New York

  • Gnanadesikan R (1989) Discriminant analysis and clustering. Stat Sci 4:34–69

  • Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York

  • Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Prentice Hall, London

  • Johnson NL, Kotz S, Balakrishnan N (1994) Continuous univariate distributions, vol 1. Wiley, New York

  • Kim HJ (2008) A class of weighted multivariate normal distributions and its properties. J Multivar Anal 99:1758–1771

  • Kim HJ (2009) Classification of observations into one of two artificially dichotomized classes by using a normal screening variable. Commun Stat Theory Methods 38:607–620

  • Kim HJ (2013) An optimal classification rule for multiple interval-screened scale mixture of normal populations. J Korean Stat Soc 42:191–203

  • Krzanowski WJ (1977) The performance of Fisher’s linear discriminant function under non-optimal conditions. Technometrics 19:191–200

  • Lee JC (1982) Classification of growth curves. In: Krishnaiah PR, Kanal LN (eds) Handbook of statistics, vol 2. North-Holland, Amsterdam, pp 121–137

  • Lin TC, Lin TI (2010) Supervised learning of multivariate skew normal mixture models with missing information. Comput Stat 25:183–201

  • Lin TI, Ho HJ, Chen CL (2009) Analysis of multivariate skew normal models with incomplete data. J Multivar Anal 100:2337–2351

  • McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, 2nd edn. Wiley, New York

  • Meng XL, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80:267–278

  • Pardoe I, Yin X, Cook RD (2007) Graphical tools for quadratic discriminant analysis. Technometrics 49:172–183

  • Reza-Zadkarami M, Rowhani M (2010) Application of skew-normal in classification of satellite image. J Data Sci 8:597–606

  • Sen A, Srivastava M (1990) Regression analysis: theory, methods, and applications. Springer, New York

  • Shumway RH (1982) Discriminant analysis for time series. In: Krishnaiah PR, Kanal LN (eds) Handbook of statistics, vol 2. North-Holland, Amsterdam, pp 1–46

  • Srivastava MS (1984) A measure of skewness and kurtosis and a graphical method for assessing multivariate normality. Stat Probab Lett 2:263–267

  • Sutradhar BC (1990) Discrimination of observations into one of two \(t\) populations. Biometrics 46:827–835

  • Wilhelm S, Manjunath BG (2010) tmvtnorm: a package for the truncated multivariate normal and Student t distribution. http://CRAN.R-project.org/package=tmvtnorm, R package version 1.1-5

  • Wilks SS (1962) Mathematical statistics. Wiley, New York

Acknowledgments

This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (2013R1A2A2A01004790).

Author information


Corresponding author

Correspondence to Hea-Jung Kim.

Appendix

1.1 Appendix: Proof of Theorem 3.1 and derivation of the EM algorithm

Here we give the proof of Theorem 3.1 and the details of the derivation leading to the EM algorithm in Sect. 3.

1.1.1 Proof of Theorem 3.1

To minimize the TPM of the linear THC regions (3.6), we want to make \(P(g|i)\pi _{i}\) and \(P(i |g)\pi _{g}\) small for all \(i, g =1, 2, \ldots , K.\) Note also that the distribution functions \(F_{(v_{\ell -1}, v_{\ell })}(\cdot ; \tau ), \; \ell =i, g,\) in (3.4) and (3.5) include \(\pi _{\ell }\) in their denominators, where \(\pi _{\ell }=\Phi (v_{\ell })-\Phi (v_{\ell -1})\) is the prior probability of \(\Pi _{\ell }.\) Therefore, making \(P(g|i)\pi _{i}\) and \(P(i |g)\pi _{g}\) small is equivalent to making the arguments

$$\begin{aligned} y_{i}=\frac{c_{gi}-\varvec{\theta }_{gi}^{\top }\varvec{\mu }^{(i)}}{\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}}}\;\;\text{ and }\;\;\; y_{g}=\frac{\varvec{\theta }_{gi}^{\top }\varvec{\mu }^{(g)}-c_{gi}}{\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(g)} \varvec{\theta }_{gi}}} \end{aligned}$$
(7.1)

large for all \(i, g =1, 2, \ldots , K.\) Eliminating \(c_{gi}\) from (7.1), we have

$$\begin{aligned} y_{g}= \Big [\varvec{\theta }_{gi}^{\top }\varvec{\gamma }_{gi}-y_{i}\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}}\;\;\Big ]/\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(g)} \varvec{\theta }_{gi}}, \end{aligned}$$
(7.2)

where \(\varvec{\gamma }_{gi}=\varvec{\mu }^{(g)}-\varvec{\mu }^{(i)}.\)

To maximize \(y_{g}\) for given \(y_{i},\) we differentiate \(y_{g}\) with respect to \(\varvec{\theta }_{gi}\) to obtain

$$\begin{aligned} \frac{\partial y_{g}}{\partial \varvec{\theta }_{gi}}=\frac{\left[ \varvec{\gamma }_{gi}-y_{i}\Sigma ^{(i)} \varvec{\theta }_{gi}/\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}}\right] }{\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(g)} \varvec{\theta }_{gi}}}-\frac{\left[ \varvec{\theta }_{gi}^{\top }\varvec{\gamma }_{gi}-y_{i}\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}}\right] \;\Sigma ^{(g)} \varvec{\theta }_{gi}}{(\varvec{\theta }_{gi}^{\top }\Sigma ^{(g)} \varvec{\theta }_{gi})^{3/2}} .\nonumber \\ \end{aligned}$$
(7.3)

If we let

$$\begin{aligned} t_{gi}(g)=\frac{\varvec{\theta }_{gi}^{\top }\varvec{\gamma }_{gi}-y_{i}\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}}}{(\varvec{\theta }_{gi}^{\top }\Sigma ^{(g)} \varvec{\theta }_{gi})} \;\;\;\text{ and }\;\;\;t_{gi}(i)=\frac{y_{i}}{\sqrt{\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}}}, \end{aligned}$$
(7.4)

then setting (7.3) equal to \(\mathbf{0}\) gives

$$\begin{aligned} (t_{gi}(g)\Sigma ^{(g)}+ t_{gi}(i) \Sigma ^{(i)}) \varvec{\theta }_{gi} = \varvec{\gamma }_{gi}. \end{aligned}$$
(7.5)

From (7.4) and (7.5), we see that \(t_{gi}(g)=1-t_{gi}(i).\) If there exist a scalar \(t_{gi}(i)\) with \(0 \le t_{gi}(i) \le 1\) and a vector \(\varvec{\theta }_{gi}\) satisfying (7.5), then \(c_{gi}\) is obtained from (7.1) and (7.4) as \(c_{gi}=t_{gi}(i)(\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}) + \varvec{\theta }_{gi}^{\top } \varvec{\mu }^{(i)}.\) The pair \(\{y_{i}, y_{g}\}\) of (7.1) defined in this way can then be shown to correspond to an admissible linear procedure by an argument analogous to that of Anderson (2003, p. 246).
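To make the construction concrete, here is a minimal numerical sketch (ours, not the paper's; it assumes the population moments \(\varvec{\mu }^{(i)}, \varvec{\mu }^{(g)}, \Sigma ^{(i)}, \Sigma ^{(g)}\) are known and treats \(t = t_{gi}(i)\) as given) that solves (7.5) for \(\varvec{\theta }_{gi}\) and recovers the cutoff \(c_{gi}\):

```python
import numpy as np

# Illustrative sketch of the construction in the proof: for a trial value
# t = t_gi(i) in [0, 1], equation (7.5) gives
#   theta = ((1 - t) * Sigma_g + t * Sigma_i)^{-1} (mu_g - mu_i),
# and the cutoff follows from (7.1) and (7.4). Names are ours, not the paper's.
def linear_rule(t, mu_i, mu_g, Sigma_i, Sigma_g):
    gamma = mu_g - mu_i                               # gamma_gi in (7.2)
    A = (1.0 - t) * Sigma_g + t * Sigma_i             # since t_gi(g) = 1 - t_gi(i)
    theta = np.linalg.solve(A, gamma)                 # theta_gi from (7.5)
    c = t * (theta @ Sigma_i @ theta) + theta @ mu_i  # c_gi
    return theta, c

# Example with two bivariate populations (illustrative values only):
mu_i, mu_g = np.array([0.0, 0.0]), np.array([1.0, 2.0])
Sigma_i = np.array([[1.0, 0.3], [0.3, 1.0]])
Sigma_g = np.array([[1.5, 0.2], [0.2, 0.8]])
theta, c = linear_rule(0.5, mu_i, mu_g, Sigma_i, Sigma_g)
# pairwise rule: favor Pi_g over Pi_i when theta @ x > c
```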

1.1.2 Derivation of the EM algorithm

In the E-step at the \((m+1)\)-th iteration, we need to calculate the \(Q\)-function, defined as

$$\begin{aligned} Q( \Theta \;| \hat{\Theta }^{(m)}) = \sum _{g=1}^{K} Q_{g}( \Theta \;| \hat{\Theta }^{(m)}), \end{aligned}$$
(7.6)

where

$$\begin{aligned} Q_{g}( \Theta \;| \hat{\Theta }^{(m)})= E_{\hat{\Theta }^{(m)}} \left[ \sum _{j=1}^{n_{g}} \log f( \varvec{x}_{gj}, X_{0gj} |\; \Theta , \Pi _{g}) \right] \end{aligned}$$

which is the conditional expectation of the complete-data log-likelihood of \(\mathbf{X}_{gj}\) and \(X_{0gj}\) given the observed data \(\varvec{x}_{gj},\) the current estimate \(\hat{\Theta }^{(m)},\) and the threshold interval condition \(X_{0gj}\in I_{g},\) \( g=1,\ldots , K.\) From the hierarchical representation with \(\kappa (\eta )=1,\) we see that the conditional distribution of \(X_{0gj}\) given \(\hat{\Theta }^{(m)},\) \(\varvec{x}_{gj},\) and \(X_{0gj}\in I_{g}\) is

$$\begin{aligned} X_{0gj} \;|(\hat{\Theta }^{(m)}, \mathbf{X}_{gj}) \sim TN_{(a_{g-1}, a_{g})}( \hat{\zeta }^{(m)}_{gj},\;\hat{\tau }^{(m)} ), \end{aligned}$$

where \(\hat{\zeta }^{(m)}_{gj}=\mu _{0}+\sigma _{0}\hat{\varvec{\delta }}^{(m)\top } \hat{\Lambda }^{(m)}(\varvec{x}_{gj} - \hat{\varvec{\mu }}^{(m)})\) and \(\hat{\tau }^{(m)}= \sigma _{0}^{2}(1-\hat{\varvec{\delta }}^{(m)\top }\hat{\Lambda }^{(m)} \hat{\varvec{\delta }}^{(m)}),\) with \(\Lambda =\Sigma ^{-1}.\) Hence the two conditional expectations of \(Z_{gj}\) involved in \(\hat{H}_{gj}^{(m)}\) of (4.1) can be evaluated easily by using formulas (13.134) and (13.135) in Johnson et al. (1994), or via the R package tmvtnorm of Wilhelm and Manjunath (2010). Denoting \(\hat{\eta }^{(m)}_{gj}=E[X_{0gj} |(\hat{\Theta }^{(m)}, \varvec{x}_{gj})]\) and \(\hat{\gamma }^{(m)}_{gj}=E[X_{0gj}^{2} |(\hat{\Theta }^{(m)}, \varvec{x}_{gj})],\) these are computed as

$$\begin{aligned} \hat{\eta }^{(m)}_{gj}=\hat{\zeta }^{(m)}_{gj}+C_{gj}^{(m)}\; \sqrt{\hat{\tau }^{(m)}} \end{aligned}$$
(7.7)

and

$$\begin{aligned} \hat{\gamma }^{(m)}_{gj}=\left[ 1+\frac{A^{(m)}_{gj}\phi (A^{(m)}_{gj})-B^{(m)}_{gj} \phi (B^{(m)}_{gj})}{\Phi (B^{(m)}_{gj})-\Phi (A^{(m)}_{gj})} -\left( C^{(m)}_{gj}\right) ^{2}\right] \hat{\tau }^{(m)}+\left( \hat{\eta }^{(m)}_{gj} \right) ^{2}, \end{aligned}$$
(7.8)

where \(A^{(m)}_{gj}=(a_{g-1}-\hat{\zeta }^{(m)}_{gj})/\sqrt{\hat{\tau }^{(m)}},\) \(B^{(m)}_{gj}=(a_{g}-\hat{\zeta }^{(m)}_{gj})/\sqrt{\hat{\tau }^{(m)}},\) and \(C_{gj}^{(m)}= \Big (\phi (A^{(m)}_{gj})-\phi (B^{(m)}_{gj})\Big )/ \Big (\Phi (B^{(m)}_{gj})-\Phi (A^{(m)}_{gj})\Big ).\)
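For concreteness, the following sketch (ours; it uses SciPy in place of tmvtnorm and assumes finite truncation limits \(a_{g-1}, a_{g}\)) evaluates \(\hat{\zeta }^{(m)}_{gj}\), \(\hat{\tau }^{(m)}\), and the moments (7.7) and (7.8):

```python
import numpy as np
from scipy.stats import norm

# Sketch of the E-step quantities (notation as in the text; the function
# names are ours, not the paper's).
def conditional_params(x_gj, mu, Sigma, delta, mu0, sigma0):
    Lam = np.linalg.inv(Sigma)                       # Lambda = Sigma^{-1}
    zeta = mu0 + sigma0 * delta @ Lam @ (x_gj - mu)  # zeta_hat_gj^{(m)}
    tau = sigma0**2 * (1.0 - delta @ Lam @ delta)    # tau_hat^{(m)}
    return zeta, tau

# First and second moments of N(zeta, tau) truncated to (a_lower, a_upper),
# per (7.7)-(7.8); assumes finite truncation limits.
def truncated_moments(zeta, tau, a_lower, a_upper):
    s = np.sqrt(tau)
    A = (a_lower - zeta) / s                         # A_gj^{(m)}
    B = (a_upper - zeta) / s                         # B_gj^{(m)}
    Z = norm.cdf(B) - norm.cdf(A)                    # truncation mass
    C = (norm.pdf(A) - norm.pdf(B)) / Z              # C_gj^{(m)}
    eta = zeta + C * s                               # (7.7): E[X_0gj | ...]
    gamma2 = (1.0 + (A * norm.pdf(A) - B * norm.pdf(B)) / Z
              - C**2) * tau + eta**2                 # (7.8): E[X_0gj^2 | ...]
    return eta, gamma2
```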

Therefore, the \((m+1)\)-th iteration of the EM algorithm can be implemented as follows:

E-step: Given the parameter vector \(\Theta =\hat{\Theta }^{(m)},\) compute \(\hat{H}_{gj}^{(m)}\) (the conditional expectation of \(H_{gj}^{(m)}\)) for \(g=1,\ldots , K; j=1, \ldots , n_{g},\) by using (7.7) and (7.8).

M-step:

  1. Update \(\hat{\varvec{\mu }}^{(m)}\) by

    $$\begin{aligned} \hat{\varvec{\mu }}^{(m+1)}=\frac{1}{\nu } \left( \sum _{g=1}^{K}\sum _{j=1}^{n_{g}}\varvec{x}_{gj} - \hat{\varvec{\delta }}^{(m)}\sum _{g=1}^{K} \sum _{j=1}^{n_{g}}\hat{Z}_{gj}^{(m)}\right) , \end{aligned}$$

    where \(\hat{Z}_{gj}^{(m)}=\left( \hat{\eta }_{gj}^{(m)}-\mu _{0}\right) /\sigma _{0}.\)

  2. Update \(\hat{\Psi }^{(m)}\) by

    $$\begin{aligned} \begin{array}{ccl}\hat{\Psi }^{(m+1)}&{}=&{}\frac{1}{\nu }\Big [\sum _{g=1}^{K}\sum _{j=1}^{n_{g}}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})^{\top }\\ &{}&{}-2\sum _{g=1}^{K}\sum _{j=1}^{n_{g}}\hat{Z}_{gj}^{(m)}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})\hat{\varvec{\delta }}^{(m)\top }\\ &{}&{}+\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{U}^{(m)}_{gj}\hat{\varvec{\delta }}^{(m)}\hat{\varvec{\delta }}^{(m)\top }\Big ], \end{array} \end{aligned}$$

    where \(\hat{U}^{(m)}_{gj}=\left( \hat{\gamma }^{(m)}_{gj}-2\mu _{0}\hat{\eta }^{(m)}_{gj}+\mu _{0}^{2}\right) / \sigma _{0}^{2}.\)

  3. Update \(\hat{\varvec{\delta }}^{(m)}\) by

    $$\begin{aligned} \hat{\varvec{\delta }}^{(m+1)}=\frac{\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{Z}_{gj}^{(m)}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})}{\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{U}^{(m)}_{gj}}. \end{aligned}$$

    Since the stability and monotone convergence of the EM algorithm are maintained, the iterations are repeated until a suitable stopping rule is satisfied, e.g., until \(\parallel \hat{\Theta }^{(m+1)}-\hat{\Theta }^{(m)}\parallel \) is sufficiently small. A sketch of one full sweep follows this list.
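Putting the steps together, one full EM sweep might look as follows. This is a minimal sketch under assumptions of our own (data stored as a list of per-population \(n_{g} \times p\) matrices, \(\mu _{0}\) and \(\sigma _{0}\) fixed, and the moments \(\hat{\eta }^{(m)}_{gj}, \hat{\gamma }^{(m)}_{gj}\) precomputed as in the previous sketch); it is not the paper's implementation.

```python
import numpy as np

# One EM sweep (sketch). x[g] is an (n_g, p) array of observations from
# population g; eta[g], gamma2[g] hold the truncated-normal moments
# (7.7)-(7.8) under the current parameter estimates.
def em_sweep(x, eta, gamma2, delta, mu0, sigma0):
    nu = sum(xg.shape[0] for xg in x)                      # total sample size
    Z = [(e - mu0) / sigma0 for e in eta]                  # Z_hat_gj^{(m)}
    U = [(g2 - 2.0 * mu0 * e + mu0**2) / sigma0**2         # U_hat_gj^{(m)}
         for g2, e in zip(gamma2, eta)]
    # M-step 1: update mu
    mu = (sum(xg.sum(axis=0) for xg in x)
          - delta * sum(z.sum() for z in Z)) / nu
    # M-step 2: update Psi
    Psi = sum((xg - mu).T @ (xg - mu) for xg in x)
    Psi -= 2.0 * sum((z[:, None] * (xg - mu)).sum(axis=0)[:, None]
                     @ delta[None, :] for z, xg in zip(Z, x))
    Psi += sum(u.sum() for u in U) * np.outer(delta, delta)
    Psi /= nu
    # M-step 3: update delta
    delta_new = (sum((z[:, None] * (xg - mu)).sum(axis=0)
                     for z, xg in zip(Z, x))
                 / sum(u.sum() for u in U))
    return mu, Psi, delta_new
```

In practice the sweep is iterated, recomputing the E-step moments from the updated parameters, until the stopping rule above is met.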

About this article

Cite this article

Kim, HJ. A best linear threshold classification with scale mixture of skew normal populations. Comput Stat 30, 1–28 (2015). https://doi.org/10.1007/s00180-014-0517-y
