Abstract
This paper describes a threshold classification with \(K\) populations whose membership category is determined by the thresholding of a latent variable. It is seen that the optimal (Bayes) procedure for the classification involves a nonlinear classification rule; hence its analytic properties and an efficient estimation cannot be explored due to its complex distribution. As an alternative, this paper proposes the best linear procedure and verifies its effectiveness. To this end, the paper provides the theory needed to derive the linear rule and its properties, an efficient inference, and a simulation study that sheds light on the performance of the best linear procedure. It also provides three real data examples to demonstrate the applicability of the best linear procedure.
References
Anderson TW (2003) An introduction to multivariate statistical analysis, 3rd edn. Wiley, New York
Arellano-Valle RB, Branco MD, Genton MG (2006) A unified view on skewed distributions arising from selection. Can J Stat 34:581–601
Arnold BC, Beaver RJ, Groeneveld RA, Meeker WQ (1993) The nontruncated marginal of a truncated bivariate normal distribution. Psychometrika 58:471–478
Azzalini A, Capitanio A (1999) Statistical applications of the multivariate skew-normal distribution. J R Stat Soc B61:579–602
Berrendero JR, Cárcamo J (2012) The tangent classifier. Am Stat 66:185–194
Bishop CM (2006) Pattern recognition and machine learning. Springer, New York
Choi SC (1972) Classification of multiply observed data. Biometr J 14:8–11
Dempster AP, Laird NM, Rubin DB (1977) Maximum likelihood from incomplete data via EM algorithm (with discussion). J R Stat Soc B39:1–38
Devroye L, Györfi L, Lugosi G (1996) A probabilistic theory of pattern recognition. Springer, New York
Fang K-T, Kotz S, Ng KW (1990) Symmetric multivariate and related distributions. Chapman and Hall, New York
Gnanadesikan R (1989) Discriminant analysis and clustering. Stat Sci 4:34–69
Hastie T, Tibshirani R, Friedman J (2009) The elements of statistical learning: data mining, inference, and prediction, 2nd edn. Springer, New York
Johnson RA, Wichern DW (2007) Applied multivariate statistical analysis, 6th edn. Prentice Hall, London
Johnson NL, Kotz S, Balakrishnan N (1994) Continuous univariate distributions, vol 1. Wiley, New York
Kim HJ (2008) A class of weighted multivariate normal distributions and its properties. J Multivar Anal 99:1758–1771
Kim HJ (2009) Classification of observations into one of two artificially dichotomized classes by using a normal screening variable. Commun Stat Theory Methods 38:607–620
Kim HJ (2013) An optimal classification rule for multiple interval-screened scale mixture of normal populations. J Korean Stat Soc 42:191–203
Krzanowski WJ (1977) The performance of Fisher’s linear discriminant function under non-optimal conditions. Technometrics 19:191–200
Lee JC (1982) Classification of growth curves. In: Krishnaiah PR, Kanal LN (eds) Handbook of statistics, vol 2. North Holland, Amsterdam, pp 121–137
Lin TC, Lin TI (2010) Supervised learning of multivariate skew normal mixture models with missing information. Comput Stat 25:183–201
Lin TI, Ho HJ, Chen CL (2009) Analysis of multivariate skew normal models with incomplete data. J Multivar Anal 100:2337–2351
McLachlan GJ, Krishnan T (2008) The EM algorithm and extensions, 2nd edn. Wiley, New York
Meng X, Rubin DB (1993) Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika 80:267–278
Pardoe I, Yin X, Cook RD (2007) Graphical tools for quadratic discriminant analysis. Technometrics 49:172–183
Reza-Zadkarami M, Rowhani M (2010) Application of skew-normal in classification of satellite image. J Data Sci 8:597–606
Sen A, Srivastava M (1990) Regression analysis: theory, methods, and application. Springer, New York
Shumway RH (1982) Classification of growth curves. In: Krishnaiah PR, Kanal LN (eds) Handbook of statistics, vol 2. North Holland, Amsterdam, pp 1–46
Srivastava MS (1984) A measure of skewness and kurtosis and a graphical method for assessing multivariate normality. Stat Probab Lett 2:263–267
Sutradhar BC (1990) Discrimination of observations into one of two \(t\) populations. Biometrics 46:827–835
Wilhelm S, Manjunath BG (2010) tmvtnorm: A package for the truncated multivariate normal and Student t distributions. http://CRAN.R-project.org/package=tmvtnorm, R package version 1.1-5
Wilks SS (1962) Mathematical statistics. Wiley, New York
Acknowledgments
This research was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Science, ICT and Future Planning (2013R1A2A2A01004790).
Appendix
1.1 Appendix: Proof of Theorem 3.1 and derivation of the EM algorithm
Here we provide the proof of Theorem 3.1 and the details of the derivation leading to the EM algorithm in Sect. 3.
1.1.1 Proof of Theorem 3.1
In order to minimize the TPM of the linear THC regions (3.6), we wish to make \(P(g|i)\pi _{i}\) and \(P(i |g)\pi _{g}\) small for all \(i, g =1, 2, \ldots , K.\) Note also that the distribution functions \(F_{(v_{\ell -1}, v_{\ell })}(\cdot ; \tau ), \; \ell =i, g,\) in (3.4) and (3.5) include \(\pi _{\ell }\) in their denominators, where \(\pi _{\ell }=\Phi (v_{\ell })-\Phi (v_{\ell -1})\) is the prior probability of \(\Pi _{\ell }.\) Therefore, making \(P(g|i)\pi _{i}\) and \(P(i |g)\pi _{g}\) small is equivalent to making the arguments
large for all \(i, g =1, 2, \ldots , K.\) Eliminating \(c_{gi}\) from (7.1), we have
where \(\varvec{\gamma }_{gi}=\varvec{\mu }^{(g)}-\varvec{\mu }^{(i)}.\)
To maximize \(y_{g},\) for given \(y_{i},\) we differentiate \(y_{g}\) with respect to \(\varvec{\theta }_{gi}\) to obtain
If we let
then (7.3) set equal to \(\mathbf{0}\) is
From (7.4) and (7.5), we see that \(t_{gi}(g)=1-t_{gi}(i).\) If there exist a scalar \(t_{gi}(i)\) (with \(0 \le t_{gi}(i) \le 1\)) and a vector \(\varvec{\theta }_{gi}\) satisfying (7.5), then \(c_{gi}\) is obtained from (7.1) and (7.4) as \(c_{gi}=t_{gi}(i)(\varvec{\theta }_{gi}^{\top }\Sigma ^{(i)} \varvec{\theta }_{gi}) + \varvec{\theta }_{gi}^{\top } \varvec{\mu }^{(i)}.\) We can now show that the set \(\{y_{i}, y_{g}\}\) of (7.1) defined in this way corresponds to an admissible linear procedure by an argument analogous to that of Anderson (2003, p. 246).
1.1.2 Derivation of the EM algorithm
At the \((m+1)\)-th iteration of the E-step, we need to calculate the \(Q\)-function, defined as
where
which is the conditional expectation of the joint distribution of \(\mathbf{X}_{gj}\) and \(X_{0gj}\) given the observed data \(\varvec{x}_{gj},\) the current estimate \(\hat{\Theta }^{(m)},\) and the threshold interval condition \(X_{0gj}\in I_{g},\) \( g=1,\ldots , K.\) From the hierarchical representation with \(\kappa (\eta )=1,\) we see that the conditional distribution of \(X_{0gj}\) given \(\hat{\Theta }^{(m)},\) \(\varvec{x}_{gj},\) and \(X_{0gj}\in I_{g}\) is
where \(\hat{\zeta }^{(m)}_{gj}=\mu _{0}+\sigma _{0}\hat{\varvec{\delta }}^{(m)\top } \hat{\Lambda }^{(m)}(\varvec{x}_{gj} - \hat{\varvec{\mu }}^{(m)})\) and \(\hat{\tau }^{(m)}= \sigma _{0}^{2}(1-\hat{\varvec{\delta }}^{(m)\top }\hat{\Lambda }^{(m)} \hat{\varvec{\delta }}^{(m)}),\) with \(\Lambda =\Sigma ^{-1}.\) Hence the two conditional expectations of \(Z_{gj}\) involved in \(\hat{H}_{gj}^{(m)}\) of (4.1) can be easily evaluated by using formulas (13.134) and (13.135) in Johnson et al. (1994), as well as the \(R\) package "tmvtnorm" provided by Wilhelm and Manjunath (2010). Denoting \(\hat{\eta }^{(m)}_{gj}=E[X_{0gj} |(\hat{\Theta }^{(m)}, \varvec{x}_{gj})]\) and \(\hat{\gamma }^{(m)}_{gj}=E[X_{0gj}^{2} |(\hat{\Theta }^{(m)}, \varvec{x}_{gj})],\) they are estimated by
and
where \(A^{(m)}_{gj}=(a_{g-1}-\hat{\zeta }^{(m)}_{gj})/\sqrt{\hat{\tau }^{(m)}},\) \(B^{(m)}_{gj}=(a_{g}-\hat{\zeta }^{(m)}_{gj})/\sqrt{\hat{\tau }^{(m)}},\) and \(C_{gj}^{(m)}= \Big (\phi (A^{(m)}_{gj})-\phi (B^{(m)}_{gj})\Big )/ \Big (\Phi (B^{(m)}_{gj})-\Phi (A^{(m)}_{gj})\Big ).\)
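For concreteness, the quantities above are the first two moments of a normal variable with mean \(\hat{\zeta }^{(m)}_{gj}\) and variance \(\hat{\tau }^{(m)}\) truncated to \((a_{g-1}, a_{g})\). The following sketch evaluates them with the standard truncated-normal moment formulas (consistent with the definitions of \(A^{(m)}_{gj}\), \(B^{(m)}_{gj}\), and \(C^{(m)}_{gj}\) above) and cross-checks against `scipy.stats.truncnorm`; the function name and use of SciPy are illustrative, not from the paper.

```python
import numpy as np
from scipy.stats import norm, truncnorm

def trunc_moments(zeta, tau, a_lo, a_hi):
    """First two moments of X0 ~ N(zeta, tau) truncated to (a_lo, a_hi).

    A, B, and C follow the appendix's notation:
    A = (a_lo - zeta)/sqrt(tau), B = (a_hi - zeta)/sqrt(tau),
    C = (phi(A) - phi(B)) / (Phi(B) - Phi(A)).
    """
    s = np.sqrt(tau)
    A = (a_lo - zeta) / s
    B = (a_hi - zeta) / s
    Z = norm.cdf(B) - norm.cdf(A)          # normalizing probability
    C = (norm.pdf(A) - norm.pdf(B)) / Z
    eta = zeta + s * C                      # E[X0 | a_lo < X0 < a_hi]
    gamma = (zeta**2 + tau + 2 * zeta * s * C
             + tau * (A * norm.pdf(A) - B * norm.pdf(B)) / Z)  # E[X0^2 | ...]
    return eta, gamma

# Cross-check against scipy's truncnorm (which takes standardized bounds)
zeta, tau, a_lo, a_hi = 0.5, 2.0, -1.0, 2.0
s = np.sqrt(tau)
rv = truncnorm((a_lo - zeta) / s, (a_hi - zeta) / s, loc=zeta, scale=s)
eta, gamma = trunc_moments(zeta, tau, a_lo, a_hi)
assert abs(eta - rv.mean()) < 1e-10
assert abs(gamma - (rv.var() + rv.mean()**2)) < 1e-10
```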
Therefore, the \((m+1)\)-th iteration of the EM algorithm can be implemented as follows:
E-step: Given the parameter vector \(\Theta =\hat{\Theta }^{(m)},\) compute \(\hat{H}_{gj}^{(m)}\) (the conditional expectation of \(H_{gj}\)) for \(g=1,\ldots , K; j=1, \ldots , n_{g},\) by using (7.7) and (7.8).
M-step:
1. Update \(\hat{\varvec{\mu }}^{(m)}\) by
$$\begin{aligned} \hat{\varvec{\mu }}^{(m+1)}=\frac{1}{\nu } \left( \sum _{g=1}^{K}\sum _{j=1}^{n_{g}}\varvec{x}_{gj} - \hat{\varvec{\delta }}^{(m)}\sum _{g=1}^{K} \sum _{j=1}^{n_{g}}\hat{Z}_{gj}^{(m)}\right) , \end{aligned}$$
where \(\hat{Z}_{gj}^{(m)}=\left( \hat{\eta }_{gj}^{(m)}-\mu _{0}\right) /\sigma _{0}.\)
2. Update \(\hat{\Psi }^{(m)}\) by
$$\begin{aligned} \hat{\Psi }^{(m+1)}&=\frac{1}{\nu }\Big [\sum _{g=1}^{K}\sum _{j=1}^{n_{g}}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})^{\top }\\ &\quad -2\sum _{g=1}^{K}\sum _{j=1}^{n_{g}}\hat{Z}_{gj}^{(m)}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})\hat{\varvec{\delta }}^{(m)\top }\\ &\quad +\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{U}^{(m)}_{gj}\hat{\varvec{\delta }}^{(m)}\hat{\varvec{\delta }}^{(m)\top }\Big ], \end{aligned}$$
where \(\hat{U}^{(m)}_{gj}=\left( \hat{\gamma }^{(m)}_{gj}-2\mu _{0}\hat{\eta }^{(m)}_{gj}+\mu _{0}^{2}\right) / \sigma _{0}^{2}.\)
3. Update \(\hat{\varvec{\delta }}^{(m)}\) by
$$\begin{aligned} \hat{\varvec{\delta }}^{(m+1)}=\frac{\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{Z}_{gj}^{(m)}(\varvec{x}_{gj}-\hat{\varvec{\mu }}^{(m+1)})}{\sum _{g=1}^{K}\sum _{j=1}^{n_{g}} \hat{U}^{(m)}_{gj}}. \end{aligned}$$
Since the EM algorithm retains its stability and monotone convergence, the iterations are repeated until a suitable stopping rule is satisfied, e.g., until \(\parallel \hat{\Theta }^{(m+1)}-\hat{\Theta }^{(m)}\parallel \) is sufficiently small.
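The three M-step updates above can be sketched in code as follows. This is a minimal illustration, not the paper's implementation: it assumes \(\nu \) equals the total sample size \(n=\sum _{g} n_{g}\), stacks all \(\varvec{x}_{gj}\) into one \(n\times p\) array, and stacks the E-step quantities \(\hat{Z}_{gj}^{(m)}\) and \(\hat{U}_{gj}^{(m)}\) into length-\(n\) vectors; the function and variable names are our own.

```python
import numpy as np

def m_step(X, Z, U, delta_old):
    """One M-step, following the three update formulas in the appendix.

    X: (n, p) array stacking all observations x_gj over g and j.
    Z, U: length-n arrays of Z-hat_gj and U-hat_gj from the E-step.
    delta_old: current estimate delta^(m).
    Assumes nu = n (an assumption of this sketch).
    """
    n, p = X.shape
    nu = n
    # mu^(m+1) = (sum_gj x_gj - delta^(m) * sum_gj Z_gj) / nu
    mu_new = (X.sum(axis=0) - delta_old * Z.sum()) / nu
    R = X - mu_new  # observations centered at the updated mean
    # Psi^(m+1): sum of outer products, cross term, and delta-delta^T term
    Psi_new = (R.T @ R
               - 2.0 * np.outer(Z @ R, delta_old)
               + U.sum() * np.outer(delta_old, delta_old)) / nu
    # delta^(m+1) = sum_gj Z_gj (x_gj - mu^(m+1)) / sum_gj U_gj
    delta_new = (Z @ R) / U.sum()
    return mu_new, Psi_new, delta_new

# Tiny usage example with hand-checkable numbers
X = np.array([[1.0, 2.0], [3.0, 4.0]])
Z = np.array([1.0, 0.0])
U = np.array([2.0, 2.0])
mu, Psi, delta = m_step(X, Z, U, delta_old=np.zeros(2))
# mu -> [2, 3]; Psi -> [[1, 1], [1, 1]]; delta -> [-0.25, -0.25]
```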
Kim, HJ. A best linear threshold classification with scale mixture of skew normal populations. Comput Stat 30, 1–28 (2015). https://doi.org/10.1007/s00180-014-0517-y