Robust and sparse multigroup classification by the optimal scoring approach

Abstract

We propose a robust and sparse classification method based on the optimal scoring approach. It is also applicable if the number of variables exceeds the number of observations. The data are first projected into a low dimensional subspace according to an optimal scoring criterion. The projection only includes a subset of the original variables (sparse modeling) and is not distorted by outliers (robust modeling). In this low dimensional subspace, classification is performed by minimizing a robust Mahalanobis distance to the group centers. The low dimensional representation of the data is also useful for visualization purposes. We discuss the algorithm for the proposed method in detail. A simulation study illustrates the properties of robust and sparse classification by optimal scoring in comparison with non-robust and/or non-sparse alternative methods. Three real data applications are given.



References

  • Alfons A, Croux C, Gelper S (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 7(1):226–248

  • Armanino C, Leardi R, Lanteri S, Modi G (1989) Chemometric analysis of Tuscan olive oils. Chemom Intell Lab Syst 5(4):343–354

  • Brodinova S, Ortner T, Filzmoser P, Zaharieva M, Breiteneder C (2015) Evaluation of robust PCA for supervised audio outlier detection. In: Proceedings of the 22nd international conference on computational statistics (COMPSTAT)

  • Clemmensen L, Kuhn M (2012) sparseLDA: sparse discriminant analysis. R package version 0.1-6. https://CRAN.R-project.org/package=sparseLDA. Accessed 21 Oct 2015

  • Clemmensen L, Hastie T, Witten D, Ersbøll B (2012) Sparse discriminant analysis. Technometrics 53(4):406–413

  • Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499

  • Hampel F (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69(346):383–393

  • Hampel F, Ronchetti E, Rousseeuw P, Stahel W (1986) Robust statistics: the approach based on influence functions. Wiley, Hoboken

  • Hastie T, Tibshirani R, Buja A (1994) Flexible discriminant analysis by optimal scoring. J Am Stat Assoc 89(428):1255–1270

  • Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the lasso and generalizations. CRC Press, Boca Raton

  • Hoffmann I, Filzmoser P, Serneels S, Varmuza K (2016) Sparse and robust PLS for binary classification. J Chemom 30(4):153–162

  • Hubert M, Van Driessen K (2004) Fast and robust discriminant analysis. Comput Stat Data Anal 45(2):301–320

  • Hubert M, Rousseeuw P, Van Aelst S (2008) High-breakdown robust multivariate methods. Stat Sci 23(1):92–119

  • Johnson R, Wichern D (2002) Applied multivariate statistical analysis, 5th edn. Prentice Hall, Upper Saddle River

  • R Core Team (2016) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  • Rousseeuw P, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223

  • Tibshirani R (2011) Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc B 73(3):273–282

  • Todorov V (2016) rrcovHD: robust multivariate methods for high dimensional data. R package version 0.2-4. https://CRAN.R-project.org/package=rrcovHD. Accessed 17 Feb 2016

  • Todorov V, Pires A (2007) Comparative performance of several robust linear discriminant analysis methods. REVSTAT Stat J 5(1):63–83

  • Vanden Branden K, Hubert M (2005) Robust classification in high dimensions based on the SIMCA method. Chemom Intell Lab Syst 79(1):10–21

  • Witten D, Tibshirani R (2011) Penalized classification using Fisher's linear discriminant. J R Stat Soc Ser B (Stat Methodol) 73(5):753–772

  • Wolke R, Schwetlick H (1988) Iteratively reweighted least squares: algorithms, convergence analysis, and numerical comparisons. SIAM J Sci Stat Comput 9(5):907–921

  • Wu T, Lange K (2008) Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2(1):224–244


Acknowledgements

This work is supported by the Austrian Science Fund (FWF), Project P 26871-N20. We would like to thank the referees for useful comments.

Author information

Corresponding author

Correspondence to Irene Ortner.

Additional information

Responsible editor: Jieping Ye.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Derivation of expression (4) for the score vector estimates

Let \(\omega _1,\ldots ,\omega _n\) be the case weights of the observations, and let \(\varvec{\varOmega }\) be the diagonal matrix with these case weights on the diagonal. Then the weighted data matrices are \(\varvec{\tilde{Y}}=\varvec{\varOmega }^{1/2}\varvec{Y}\) and \(\varvec{\tilde{X}}=\varvec{\varOmega }^{1/2}\varvec{X}\). The diagonal matrix with weighted class proportions is \(\varvec{\tilde{D}}=\frac{1}{\sum \omega _i}\varvec{\tilde{Y}}^T \varvec{\tilde{Y}}\). The optimization problem (2) in step h for a given \(\varvec{\hat{\beta }}\) can be rewritten as

$$\begin{aligned} \min _{\varvec{\theta }}\Vert \varvec{\tilde{X}}\varvec{\hat{\beta }} - \varvec{\tilde{Y}}\varvec{\theta } \Vert ^2 \quad \text {s.t.} \quad \varvec{\theta }^T\varvec{\tilde{D}}\varvec{\theta }=1 \text { and } \varvec{C}\varvec{\theta }=\varvec{0} \in \mathbb {R}^h \end{aligned}$$
(6)

with \(\varvec{C}=[\hat{\varvec{\theta }}_{1},\ldots , \hat{\varvec{\theta }}_{h-1}]^T\varvec{\tilde{D}}\), where we drop the dependence on the index h for ease of notation.
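
For concreteness, the weighted quantities above can be formed as in the following R sketch; this is only an illustration, and the object names (X, Y, and w for the case weights) are not part of the authors' implementation.

```r
## Illustrative construction of the weighted matrices used in the appendix:
## Xtil = Omega^{1/2} X,  Ytil = Omega^{1/2} Y,  Dtil = (1 / sum(w)) Ytil^T Ytil
weighted_matrices <- function(X, Y, w) {
  sw   <- sqrt(w)                    # square roots of the case weights
  Xtil <- sw * X                     # scales the i-th row of X by sqrt(w_i)
  Ytil <- sw * Y                     # scales the i-th row of Y by sqrt(w_i)
  Dtil <- crossprod(Ytil) / sum(w)   # weighted class proportions on the diagonal
  list(Xtil = Xtil, Ytil = Ytil, Dtil = Dtil)
}
```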

We use the method of Lagrange multipliers. The Lagrangian associated with Eq. (6) is given by

$$\begin{aligned} L=(\varvec{\tilde{X}}\varvec{\hat{\beta }} - \varvec{\tilde{Y}}\varvec{\theta })^T (\varvec{\tilde{X}}\varvec{\hat{\beta }} - \varvec{\tilde{Y}}\varvec{\theta }) - \eta (\varvec{\theta }^T\varvec{\tilde{D}}\varvec{\theta } - 1) - 2\varvec{\gamma }^T\varvec{C}\varvec{\theta }. \end{aligned}$$

Setting the partial derivative with respect to \(\varvec{\theta }\) equal to zero gives

$$\begin{aligned} \frac{\partial L}{\partial \varvec{\theta }}=-2\varvec{\tilde{Y}}^T(\varvec{\tilde{X}} \varvec{\hat{\beta }}-\varvec{\tilde{Y}}\varvec{\theta }) -2\eta \varvec{\tilde{D}}\varvec{\theta } -2\varvec{C}^T\varvec{\gamma }=\varvec{0}. \end{aligned}$$

Hence,

$$\begin{aligned} \varvec{\theta }=(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}} -\eta \varvec{\tilde{D}})^{-1}(\varvec{\tilde{Y}}^T \varvec{\tilde{X}}\varvec{\hat{\beta }} +\varvec{C}^T\varvec{\gamma }). \end{aligned}$$

To solve for the Lagrange multipliers \(\eta \) and \(\varvec{\gamma }\), the side constraints are used.

$$\begin{aligned} 0=\varvec{C}\varvec{\theta }= \varvec{C}(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}-\eta \varvec{\tilde{D}})^{-1} \varvec{\tilde{Y}}^T\varvec{\tilde{X}}\varvec{\hat{\beta }}+ \varvec{C}(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}-\eta \varvec{\tilde{D}})^{-1} \varvec{C}^T\varvec{\gamma } \end{aligned}$$

So

$$\begin{aligned} \varvec{\gamma }=-\left( \varvec{C} (\varvec{\tilde{Y}}^T \varvec{\tilde{Y}}-\eta \varvec{\tilde{D}})^{-1} \varvec{C}^T \right) ^{-1} \varvec{C} (\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}- \eta \varvec{\tilde{D}})^{-1} \varvec{\tilde{Y}}^T \varvec{\tilde{X}}\varvec{\hat{\beta }}. \end{aligned}$$

We conclude

$$\begin{aligned} \varvec{\theta }= & {} (\varvec{\tilde{Y}}^T \varvec{\tilde{Y}} -\eta \varvec{\tilde{D}})^{-1}\nonumber \\&\left\{ \varvec{I} - \varvec{C}^T(\varvec{C}(\varvec{\tilde{Y}}^T \varvec{\tilde{Y}}- \eta \varvec{\tilde{D}})^{-1}\varvec{C}^T)^{-1} \varvec{C} (\varvec{\tilde{Y}}^T\varvec{\tilde{Y}} - \eta \varvec{\tilde{D}})^{-1} \right\} (\varvec{\tilde{Y}}^T \varvec{\tilde{X}}\varvec{\hat{\beta }}). \end{aligned}$$
(7)

Since \(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}=(\sum _i \omega _i)\,\varvec{\tilde{D}}\) is proportional to \(\varvec{\tilde{D}}\), there exists a scalar c, namely \(c=1/(\sum _i \omega _i-\eta )\), such that

$$\begin{aligned} (\varvec{\tilde{Y}}^T\varvec{\tilde{Y}} - \eta \varvec{\tilde{D}})^{-1}=c\varvec{\tilde{D}}^{-1}. \end{aligned}$$

Formula (7) can be simplified to

$$\begin{aligned} \varvec{\theta }=c\left\{ \varvec{I} - \varvec{\tilde{D}}^{-1} \varvec{C}^T (\varvec{C} \varvec{\tilde{D}}^{-1} \varvec{C}^T)^{-1} \varvec{C}\right\} \varvec{\tilde{D}}^{-1} \varvec{\tilde{Y}}^T \varvec{\tilde{X}} \varvec{\hat{\beta }}. \end{aligned}$$

Due to the symmetry of \(\varvec{\tilde{D}}\) and with the definition \(\varvec{C}=\varvec{Q}^T\varvec{\tilde{D}}\), where \(\varvec{Q}=[\hat{\varvec{\theta }}_{1},\ldots , \hat{\varvec{\theta }}_{h-1}]\), we obtain

$$\begin{aligned} \varvec{\theta }=c\left\{ \varvec{I} - \varvec{Q} (\varvec{Q}^T \varvec{\tilde{D}} \varvec{Q})^{-1} \varvec{Q}^T \varvec{\tilde{D}} \right\} \varvec{\tilde{D}}^{-1} \varvec{\tilde{Y}}^T \varvec{\tilde{X}} \varvec{\hat{\beta }}. \end{aligned}$$

The scalar c is then chosen so that the side constraint \(\varvec{\theta }^T\varvec{\tilde{D}}\varvec{\theta }=1\) is fulfilled.
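
As an illustration, this closed-form update can be written compactly in R. The sketch below uses hypothetical names (Xtil, Ytil, Dtil for the weighted matrices, beta for \(\hat{\varvec{\beta }}\), and Q for the matrix of previously found score vectors); Q = NULL corresponds to the case without constraints from earlier score vectors. It is a sketch of the formula above, not the authors' implementation.

```r
## Sketch of the score vector update
##   theta = c {I - Q (Q^T D Q)^{-1} Q^T D} D^{-1} Ytil^T Xtil beta,
## with c chosen so that theta^T D theta = 1 (all names illustrative).
update_theta <- function(Xtil, Ytil, Dtil, beta, Q = NULL) {
  v <- solve(Dtil, crossprod(Ytil, Xtil %*% beta))   # D^{-1} Ytil^T Xtil beta
  if (!is.null(Q)) {                                 # remove components of earlier score vectors
    v <- v - Q %*% solve(crossprod(Q, Dtil %*% Q), crossprod(Q, Dtil %*% v))
  }
  drop(v) / sqrt(drop(crossprod(v, Dtil %*% v)))     # rescale so that theta^T D theta = 1
}
```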

1.2 Algorithm for the computation of the initial estimates for \(\varvec{\beta }_h\) and \(\varvec{\theta }_h\)

Input: \(h, \varvec{Q}_h, \varvec{X}, \varvec{Y}\), \(\lambda \)

  (i) Compute \(\varvec{D}=\frac{1}{n}\varvec{Y}^T\varvec{Y}\).

  (ii) Generate \(\varvec{\theta }_{*}\), a random vector of length K with independent N(0, 1) entries.

  (iii) Compute \(\hat{\varvec{\theta }}_h=c\left\{ \varvec{I} - \varvec{Q}_h (\varvec{Q}_h^T \varvec{D} \varvec{Q}_h)^{-1} \varvec{Q}_h^T \varvec{D} \right\} \varvec{\theta }_{*}\), with c chosen so that \(\hat{\varvec{\theta }}_h^T\varvec{D}\hat{\varvec{\theta }}_h=1\).

Then apply the following two steps twice; an illustrative sketch of the complete procedure is given after the algorithm.

  1. For fixed \(\hat{\varvec{\theta }}_h\), apply sparse least trimmed squares (sparse LTS) regression (Alfons et al. 2013) to the response \(\varvec{Y}\hat{\varvec{\theta }}_h\) and the predictors \(\varvec{X}\).

     Let \(a=0.5n\) and let \(\Vert \varvec{r}\Vert ^2_{1:a}=\sum _{i=1}^a r_{(i)}^2\) denote the sum of the a smallest squared elements of the vector \(\varvec{r}\). The sparse LTS estimator is a robust version of the lasso and is defined as

     $$\begin{aligned} \min _{\varvec{\beta }}\frac{1}{a}\Vert \varvec{Y} \hat{\varvec{\theta }}_h-\varvec{X}\varvec{\beta }\Vert ^2_{1:a} + \lambda \Vert \varvec{\beta }\Vert _1. \end{aligned}$$

     As in Alfons et al. (2013), a re-weighting step is carried out afterwards, yielding \(\hat{\varvec{\beta }}_h\).

  2. For fixed \(\hat{\varvec{\beta }}_h\), apply least absolute deviation (LAD) regression with response \(\varvec{X}\hat{\varvec{\beta }}_h\) and predictor matrix \(\varvec{Y}\):

     $$\begin{aligned} \varvec{\theta }^*=\mathop {\text{ argmin }}_{\varvec{\theta }}\Vert \varvec{Y}\varvec{\theta }-\varvec{X} \hat{\varvec{\beta }}_h\Vert _1. \end{aligned}$$

     The LAD estimator is robust to outliers in the dependent variable, but not to leverage points (i.e. outliers in the covariate space). Since the covariates are dummy variables here, leverage points cannot occur. We then apply the transformation that enforces the side constraints:

     $$\begin{aligned} \hat{\varvec{\theta }}_h=c\left\{ \varvec{I} - \varvec{Q}_h (\varvec{Q}_h^T \varvec{D} \varvec{Q}_h)^{-1} \varvec{Q}_h^T \varvec{D} \right\} \varvec{\theta }^*. \end{aligned}$$

Output: Initial estimators \(\hat{\varvec{\beta }}_h\) and \(\hat{\varvec{\theta }}_h\).
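
To make the algorithm concrete, the following R sketch strings the steps together. It is only an illustration under stated assumptions: sparseLTS() from the robustHD package (with alpha = 0.5, matching \(a=0.5n\)) stands in for the sparse LTS fit of step 1, rq() from the quantreg package (with tau = 0.5) stands in for the LAD fit of step 2, and the projection helper implements the transformation of steps (iii) and 2. All object names are hypothetical, and details such as the handling of the intercept may differ from the authors' implementation.

```r
## Illustrative sketch of the initialization of beta_h and theta_h (hypothetical names).
library(robustHD)   # provides sparseLTS() for sparse least trimmed squares regression
library(quantreg)   # provides rq() for least absolute deviation (LAD) regression

## Impose the side constraints: project theta with respect to D away from the columns of Q
## and rescale so that theta^T D theta = 1.
project_theta <- function(theta, D, Q = NULL) {
  if (!is.null(Q)) {
    theta <- theta - Q %*% solve(crossprod(Q, D %*% Q), crossprod(Q, D %*% theta))
  }
  drop(theta) / sqrt(drop(crossprod(theta, D %*% theta)))
}

init_beta_theta <- function(X, Y, Q, lambda) {
  n <- nrow(Y); K <- ncol(Y)
  D <- crossprod(Y) / n                                # (i)   D = Y^T Y / n
  theta <- project_theta(rnorm(K), D, Q)               # (ii)-(iii) projected random start
  beta <- NULL
  for (iter in 1:2) {                                  # apply the two steps twice
    ## Step 1: sparse LTS regression of Y theta on X (re-weighted fit, alpha = 0.5)
    fit1 <- sparseLTS(X, drop(Y %*% theta), lambda = lambda, alpha = 0.5)
    beta <- coef(fit1)[-1]                             # drop the intercept term
    ## Step 2: LAD regression of X beta on the dummy matrix Y, then re-impose constraints
    fit2 <- rq(drop(X %*% beta) ~ Y - 1, tau = 0.5)
    theta <- project_theta(coef(fit2), D, Q)
  }
  list(beta = beta, theta = theta)
}
```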

About this article

Cite this article

Ortner, I., Filzmoser, P. & Croux, C. Robust and sparse multigroup classification by the optimal scoring approach. Data Min Knowl Disc 34, 723–741 (2020). https://doi.org/10.1007/s10618-019-00666-8

