Robust and sparse multigroup classification by the optimal scoring approach

Abstract

We propose a robust and sparse classification method based on the optimal scoring approach. It is also applicable if the number of variables exceeds the number of observations. The data are first projected into a low dimensional subspace according to an optimal scoring criterion. The projection only includes a subset of the original variables (sparse modeling) and is not distorted by outliers (robust modeling). In this low dimensional subspace, classification is performed by minimizing a robust Mahalanobis distance to the group centers. The low dimensional representation of the data is also useful for visualization purposes. We discuss the algorithm for the proposed method in detail. A simulation study illustrates the properties of robust and sparse classification by optimal scoring in comparison with non-robust and/or non-sparse alternative methods. Three real data applications are given.



References

  • Alfons A, Croux C, Gelper S (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 7(1):226–248

  • Armanino C, Leardi R, Lanteri S, Modi G (1989) Chemometric analysis of Tuscan olive oils. Chemom Intell Lab Syst 5(4):343–354

  • Brodinova S, Ortner T, Filzmoser P, Zaharieva M, Breiteneder C (2015) Evaluation of robust PCA for supervised audio outlier detection. In: Proceedings of the 22nd international conference on computational statistics (COMPSTAT)

  • Clemmensen L, Kuhn M (2012) sparseLDA: sparse discriminant analysis. R package version 0.1-6. https://CRAN.R-project.org/package=sparseLDA. Accessed 21 Oct 2015

  • Clemmensen L, Hastie T, Witten D, Ersbøll B (2012) Sparse discriminant analysis. Technometrics 53(4):406–413

  • Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499

  • Hampel F (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69(346):383–393

  • Hampel F, Ronchetti E, Rousseeuw P, Stahel W (1986) Robust statistics: the approach based on influence functions. Wiley, Hoboken

  • Hastie T, Tibshirani R, Buja A (1994) Flexible discriminant analysis by optimal scoring. J Am Stat Assoc 89(428):1255–1270

  • Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the lasso and generalizations. CRC Press, Boca Raton

  • Hoffmann I, Filzmoser P, Serneels S, Varmuza K (2016) Sparse and robust PLS for binary classification. J Chemom 30(4):153–162

  • Hubert M, Van Driessen K (2004) Fast and robust discriminant analysis. Comput Stat Data Anal 45(2):301–320

  • Hubert M, Rousseeuw P, Van Aelst S (2008) High-breakdown robust multivariate methods. Stat Sci 23(1):92–119

  • Johnson R, Wichern D (2002) Applied multivariate statistical analysis, 5th edn. Prentice Hall, Upper Saddle River

  • R Core Team (2016) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna

  • Rousseeuw P, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223

  • Tibshirani R (2011) Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc B 73(3):273–282

  • Todorov V (2016) rrcovHD: robust multivariate methods for high dimensional data. R package version 0.2-4. https://CRAN.R-project.org/package=rrcovHD. Accessed 17 Feb 2016

  • Todorov V, Pires A (2007) Comparative performance of several robust linear discriminant analysis methods. REVSTAT Stat J 5(1):63–83

  • Vanden Branden K, Hubert M (2005) Robust classification in high dimensions based on the SIMCA method. Chemom Intell Lab Syst 79(1):10–21

  • Witten D, Tibshirani R (2011) Penalized classification using Fisher's linear discriminant. J R Stat Soc Ser B (Stat Methodol) 73(5):753–772

  • Wolke R, Schwetlick H (1988) Iteratively reweighted least squares: algorithms, convergence analysis, and numerical comparisons. SIAM J Sci Stat Comput 9(5):907–921

  • Wu T, Lange K (2008) Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2(1):224–244


Acknowledgements

This work is supported by the Austrian Science Fund (FWF), Project P 26871-N20. We would like to thank the referees for useful comments.

Author information

Corresponding author

Correspondence to Irene Ortner.

Additional information

Responsible editor: Jieping Ye.

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Appendix

1.1 Derivation of expression (4) for the score vector estimates

Let \(\omega _1,\ldots ,\omega _n\) be the case weights of the observations, and let \(\varvec{\varOmega }\) be the diagonal matrix with these case weights on the diagonal. Then the weighted data matrices are \(\varvec{\tilde{Y}}=\varvec{\varOmega }^{1/2}\varvec{Y}\) and \(\varvec{\tilde{X}}=\varvec{\varOmega }^{1/2}\varvec{X}\). The diagonal matrix with weighted class proportions is \(\varvec{\tilde{D}}=\frac{1}{\sum \omega _i}\varvec{\tilde{Y}}^T \varvec{\tilde{Y}}\). The optimization problem (2) in step h for a given \(\varvec{\hat{\beta }}\) can be rewritten as

$$\begin{aligned} \min _{\varvec{\theta }}\Vert \varvec{\tilde{X}}\varvec{\hat{\beta }} - \varvec{\tilde{Y}}\varvec{\theta } \Vert ^2 \quad \text {s.t.} \quad \varvec{\theta }^T\varvec{\tilde{D}}\varvec{\theta }=1 \text { and } \varvec{C}\varvec{\theta }=\varvec{0} \in \mathbb {R}^h \end{aligned}$$
(6)

with \(\varvec{C}=[\hat{\varvec{\theta }}_{1},\ldots , \hat{\varvec{\theta }}_{h-1}]^T\varvec{\tilde{D}}\), where we drop the dependence on the index h for ease of notation.
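
For concreteness, the weighted quantities above can be formed as in the following R sketch; this is only an illustration, and the object names (X, Y, and w for the case weights) are not part of the authors' implementation.

```r
## Illustrative construction of the weighted matrices used in the appendix:
## Xtil = Omega^{1/2} X,  Ytil = Omega^{1/2} Y,  Dtil = (1 / sum(w)) Ytil^T Ytil
weighted_matrices <- function(X, Y, w) {
  sw   <- sqrt(w)                    # square roots of the case weights
  Xtil <- sw * X                     # scales the i-th row of X by sqrt(w_i)
  Ytil <- sw * Y                     # scales the i-th row of Y by sqrt(w_i)
  Dtil <- crossprod(Ytil) / sum(w)   # weighted class proportions on the diagonal
  list(Xtil = Xtil, Ytil = Ytil, Dtil = Dtil)
}
```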

We use the method of Lagrange multipliers. The Lagrangian associated with Eq. (6) is given by

$$\begin{aligned} L=(\varvec{\tilde{X}}\varvec{\hat{\beta }} - \varvec{\tilde{Y}}\varvec{\theta })^T (\varvec{\tilde{X}}\varvec{\hat{\beta }} - \varvec{\tilde{Y}}\varvec{\theta }) - \eta (\varvec{\theta }^T\varvec{\tilde{D}}\varvec{\theta } - 1) - 2\varvec{\gamma }^T\varvec{C}\varvec{\theta }. \end{aligned}$$

Setting the partial derivative with respect to \(\varvec{\theta }\) equal to zero gives

$$\begin{aligned} \frac{\partial L}{\partial \varvec{\theta }}=-2\varvec{\tilde{Y}}^T(\varvec{\tilde{X}} \varvec{\hat{\beta }}-\varvec{\tilde{Y}}\varvec{\theta }) -2\eta \varvec{\tilde{D}}\varvec{\theta } -2\varvec{C}^T\varvec{\gamma }=\varvec{0}. \end{aligned}$$

Hence,

$$\begin{aligned} \varvec{\theta }=(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}} -\eta \varvec{\tilde{D}})^{-1}(\varvec{\tilde{Y}}^T \varvec{\tilde{X}}\varvec{\hat{\beta }} +\varvec{C}^T\varvec{\gamma }). \end{aligned}$$

To solve for the Lagrange multipliers \(\eta \) and \(\varvec{\gamma }\), the side constraints are used.

$$\begin{aligned} 0=\varvec{C}\varvec{\theta }= \varvec{C}(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}-\eta \varvec{\tilde{D}})^{-1} \varvec{\tilde{Y}}^T\varvec{\tilde{X}}\varvec{\hat{\beta }}+ \varvec{C}(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}-\eta \varvec{\tilde{D}})^{-1} \varvec{C}^T\varvec{\gamma } \end{aligned}$$

So

$$\begin{aligned} \varvec{\gamma }=-\left( \varvec{C} (\varvec{\tilde{Y}}^T \varvec{\tilde{Y}}-\eta \varvec{\tilde{D}})^{-1} \varvec{C}^T \right) ^{-1} \varvec{C} (\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}- \eta \varvec{\tilde{D}})^{-1} \varvec{\tilde{Y}}^T \varvec{\tilde{X}}\varvec{\hat{\beta }}. \end{aligned}$$

We conclude

$$\begin{aligned} \varvec{\theta }= & {} (\varvec{\tilde{Y}}^T \varvec{\tilde{Y}} -\eta \varvec{\tilde{D}})^{-1}\nonumber \\&\left\{ \varvec{I} - \varvec{C}^T(\varvec{C}(\varvec{\tilde{Y}}^T \varvec{\tilde{Y}}- \eta \varvec{\tilde{D}})^{-1}\varvec{C}^T)^{-1} \varvec{C} (\varvec{\tilde{Y}}^T\varvec{\tilde{Y}} - \eta \varvec{\tilde{D}})^{-1} \right\} (\varvec{\tilde{Y}}^T \varvec{\tilde{X}}\varvec{\hat{\beta }}). \end{aligned}$$
(7)

Since \(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}=(\sum _i \omega _i)\,\varvec{\tilde{D}}\) is proportional to \(\varvec{\tilde{D}}\), there exists a scalar c, namely \(c=1/(\sum _i \omega _i-\eta )\), such that

$$\begin{aligned} (\varvec{\tilde{Y}}^T\varvec{\tilde{Y}} - \eta \varvec{\tilde{D}})^{-1}=c\varvec{\tilde{D}}^{-1}. \end{aligned}$$

Formula (7) can be simplified to

$$\begin{aligned} \varvec{\theta }=c\left\{ \varvec{I} - \varvec{\tilde{D}}^{-1} \varvec{C}^T (\varvec{C} \varvec{\tilde{D}}^{-1} \varvec{C}^T)^{-1} \varvec{C}\right\} \varvec{\tilde{D}}^{-1} \varvec{\tilde{Y}}^T \varvec{\tilde{X}} \varvec{\hat{\beta }}. \end{aligned}$$

Due to the symmetry of \(\varvec{\tilde{D}}\) and with the definition \(\varvec{C}=\varvec{Q}^T\varvec{\tilde{D}}\), where \(\varvec{Q}=[\hat{\varvec{\theta }}_{1},\ldots , \hat{\varvec{\theta }}_{h-1}]\), we obtain

$$\begin{aligned} \varvec{\theta }=c\left\{ \varvec{I} - \varvec{Q} (\varvec{Q}^T \varvec{\tilde{D}} \varvec{Q})^{-1} \varvec{Q}^T \varvec{\tilde{D}} \right\} \varvec{\tilde{D}}^{-1} \varvec{\tilde{Y}}^T \varvec{\tilde{X}} \varvec{\hat{\beta }}. \end{aligned}$$

The scalar c is then chosen so that the side constraint \(\varvec{\theta }^T\varvec{\tilde{D}}\varvec{\theta }=1\) is fulfilled.
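
As an illustration, this closed-form update can be written compactly in R. The sketch below uses hypothetical names (Xtil, Ytil, Dtil for the weighted matrices, beta for \(\hat{\varvec{\beta }}\), and Q for the matrix of previously found score vectors); Q = NULL corresponds to the case without constraints from earlier score vectors. It is a sketch of the formula above, not the authors' implementation.

```r
## Sketch of the score vector update
##   theta = c {I - Q (Q^T D Q)^{-1} Q^T D} D^{-1} Ytil^T Xtil beta,
## with c chosen so that theta^T D theta = 1 (all names illustrative).
update_theta <- function(Xtil, Ytil, Dtil, beta, Q = NULL) {
  v <- solve(Dtil, crossprod(Ytil, Xtil %*% beta))   # D^{-1} Ytil^T Xtil beta
  if (!is.null(Q)) {                                 # remove components of earlier score vectors
    v <- v - Q %*% solve(crossprod(Q, Dtil %*% Q), crossprod(Q, Dtil %*% v))
  }
  drop(v) / sqrt(drop(crossprod(v, Dtil %*% v)))     # rescale so that theta^T D theta = 1
}
```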

1.2 Algorithm for the computation of the initial estimates for \(\varvec{\beta }_h\) and \(\varvec{\theta }_h\)

Input: \(h, \varvec{Q}_h, \varvec{X}, \varvec{Y}\), \(\lambda \)

  (i) Compute \(\varvec{D}=\frac{1}{n}\varvec{Y}^T\varvec{Y}\).

  (ii) Generate \(\varvec{\theta }_{*}\), a random vector of length K with independent N(0, 1) entries.

  (iii) Compute \(\hat{\varvec{\theta }}_h=c\left\{ \varvec{I} - \varvec{Q}_h (\varvec{Q}_h^T \varvec{D} \varvec{Q}_h)^{-1} \varvec{Q}_h^T \varvec{D} \right\} \varvec{\theta }_{*}\), with c chosen so that \(\hat{\varvec{\theta }}_h^T\varvec{D}\hat{\varvec{\theta }}_h=1\).

Then apply the following two steps twice; an illustrative sketch of the complete procedure is given after the algorithm.

  1. For fixed \(\hat{\varvec{\theta }}_h\), apply sparse least trimmed squares (sparse LTS) regression (Alfons et al. 2013) to the response \(\varvec{Y}\hat{\varvec{\theta }}_h\) and the predictors \(\varvec{X}\).

     Let \(a=0.5n\) and let \(\Vert \varvec{r}\Vert ^2_{1:a}=\sum _{i=1}^a r_{(i)}^2\) denote the sum of the a smallest squared elements of the vector \(\varvec{r}\). The sparse LTS estimator is a robust version of the lasso and is defined as

     $$\begin{aligned} \min _{\varvec{\beta }}\frac{1}{a}\Vert \varvec{Y} \hat{\varvec{\theta }}_h-\varvec{X}\varvec{\beta }\Vert ^2_{1:a} + \lambda \Vert \varvec{\beta }\Vert _1. \end{aligned}$$

     As in Alfons et al. (2013), a re-weighting step is carried out afterwards, yielding \(\hat{\varvec{\beta }}_h\).

  2. For fixed \(\hat{\varvec{\beta }}_h\), apply least absolute deviation (LAD) regression with response \(\varvec{X}\hat{\varvec{\beta }}_h\) and predictor matrix \(\varvec{Y}\):

     $$\begin{aligned} \varvec{\theta }^*=\mathop {\text{ argmin }}_{\varvec{\theta }}\Vert \varvec{Y}\varvec{\theta }-\varvec{X} \hat{\varvec{\beta }}_h\Vert _1. \end{aligned}$$

     The LAD estimator is robust to outliers in the dependent variable, but not to leverage points (i.e. outliers in the covariate space). Since the covariates are dummy variables here, leverage points cannot occur. We then apply the transformation that enforces the side constraints:

     $$\begin{aligned} \hat{\varvec{\theta }}_h=c\left\{ \varvec{I} - \varvec{Q}_h (\varvec{Q}_h^T \varvec{D} \varvec{Q}_h)^{-1} \varvec{Q}_h^T \varvec{D} \right\} \varvec{\theta }^*. \end{aligned}$$

Output: Initial estimators \(\hat{\varvec{\beta }}_h\) and \(\hat{\varvec{\theta }}_h\).
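
To make the algorithm concrete, the following R sketch strings the steps together. It is only an illustration under stated assumptions: sparseLTS() from the robustHD package (with alpha = 0.5, matching \(a=0.5n\)) stands in for the sparse LTS fit of step 1, rq() from the quantreg package (with tau = 0.5) stands in for the LAD fit of step 2, and the projection helper implements the transformation of steps (iii) and 2. All object names are hypothetical, and details such as the handling of the intercept may differ from the authors' implementation.

```r
## Illustrative sketch of the initialization of beta_h and theta_h (hypothetical names).
library(robustHD)   # provides sparseLTS() for sparse least trimmed squares regression
library(quantreg)   # provides rq() for least absolute deviation (LAD) regression

## Impose the side constraints: project theta with respect to D away from the columns of Q
## and rescale so that theta^T D theta = 1.
project_theta <- function(theta, D, Q = NULL) {
  if (!is.null(Q)) {
    theta <- theta - Q %*% solve(crossprod(Q, D %*% Q), crossprod(Q, D %*% theta))
  }
  drop(theta) / sqrt(drop(crossprod(theta, D %*% theta)))
}

init_beta_theta <- function(X, Y, Q, lambda) {
  n <- nrow(Y); K <- ncol(Y)
  D <- crossprod(Y) / n                                # (i)   D = Y^T Y / n
  theta <- project_theta(rnorm(K), D, Q)               # (ii)-(iii) projected random start
  beta <- NULL
  for (iter in 1:2) {                                  # apply the two steps twice
    ## Step 1: sparse LTS regression of Y theta on X (re-weighted fit, alpha = 0.5)
    fit1 <- sparseLTS(X, drop(Y %*% theta), lambda = lambda, alpha = 0.5)
    beta <- coef(fit1)[-1]                             # drop the intercept term
    ## Step 2: LAD regression of X beta on the dummy matrix Y, then re-impose constraints
    fit2 <- rq(drop(X %*% beta) ~ Y - 1, tau = 0.5)
    theta <- project_theta(coef(fit2), D, Q)
  }
  list(beta = beta, theta = theta)
}
```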

About this article

Cite this article

Ortner, I., Filzmoser, P. & Croux, C. Robust and sparse multigroup classification by the optimal scoring approach. Data Min Knowl Disc 34, 723–741 (2020). https://doi.org/10.1007/s10618-019-00666-8

