Abstract
We propose a robust and sparse classification method based on the optimal scoring approach. The method is also applicable when the number of variables exceeds the number of observations. The data are first projected into a low-dimensional subspace according to an optimal scoring criterion. The projection only includes a subset of the original variables (sparse modeling) and is not distorted by outliers (robust modeling). In this low-dimensional subspace, classification is performed by minimizing a robust Mahalanobis distance to the group centers. The low-dimensional representation of the data is also useful for visualization purposes. We discuss the algorithm for the proposed method in detail. A simulation study illustrates the properties of robust and sparse classification by optimal scoring compared to non-robust and/or non-sparse alternative methods. Three real data applications are given.






References
Alfons A, Croux C, Gelper S (2013) Sparse least trimmed squares regression for analyzing high-dimensional large data sets. Ann Appl Stat 7(1):226–248
Armanino C, Leardi R, Lanteri S, Modi G (1989) Chemometric analysis of Tuscan olive oils. Chemom Intell Lab Syst 5(4):343–354
Brodinova S, Ortner T, Filzmoser P, Zaharieva M, Breiteneder C (2015) Evaluation of robust PCA for supervised audio outlier detection. In: Proceedings of the 22nd international conference on computational statistics (COMPSTAT)
Clemmensen L, Kuhn M (2012) sparseLDA: sparse discriminant analysis. R package version 0.1-6. https://CRAN.R-project.org/package=sparseLDA. Accessed 21 Oct 2015
Clemmensen L, Hastie T, Witten D, Ersbøll B (2011) Sparse discriminant analysis. Technometrics 53(4):406–413
Efron B, Hastie T, Johnstone I, Tibshirani R (2004) Least angle regression. Ann Stat 32(2):407–499
Hampel F (1974) The influence curve and its role in robust estimation. J Am Stat Assoc 69(346):383–393
Hampel F, Ronchetti E, Rousseeuw P, Stahel W (1986) Robust statistics: the approach based on influence functions. Wiley, Hoboken
Hastie T, Tibshirani R, Buja A (1994) Flexible discriminant analysis by optimal scoring. J Am Stat Assoc 89(428):1255–1270
Hastie T, Tibshirani R, Wainwright M (2015) Statistical learning with sparsity: the lasso and generalizations. CRC Press, Boca Raton
Hoffmann I, Filzmoser P, Serneels S, Varmuza K (2016) Sparse and robust PLS for binary classification. J Chemom 30(4):153–162
Hubert M, Van Driessen K (2004) Fast and robust discriminant analysis. Comput Stat Data Anal 45(2):301–320
Hubert M, Rousseeuw P, Van Aelst S (2008) High-breakdown robust multivariate methods. Stat Sci 23(1):92–119
Johnson R, Wichern D (2002) Applied multivariate statistical analysis, 5th edn. Prentice Hall, Upper Saddle River
R Core Team (2016) R: a language and environment for statistical computing. R Foundation for Statistical Computing, Vienna
Rousseeuw P, Van Driessen K (1999) A fast algorithm for the minimum covariance determinant estimator. Technometrics 41(3):212–223
Tibshirani R (2011) Regression shrinkage and selection via the lasso: a retrospective. J R Stat Soc B 73(3):273–282
Todorov V (2016) rrcovHD: robust multivariate methods for high dimensional data. R package version 0.2-4. https://CRAN.R-project.org/package=rrcovHD. Accessed 17 Feb 2016
Todorov V, Pires A (2007) Comparative performance of several robust linear discriminant analysis methods. REVSTAT Stat J 5(1):63–83
Vanden Branden K, Hubert M (2005) Robust classification in high dimensions based on the SIMCA method. Chemom Intell Lab Syst 79(1):10–21
Witten D, Tibshirani R (2011) Penalized classification using Fisher's linear discriminant. J R Stat Soc Ser B 73(5):753–772
Wolke R, Schwetlick H (1988) Iteratively reweighted least squares: algorithms, convergence analysis, and numerical comparisons. SIAM J Sci Stat Comput 9(5):907–921
Wu T, Lange K (2008) Coordinate descent algorithms for lasso penalized regression. Ann Appl Stat 2(1):224–244
Acknowledgements
This work is supported by the Austrian Science Fund (FWF), Project P 26871-N20. We would like to thank the referees for useful comments.
Additional information
Responsible editor: Jieping Ye.
Appendix
1.1 Derivation of expression (4) for the score vector estimates
Let \(\omega _1,\ldots ,\omega _n\) be the case weights of the observations, and let \(\varvec{\varOmega }\) be the diagonal matrix with these case weights on its diagonal. Then the weighted data matrices are \(\varvec{\tilde{Y}}=\varvec{\varOmega }^{1/2}\varvec{Y}\) and \(\varvec{\tilde{X}}=\varvec{\varOmega }^{1/2}\varvec{X}\). The diagonal matrix with weighted class proportions is \(\varvec{\tilde{D}}=\frac{1}{\sum \omega _i}\varvec{\tilde{Y}}^T \varvec{\tilde{Y}}\). The optimization problem (2) in step h for a given \(\varvec{\hat{\beta }}\) can be rewritten as
with \(\varvec{C}=[\hat{\varvec{\theta }}_{1},\ldots , \hat{\varvec{\theta }}_{h-1}]^T\varvec{\tilde{D}}\); we drop the dependence on the index h for ease of notation.
We use the method of Lagrange multipliers. The Lagrangian associated with Eq. (6) is given by
Setting the partial derivative with respect to \(\varvec{\theta }\) to zero gives
Hence,
To solve for the Lagrange multipliers \(\eta \) and \(\varvec{\gamma }\), the side constraints are used.
So
We conclude
Since \(\varvec{\tilde{Y}}^T\varvec{\tilde{Y}}\) is proportional to \(\varvec{\tilde{D}}\), there exists a scalar c such that
Formula (7) can be simplified to
Due to the symmetry of \(\varvec{\tilde{D}}\) and with the definition of \(\varvec{C}=\varvec{Q}^T\varvec{\tilde{D}}\) we obtain
The scalar c is then chosen so that the side constraint \(\varvec{\theta }^T\varvec{\tilde{D}}\varvec{\theta }=1\) is fulfilled.
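For orientation, the same argument can be written out compactly in the unweighted special case (all case weights \(\omega _i=1\), so that \(\varvec{\tilde{Y}}=\varvec{Y}\), \(\varvec{\tilde{X}}=\varvec{X}\) and \(\varvec{\tilde{D}}=\varvec{D}\)); this is only a sketch, with all normalizing constants absorbed into the scalar c and \(\varvec{Q}=[\hat{\varvec{\theta }}_{1},\ldots ,\hat{\varvec{\theta }}_{h-1}]\) as above:
$$\begin{aligned} L(\varvec{\theta },\eta ,\varvec{\gamma })&=\frac{1}{n}\Vert \varvec{Y}\varvec{\theta }-\varvec{X}\hat{\varvec{\beta }}\Vert ^2 + \eta \left( \varvec{\theta }^T\varvec{D}\varvec{\theta }-1\right) + \varvec{\gamma }^T\varvec{C}\varvec{\theta },\\ \frac{\partial L}{\partial \varvec{\theta }}&=\frac{2}{n}\left( \varvec{Y}^T\varvec{Y}\varvec{\theta }-\varvec{Y}^T\varvec{X}\hat{\varvec{\beta }}\right) + 2\eta \varvec{D}\varvec{\theta }+\varvec{C}^T\varvec{\gamma }=\varvec{0}. \end{aligned}$$
Using \(\varvec{Y}^T\varvec{Y}=n\varvec{D}\), solving the last equation for \(\varvec{\theta }\), and eliminating \(\varvec{\gamma }\) via \(\varvec{C}\varvec{\theta }=\varvec{0}\) with \(\varvec{C}=\varvec{Q}^T\varvec{D}\) yields
$$\begin{aligned} \hat{\varvec{\theta }}=c\left\{ \varvec{I}-\varvec{Q}(\varvec{Q}^T\varvec{D}\varvec{Q})^{-1}\varvec{Q}^T\varvec{D}\right\} \varvec{D}^{-1}\varvec{Y}^T\varvec{X}\hat{\varvec{\beta }}, \end{aligned}$$
which has the same projection structure as used in step (iii) of the initialization algorithm below.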
1.2 Algorithm for the computation of the initial estimates for \(\varvec{\beta }_h\) and \(\varvec{\theta }_h\)
Input: \(h, \varvec{Q}_h, \varvec{X}, \varvec{Y}\), \(\lambda \)
(i) Compute \(\varvec{D}=\frac{1}{n}\varvec{Y}^T\varvec{Y}\).
(ii) Generate \(\varvec{\theta }_{*}\), a random vector of length K with entries drawn from N(0, 1).
(iii) Compute \(\hat{\varvec{\theta }}_h=c\left\{ \varvec{I} - \varvec{Q}_h (\varvec{Q}_h^T \varvec{D} \varvec{Q}_h)^{-1} \varvec{Q}_h^T \varvec{D} \right\} \varvec{\theta }_{*}\), with c chosen so that \(\hat{\varvec{\theta }}_h^T\varvec{D}\hat{\varvec{\theta }}_h=1\).
Apply the following two steps twice:
1. For fixed \(\hat{\varvec{\theta }}_h\), apply sparse least trimmed squares (sparse LTS) regression (Alfons et al. 2013) with response \(\varvec{Y}\hat{\varvec{\theta }}_h\) and predictor matrix \(\varvec{X}\).
Let \(a=0.5n\) and let \(\Vert \varvec{r}\Vert ^2_{1:a}=\sum _{i=1}^a r_{(i)}^2\) denote the sum of the a smallest squared elements of the vector \(\varvec{r}\). The sparse LTS estimator is a robust version of the Lasso and is defined as
$$\begin{aligned} \min _{\varvec{\beta }}\frac{1}{a}\Vert \varvec{Y} \hat{\varvec{\theta }}_h-\varvec{X}\varvec{\beta }\Vert ^2_{1:a} + \lambda \Vert \varvec{\beta }\Vert _1. \end{aligned}$$
As in Alfons et al. (2013), a re-weighting step is carried out afterwards, yielding \(\hat{\varvec{\beta }}_h\).
2. For fixed \(\hat{\varvec{\beta }}_h\), apply least absolute deviation (LAD) regression with response \(\varvec{X}\hat{\varvec{\beta }}_h\) and predictor matrix \(\varvec{Y}\):
$$\begin{aligned} \varvec{\theta }^*=\mathop {\text{ argmin }}_{\varvec{\theta }}\Vert \varvec{Y}\varvec{\theta }-\varvec{X} \hat{\varvec{\beta }}_h\Vert _1. \end{aligned}$$
The LAD estimator is robust to outliers in the dependent variable, but not to leverage points (i.e., outliers in the covariate space). Since the covariates are dummy variables here, leverage points cannot occur. We then apply the following transformation to satisfy the side constraints:
$$\begin{aligned} \hat{\varvec{\theta }}_h=c\left\{ \varvec{I} - \varvec{Q}_h (\varvec{Q}_h^T \varvec{D} \varvec{Q}_h)^{-1} \varvec{Q}_h^T \varvec{D} \right\} \varvec{\theta }^*. \end{aligned}$$
Output: Initial estimators \(\hat{\varvec{\beta }}_h\) and \(\hat{\varvec{\theta }}_h\).
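As an illustration only, the initialization steps (i)–(iii) can be written in a few lines of base R. The function name init_theta and the assumption that \(\varvec{Q}_h\) is passed as a \(K\times (h-1)\) matrix of previously computed score vectors are ours, not part of any accompanying software; the alternating refinement in steps 1 and 2 could then, for instance, use sparseLTS() from the robustHD package (the implementation accompanying Alfons et al. 2013) and rq() from quantreg for the LAD fit — these are suggested stand-ins, not a prescription.

```r
## Illustrative sketch (not the paper's code) of initialization steps (i)-(iii).
## Assumptions: Y is the n x K dummy matrix of class memberships; Q_h is a
## K x (h-1) matrix whose columns are the previously computed score vectors
## (or a suitable starting constraint for h = 1).
init_theta <- function(Y, Q_h) {
  n <- nrow(Y)
  K <- ncol(Y)
  D <- crossprod(Y) / n                    # (i)  D = Y^T Y / n
  theta_star <- rnorm(K)                   # (ii) random N(0, 1) vector of length K
  ## (iii) project theta_star onto the D-orthogonal complement of the columns
  ##       of Q_h, then rescale so that theta^T D theta = 1
  P <- diag(K) - Q_h %*% solve(t(Q_h) %*% D %*% Q_h, t(Q_h) %*% D)
  theta <- drop(P %*% theta_star)
  theta / sqrt(drop(t(theta) %*% D %*% theta))
}

## Usage sketch of the alternation (applied twice), assuming robustHD and quantreg:
##   theta_h <- init_theta(Y, Q_h)
##   fit     <- robustHD::sparseLTS(x = X, y = drop(Y %*% theta_h), lambda = lambda)
##   beta_h  <- coef(fit)[-1]                              # drop the intercept
##   theta_h <- coef(quantreg::rq(drop(X %*% beta_h) ~ Y - 1, tau = 0.5))
##   followed by the same projection and rescaling as in init_theta().
```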
Cite this article
Ortner, I., Filzmoser, P. & Croux, C. Robust and sparse multigroup classification by the optimal scoring approach. Data Min Knowl Disc 34, 723–741 (2020). https://doi.org/10.1007/s10618-019-00666-8