
Sparse optimal discriminant clustering


Abstract

In this manuscript, we reinvestigate an existing clustering procedure, optimal discriminant clustering (ODC; Zhang and Dai in Adv Neural Inf Process Syst 23(12):2241–2249, 2009), and propose to use cross-validation to select its tuning parameter. Furthermore, because many features in high-dimensional data may be non-informative for clustering, we develop a variation of ODC, sparse optimal discriminant clustering (SODC), by adding a group-lasso type of penalty to ODC. We also demonstrate that both ODC and SODC can be used as dimension reduction tools for data visualization in cluster analysis.


References

  • Ben-David, S., Von Luxburg, U., Pal, D.: A sober look at clustering stability. 19th Annual Conference on Learning Theory (COLT 2006) 4005, 5–19 (2006)

  • Ben-Hur, A., Elisseeff, A., Guyon, I.: A stability based method for discovering structure in clustered data. Pac. Symp. Biocomput. 7, 6–17 (2002)


  • Bouveyron, C., Brunet, C.: Simultaneous model-based clustering and visualization in the Fisher discriminative subspace. Stat. Comput. 22(1), 301–324 (2012)


  • Calinski, R.B., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. Simul. Comput. 3(1), 1–27 (1974)


  • Cattell, R.B.: The scree test for the number of factors. Multivar. Behav. Res. 1(2), 245–276 (1966)


  • Chang, W.: On using principal components before separating a mixture of two multivariate normal distributions. Appl. Stat. 32(3), 267–275 (1998)


  • Clemmensen, L., Hastie, T., Witten, D.M., Ersboll, B.: Sparse discriminant analysis. Technometrics 53(4), 406–413 (2011)


  • Cohen, J.: A coefficient of agreement for nominal scales. Educ. Psychol. Meas. 20(1), 37–46 (1960)


  • De la Torre, F., Kanade, T.: Discriminative cluster analysis. In: The 23rd International Conference on Machine Learning, pp. 241–248 (2006)

  • Fang, Y., Wang, J.: Selection of the number of clusters via the bootstrap method. Comput. Stat. Data Anal. 56(3), 468–477 (2012)


  • Fowlkes, E.B., Mallows, C.L.: A method for comparing two hierarchical clusterings. J. Am. Stat. Assoc. 78(383), 553–584 (1983)


  • Friedman, J.H., Meulman, J.J.: Clustering objects on subsets of attributes (with discussion). J. R. Stat. Soc. Ser. B 66(4), 815–849 (2004)


  • Friedman, J.H., Tukey, J.W.: A projection pursuit algorithm for exploratory data analysis. IEEE Trans. Comput. C–23(9), 881–890 (1974)


  • Gnanadesikan, R.: Methods for Statistical Data Analysis of Multivariate Observations, 2nd edn. Wiley, New York (1997)


  • Hartigan, J.A.: Clustering Algorithms. Wiley, New York (1975)


  • Hastie, T., Tibshirani, R., Buja, A.: Flexible discriminant analysis by optimal scoring. J. Am. Stat. Assoc. 89, 1255–1270 (1994)


  • Hastie, T., Tibshirani, R., Friedman, J.: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer Series in Statistics, 2nd edn. Springer, New York (2009)


  • Johnson, S.C.: Hierarchical clustering schemes. Psychometrika 2, 241–254 (1967)


  • Jones, M.C., Sibson, R.: What is projection pursuit? J. R. Stat. Soc. Ser. A 150(1), 1–37 (1987)


  • Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, New York (1990)


  • Krzanowski, W.J., Lai, Y.T.: A criterion for determining the number of groups in a data set using sum-of-squares clustering. Biometrics 44(1), 23–34 (1988)


  • Lange, T., Roth, V., Braun, M., Buhmann, J.: Stability-based validation of clustering solutions. Neural Comput. 16(6), 1299–1323 (2004)


  • MacQueen, J.B.: Some methods for classification and analysis of multivariate observations. In: Proceedings of 5th Berkeley Symposium on Mathematical Statistics and Probability 1, 281–297 (1967)

  • Maugis, C., Celeux, G., Martin-Magniette, M.L.: Variable selection in model-based clustering: a general variable role modeling. Comput. Stat. Data Anal. 53(11), 3872–3882 (2009)


  • Melnykov, V., Chen, W.-C., Maitra, R.: MixSim: an R package for simulating data to study performance of clustering algorithms. J. Stat. Softw. 51(12), 1–25 (2012)


  • Ng, A., Jordan, M., Weiss, Y.: On spectral clustering: analysis and an algorithm. Adv. Neural Inf. Process. Syst. 14, 849–856 (2001)

  • Raftery, A.E., Dean, N.: Variable selection for model-based clustering. J. Am. Stat. Assoc. 101(473), 168–178 (2006)


  • Rand, W.M.: Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66(336), 846–850 (1971)


  • Rocci, R., Gattone, S.F., Vichi, M.: A new dimension reduction method: factor discriminant K-means. J. Classif. 28, 210–226 (2011)


  • Shi, J., Malik, J.: Normalized cuts and image segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 22(8), 888–905 (2000)


  • Steinley, D., Brusco, M.J.: A new variable weighting and selection procedure for K-means cluster analysis. Multivar. Behav. Res. 43(1), 77–108 (2008)


  • Sugar, C., James, G.: Finding the number of clusters in a data set: an information theoretic approach. J. Am. Stat. Assoc. 98(463), 750–763 (2003)


  • Sun, L., Ji, S., Ye, J.: A least squares formulation for canonical correlation analysis. In: The 25th International Conference Machine Learning, pp. 1024–1031 (2008)

  • Sun, W., Wang, J., Fang, Y.: Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electron. J. Stat. 6, 148–167 (2012)


  • Sun, W., Wang, J., Fang, Y.: Consistent selection of tuning parameters via variable selection stability. J. Mach. Learn. Res. 14, 3419–3440 (2013)


  • Tibshirani, R., Walther, G., Hastie, T.: Estimating the number of clusters in a data set via the gap statistic. J. R. Stat. Soc. Ser. B 63(2), 411–423 (2001)


  • Tyler, D.E., Critchley, F., Dümbgen, L., Oja, H.: Invariant co-ordinate selection (with discussion). J. R. Stat. Soc. Ser. B 71(3), 549–592 (2009)


  • Wang, J.: Consistent selection of the number of clusters via crossvalidation. Biometrika 97(4), 893–904 (2010)


  • Witten, D.M., Tibshirani, R.: A framework for feature selection in clustering. J. Am. Stat. Assoc. 105(490), 713–726 (2010)


  • Yuan, M., Lin, Y.: Model selection and estimation in regression with grouped variables. J. R. Stat. Soc. Ser. B 68(1), 49–67 (2006)


  • Zhang, Z., Dai, G.: Optimal scoring for unsupervised learning. Adv. Neural Inf. Process. Syst. 23(12), 2241–2249 (2009)


  • Zou, H., Hastie, T.: Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B 67(2), 301–320 (2005)


  • Zou, H., Hastie, T., Tibshirani, R.: Sparse principal component analysis. J. Comput. Graph. Stat. 15(2), 265–286 (2006)



Author information

Corresponding author

Correspondence to Yanhong Wang.

Appendix

Given \(\mathbf{Y}\), the sub-gradient equations of (3) with respect to \(W\) are

$$\begin{aligned}&- 2\widetilde{\mathbf{X}}_j^{'} \left( \mathbf{Y}- \sum _{l=1}^p \widetilde{\mathbf{X}}_l w_l\right) + 2\lambda _2 w_{j}\\&\quad + \lambda _1\frac{w_{j}}{\Vert w_{j}\Vert _2} = 0, \quad j=1, \ldots , p. \end{aligned}$$
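Equation (3) itself is not reproduced in this excerpt; for orientation only, a criterion consistent with the sub-gradient above (our reconstruction, not quoted from the paper) is

$$\begin{aligned} \min _{W}\; \Vert \mathbf{Y}-\widetilde{\mathbf{X}}W\Vert _F^2 + \lambda _2 \Vert W\Vert _F^2 + \lambda _1 \sum _{j=1}^p \Vert w_{j}\Vert _2, \end{aligned}$$

where \(w_j\) denotes the \(j\)-th row of \(W\).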

Let the minimizer of (3) be \(\widehat{W}=\left( \widehat{w}_1, \ldots , \widehat{w}_p\right) '\). If

$$\begin{aligned} \left\| \widetilde{\mathbf{X}}_j^{'}\left( \mathbf{Y}- \sum _{l\ne j} \widetilde{\mathbf{X}}_l\widehat{w}_l\right) \right\| _2< \frac{\lambda _1}{2}, \end{aligned}$$

then \(\widehat{w}_j=0\). (Therefore, \(\widehat{w}_j=0\) for every \(j\) if \(\lambda _1>\lambda ^{\max }_1=2\max _{j}\Vert \widetilde{\mathbf{X}}_j^{'}\mathbf{Y}\Vert _2\).) Otherwise,

$$\begin{aligned} \widehat{w}_j= \left( \widetilde{\mathbf{X}}_j^{'} \widetilde{\mathbf{X}}_j+\lambda _2+\frac{\lambda _1 }{2\Vert \widehat{w}_{j} \Vert _2}\right) ^{-1}V_j, \end{aligned}$$

where \(V_j=\widetilde{\mathbf{X}}_j^{'}(\mathbf{Y}- \sum _{l\ne j} \widetilde{\mathbf{X}}_l\widehat{w}_l)\). Note that \(\widetilde{\mathbf{X}}_j^{'} \widetilde{\mathbf{X}}_j\) is a diagonal matrix whose diagonal entries are the sample variances of the features. If the design matrix is standardized at the outset, then \(\widetilde{\mathbf{X}}_j^{'} \widetilde{\mathbf{X}}_j=I_{k-1}\) and the above equation becomes

$$\begin{aligned} \widehat{w}_j = \left( \frac{2\Vert \widehat{w}_{j}\Vert _{2}}{\lambda _1+ 2(1+\lambda _2)\Vert \widehat{w}_{j}\Vert _{2}}\right) V_{j}. \end{aligned}$$

Taking Euclidean norms on both sides gives \(\Vert \widehat{w}_{j}\Vert _{2}=\frac{2\Vert V_{j}\Vert _2-\lambda _1}{2(1+\lambda _2)}\). Plugging this norm into the above formula for \(\widehat{w}_j\), we obtain the expression for \(\widehat{w}_j\) stated in the theorem.
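To make the blockwise update above concrete, the following is a minimal NumPy sketch of the resulting coordinate-descent step, assuming the columns of \(\widetilde{\mathbf{X}}\) have been standardized and the score matrix \(\mathbf{Y}\) is held fixed; the function name, iteration cap, and convergence rule are illustrative choices, not part of the paper.

```python
import numpy as np

def update_W(X, Y, lam1, lam2, n_iter=100, tol=1e-6):
    """Blockwise coordinate descent for W with the score matrix Y held fixed.

    Uses the closed-form per-feature update derived in the appendix:
    w_j = 0 if ||V_j||_2 < lam1/2, otherwise
    w_j = (2 ||V_j||_2 - lam1) / (2 (1 + lam2) ||V_j||_2) * V_j.
    Assumes the columns of X are standardized (X_j' X_j = 1).
    """
    n, p = X.shape
    q = Y.shape[1]                    # k - 1 scoring directions
    W = np.zeros((p, q))
    R = Y - X @ W                     # residual Y - X W
    for _ in range(n_iter):
        max_change = 0.0
        for j in range(p):
            xj = X[:, j]
            # V_j = X_j'(Y - sum_{l != j} X_l w_l) = X_j'(R + X_j w_j)
            Vj = xj @ (R + np.outer(xj, W[j]))
            vnorm = np.linalg.norm(Vj)
            if vnorm < lam1 / 2.0:    # group soft-thresholding drops feature j
                w_new = np.zeros(q)
            else:
                w_new = (2.0 * vnorm - lam1) / (2.0 * (1.0 + lam2) * vnorm) * Vj
            R += np.outer(xj, W[j] - w_new)   # keep the residual in sync
            max_change = max(max_change, np.max(np.abs(w_new - W[j])))
            W[j] = w_new
        if max_change < tol:
            break
    return W
```

A zero row of the returned \(W\) corresponds to a feature excluded from the discriminant directions, which is how the group-lasso penalty yields feature selection in SODC.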


About this article


Cite this article

Wang, Y., Fang, Y. & Wang, J. Sparse optimal discriminant clustering. Stat Comput 26, 629–639 (2016). https://doi.org/10.1007/s11222-015-9547-8

