Fast Simultaneous Clustering and Feature Selection for Binary Data

  • Conference paper
Advances in Intelligent Data Analysis XIII (IDA 2014)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8819)

Abstract

This paper addresses the problem of clustering binary data with feature selection, within the maximum likelihood (ML) and classification maximum likelihood (CML) approaches. To perform clustering and feature selection efficiently, we propose an appropriate Bernoulli model and derive two algorithms with feature selection: Expectation-Maximization (EM) and Classification EM (CEM). Without requiring prior knowledge of the number of clusters, both algorithms optimize two approximations of the minimum message length (MML) criterion. To exploit the advantages of EM for clustering and of CEM for fast convergence, we combine the two algorithms. We validate the approach with Monte Carlo simulations in which the model parameters are varied, and we illustrate our contribution on real datasets commonly used in document clustering.
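
To make the difference between EM and CEM concrete, here is a minimal sketch of both for a plain mixture of independent Bernoulli distributions. It is not the authors' algorithm: it leaves out feature selection and the MML criterion, and it fixes the number of clusters `k` in advance. The function name `fit_bernoulli_mixture`, the `init_rows` parameter, the softened-row initialisation, and the `eps` clamping are all assumptions made for illustration.

```python
import math
import random


def fit_bernoulli_mixture(X, k, hard=False, init_rows=None, n_iter=30, eps=1e-6, seed=0):
    """Fit a k-component mixture of independent Bernoullis to 0/1 rows.

    hard=False runs EM (soft posteriors); hard=True runs CEM, which adds a
    classification step that turns each posterior into a 0/1 assignment.
    """
    n, d = len(X), len(X[0])
    if init_rows is None:
        init_rows = random.Random(seed).sample(range(n), k)
    # Illustrative initialisation (an assumption): a softened copy of k rows.
    theta = [[0.75 if X[i][j] else 0.25 for j in range(d)] for i in init_rows]
    pi = [1.0 / k] * k
    resp = []
    for _ in range(n_iter):
        resp = []
        for x in X:
            # E-step: log of pi_c * p(x | theta_c), normalised stably.
            logp = [
                math.log(pi[c])
                + sum(math.log(theta[c][j] if x[j] else 1.0 - theta[c][j]) for j in range(d))
                for c in range(k)
            ]
            top = max(logp)
            w = [math.exp(v - top) for v in logp]
            total = sum(w)
            w = [v / total for v in w]
            if hard:
                # C-step (CEM only): keep only the most probable cluster.
                best = w.index(max(w))
                w = [1.0 if c == best else 0.0 for c in range(k)]
            resp.append(w)
        for c in range(k):
            # M-step: weighted proportions and per-feature frequencies.
            nc = sum(r[c] for r in resp)
            pi[c] = (nc + eps) / (n + k * eps)
            for j in range(d):
                freq = sum(r[c] * x[j] for r, x in zip(resp, X)) / nc if nc > 0 else 0.5
                # Clamp away from 0 and 1 so the logs above stay finite.
                theta[c][j] = min(max(freq, eps), 1.0 - eps)
    labels = [r.index(max(r)) for r in resp]
    return pi, theta, labels
```

Calling the function with `hard=False` gives the EM variant and `hard=True` the CEM variant. The only difference is the classification step, so each CEM iteration works with a hard partition rather than soft posteriors.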

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Laclau, C., Nadif, M. (2014). Fast Simultaneous Clustering and Feature Selection for Binary Data. In: Blockeel, H., van Leeuwen, M., Vinciotti, V. (eds) Advances in Intelligent Data Analysis XIII. IDA 2014. Lecture Notes in Computer Science, vol 8819. Springer, Cham. https://doi.org/10.1007/978-3-319-12571-8_17

  • DOI: https://doi.org/10.1007/978-3-319-12571-8_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-12570-1

  • Online ISBN: 978-3-319-12571-8

  • eBook Packages: Computer Science, Computer Science (R0)