Fast Simultaneous Clustering and Feature Selection for Binary Data

Laclau, Charlotte; Nadif, Mohamed

doi:10.1007/978-3-319-12571-8_17

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 8819)

Included in the following conference series:

International Symposium on Intelligent Data Analysis

Abstract

This paper addresses the problem of clustering binary data with feature selection within the context of the maximum likelihood (ML) and classification maximum likelihood (CML) approaches. To perform clustering with feature selection efficiently, we propose the use of an appropriate Bernoulli model. We derive two algorithms: Expectation-Maximization (EM) and Classification EM (CEM), both with feature selection. Without requiring prior knowledge of the number of clusters, both algorithms optimize two approximations of the minimum message length (MML) criterion. To exploit the advantages of EM for clustering and of CEM for fast convergence, we combine the two algorithms. We validate the approach with Monte Carlo simulations in which the parameters of the model are varied. We also illustrate our contribution using real datasets commonly used in document clustering.
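To make the EM/CEM distinction in the abstract concrete, here is a minimal Python sketch of a plain Bernoulli mixture fitted by either algorithm. It is an illustration under stated assumptions, not the authors' method: it omits the feature-selection component and the MML-based choice of the number of clusters, so `n_clusters` is fixed by the caller. The function name `bernoulli_mixture` and all of its parameters are hypothetical.

```python
import numpy as np

def bernoulli_mixture(X, n_clusters, n_iter=50, hard=False, seed=0):
    """Fit a plain Bernoulli mixture with EM (hard=False) or CEM (hard=True).

    Assumption: no feature selection and a fixed number of clusters,
    unlike the algorithms proposed in the paper.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    # Random soft initialisation of the cluster memberships.
    resp = rng.dirichlet(np.ones(n_clusters), size=n)
    eps = 1e-9
    for _ in range(n_iter):
        # M-step: mixing proportions and per-cluster Bernoulli parameters.
        weights = resp.sum(axis=0) + eps
        pi = weights / weights.sum()
        theta = np.clip((resp.T @ X) / weights[:, None], 1e-6, 1 - 1e-6)
        # E-step: log posterior of each cluster for each row, up to a constant.
        log_p = np.log(pi) + X @ np.log(theta).T + (1 - X) @ np.log(1 - theta).T
        log_p -= log_p.max(axis=1, keepdims=True)
        resp = np.exp(log_p)
        resp /= resp.sum(axis=1, keepdims=True)
        if hard:
            # C-step (CEM only): replace posteriors with a hard assignment.
            resp = np.eye(n_clusters)[resp.argmax(axis=1)]
    return resp.argmax(axis=1), pi, theta
```

Calling `bernoulli_mixture(X, n_clusters=2)` on a binary matrix `X` returns cluster labels, mixing proportions, and per-cluster Bernoulli parameters. Setting `hard=True` switches to the classification variant, which typically needs fewer iterations to stabilise but can empty a cluster when the initialisation is poor.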

Author information

Authors and Affiliations

Université Paris Descartes, LIPADE, Paris, France
Charlotte Laclau & Mohamed Nadif
Imagine Lab, University of Ottawa, Ottawa, Canada
Charlotte Laclau

Editor information

Editors and Affiliations

Department of Computer Science, KU Leuven, 3001, Heverlee, Belgium
Hendrik Blockeel & Matthijs van Leeuwen
Brunel University, UB8 3PH, Uxbridge, UK
Veronica Vinciotti

About this paper

Cite this paper

Laclau, C., Nadif, M. (2014). Fast Simultaneous Clustering and Feature Selection for Binary Data. In: Blockeel, H., van Leeuwen, M., Vinciotti, V. (eds) Advances in Intelligent Data Analysis XIII. IDA 2014. Lecture Notes in Computer Science, vol 8819. Springer, Cham. https://doi.org/10.1007/978-3-319-12571-8_17

DOI: https://doi.org/10.1007/978-3-319-12571-8_17
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-12570-1
Online ISBN: 978-3-319-12571-8
eBook Packages: Computer Science, Computer Science (R0)
