MML-Based Approach for Determining the Number of Topics in EDCM Mixture Models

Zamzami, Nuha; Bouguila, Nizar

doi:10.1007/978-3-319-89656-4_17

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 10832)

Included in the following conference series:

Canadian Conference on Artificial Intelligence


Abstract

This paper proposes an unsupervised algorithm for learning a finite mixture model of the exponential-family approximation to the Dirichlet Compound Multinomial (EDCM). An important part of the mixture modeling problem is determining the number of components that best describes the data. In this work, we extend the Minimum Message Length (MML) principle to determine the number of topics (clusters) when modeling text with a mixture of EDCMs. Parameter estimation is based on the previously proposed deterministic annealing expectation-maximization approach. The proposed method is validated on several document collections and compared with results obtained using other selection criteria.
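As a rough illustration of the ingredients named in the abstract, the sketch below evaluates a mixture of EDCMs on word-count vectors and picks the number of components with the smallest score. It is a minimal sketch, not the authors' implementation: it assumes the commonly stated EDCM form q(x) = n! Γ(s)/Γ(s+n) ∏_{w: x_w ≥ 1} β_w / x_w with s = Σ β_w and n = Σ x_w, and the names `edcm_log_pmf`, `mixture_log_likelihood`, and `select_num_components` are hypothetical. The MML message length itself is not reproduced here; the selection helper simply takes precomputed scores as input.

```python
import math


def edcm_log_pmf(counts, beta):
    """Log of the (assumed) EDCM density for one word-count vector."""
    n = sum(counts)  # document length
    s = sum(beta)    # sum of the EDCM parameters
    # log n! + log Gamma(s) - log Gamma(s + n)
    log_p = math.lgamma(n + 1) + math.lgamma(s) - math.lgamma(s + n)
    # Only words that occur at least once contribute beta_w / x_w.
    for x_w, b_w in zip(counts, beta):
        if x_w >= 1:
            log_p += math.log(b_w) - math.log(x_w)
    return log_p


def mixture_log_likelihood(docs, weights, betas):
    """Log-likelihood of documents under a finite mixture of EDCMs."""
    total = 0.0
    for x in docs:
        terms = [math.log(pi) + edcm_log_pmf(x, b)
                 for pi, b in zip(weights, betas)]
        m = max(terms)  # log-sum-exp for numerical stability
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total


def select_num_components(scores):
    """Return the candidate number of components with the smallest score.

    `scores` maps a candidate M to a criterion value (for example a
    message length) computed elsewhere; lower is taken to be better.
    """
    return min(scores, key=scores.get)
```

In use, one would fit a mixture for each candidate number of components, compute the chosen criterion from the fitted parameters, and pass the resulting dictionary to `select_num_components`.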



Author information

Authors and Affiliations

Concordia Institute for Information Systems Engineering, Concordia University, Montreal, QC, Canada
Nuha Zamzami & Nizar Bouguila
Faculty of Computing and Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia
Nuha Zamzami


Corresponding author

Correspondence to Nuha Zamzami.

Editor information

Editors and Affiliations

Ryerson University, Toronto, Ontario, Canada
Ebrahim Bagheri
McGill University, Montréal, Québec, Canada
Jackie C.K. Cheung


Cite this paper

Zamzami, N., Bouguila, N. (2018). MML-Based Approach for Determining the Number of Topics in EDCM Mixture Models. In: Bagheri, E., Cheung, J. (eds) Advances in Artificial Intelligence. Canadian AI 2018. Lecture Notes in Computer Science, vol 10832. Springer, Cham. https://doi.org/10.1007/978-3-319-89656-4_17


DOI: https://doi.org/10.1007/978-3-319-89656-4_17
Published: 06 April 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-89655-7
Online ISBN: 978-3-319-89656-4
eBook Packages: Computer Science, Computer Science (R0)
