MML-Based Approach for Determining the Number of Topics in EDCM Mixture Models

  • Conference paper
Advances in Artificial Intelligence (Canadian AI 2018)

Part of the book series: Lecture Notes in Computer Science ((LNAI,volume 10832))

Abstract

This paper proposes an unsupervised algorithm for learning a finite mixture model of the exponential-family approximation to the Dirichlet Compound Multinomial (EDCM). An important part of the mixture modeling problem is determining the number of components that best describes the data. In this work, we extend the Minimum Message Length (MML) principle to determine the number of topics (clusters) when modeling text with a mixture of EDCMs. Parameter estimation is based on the previously proposed deterministic annealing expectation-maximization approach. The proposed method is validated on several document collections and compared with results obtained using other selection criteria.
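To make the model in the abstract concrete, the sketch below evaluates the EDCM log-density and the log-likelihood of a finite EDCM mixture. It assumes the EDCM form given by Elkan (2006), q(x) = n! Γ(s)/Γ(s+n) ∏_{w: x_w ≥ 1} β_w / x_w, with n the document length and s the sum of the β_w. The function names `edcm_log_pdf` and `mixture_log_likelihood` are illustrative and do not come from the paper, and the paper's MML criterion and deterministic annealing EM updates are not reproduced here.

```python
import math


def edcm_log_pdf(counts, beta):
    """Log EDCM density of one word-count vector (form assumed from Elkan, 2006)."""
    n = sum(counts)   # document length
    s = sum(beta)     # sum of the EDCM parameters
    # log n! + log Gamma(s) - log Gamma(s + n)
    log_p = math.lgamma(n + 1) + math.lgamma(s) - math.lgamma(s + n)
    # Only words that occur at least once contribute log(beta_w / x_w).
    for x_w, b_w in zip(counts, beta):
        if x_w >= 1:
            log_p += math.log(b_w) - math.log(x_w)
    return log_p


def mixture_log_likelihood(docs, weights, betas):
    """Log-likelihood of documents under a finite mixture of EDCM components."""
    total = 0.0
    for counts in docs:
        # log(pi_j) + log q(x | beta_j) for every component j
        terms = [math.log(pi) + edcm_log_pdf(counts, b)
                 for pi, b in zip(weights, betas)]
        # log-sum-exp for numerical stability
        m = max(terms)
        total += m + math.log(sum(math.exp(t - m) for t in terms))
    return total
```

A model selection criterion such as MML would typically be evaluated on top of a likelihood like this one for several candidate numbers of components, keeping the candidate with the best score.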

Notes

  1. http://www.datalab.uci.edu/author-topic/NIPs.htm

  2. http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html

  3. http://www.cs.cmu.edu/~webkb/

Author information

Corresponding author

Correspondence to Nuha Zamzami.

Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper

Cite this paper

Zamzami, N., Bouguila, N. (2018). MML-Based Approach for Determining the Number of Topics in EDCM Mixture Models. In: Bagheri, E., Cheung, J. (eds) Advances in Artificial Intelligence. Canadian AI 2018. Lecture Notes in Computer Science, vol 10832. Springer, Cham. https://doi.org/10.1007/978-3-319-89656-4_17

  • DOI: https://doi.org/10.1007/978-3-319-89656-4_17

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-89655-7

  • Online ISBN: 978-3-319-89656-4

  • eBook Packages: Computer Science (R0)
