Mixture-Based Unsupervised Learning for Positively Correlated Count Data

  • Conference paper
  • First Online:
Intelligent Information and Database Systems (ACIIDS 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12672)

Abstract

The Multinomial distribution has been widely used to model count data. However, its underlying independence (Naive Bayes) assumption often degrades clustering performance, especially when features are strongly correlated, as in text documents. In this paper, we use the Negative Multinomial distribution to perform clustering based on finite mixture models, whose parameters are estimated with a novel minorization-maximization algorithm that remains effective in high-dimensional optimization settings. Furthermore, we integrate a model-based feature selection approach to determine the optimal number of components in the mixture. To evaluate the clustering performance of the proposed model, three real-world applications are considered: COVID-19 analysis, Web page clustering, and facial expression recognition.
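
To make the modelling idea concrete, the sketch below shows, in Python, how a finite Negative Multinomial mixture could score count vectors and assign them softly to components. It is an illustrative sketch only, not the authors' implementation: the parameterisation (a per-component dispersion beta and event probabilities p with p0 = 1 - sum(p)), the function names, and the toy data are assumptions made for this example, and the paper's minorization-maximization updates and model selection step are not reproduced here.

```python
# Illustrative sketch only (not the paper's algorithm): scoring count vectors
# under a finite Negative Multinomial mixture and computing soft assignments.
# Parameter names (beta, p) and the toy data below are assumptions.

import numpy as np
from scipy.special import gammaln


def neg_multinomial_logpmf(X, beta, p):
    """Log-pmf of a Negative Multinomial for each row of the count matrix X.

    X    : (n, D) array of non-negative integer counts
    beta : dispersion parameter (> 0)
    p    : (D,) event probabilities with 0 < p.sum() < 1
    """
    p0 = 1.0 - p.sum()                     # probability of the stopping event
    totals = X.sum(axis=1)
    return (gammaln(beta + totals) - gammaln(beta)
            - gammaln(X + 1.0).sum(axis=1)
            + beta * np.log(p0)
            + (X * np.log(p)).sum(axis=1))


def responsibilities(X, weights, betas, ps):
    """Posterior probability that each row of X belongs to each mixture component."""
    log_r = np.stack([np.log(w) + neg_multinomial_logpmf(X, b, p)
                      for w, b, p in zip(weights, betas, ps)], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.poisson(3.0, size=(5, 4))          # toy count data (e.g. word counts)
    weights = np.array([0.6, 0.4])             # mixing proportions
    betas = [2.0, 5.0]                         # per-component dispersions
    ps = [np.full(4, 0.2),                     # per-component event probabilities
          np.array([0.1, 0.3, 0.2, 0.1])]
    print(responsibilities(X, weights, betas, ps))
```

In the proposed approach, the component weights, dispersions, and event probabilities would be re-estimated at each iteration by the minorization-maximization scheme rather than fixed by hand as in this toy example.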

Notes

  1. https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge.
  2. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.
  3. https://mmifacedb.eu/.

Author information

Corresponding author

Correspondence to Ornela Bregu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Bregu, O., Zamzami, N., Bouguila, N. (2021). Mixture-Based Unsupervised Learning for Positively Correlated Count Data. In: Nguyen, N.T., Chittayasothorn, S., Niyato, D., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2021. Lecture Notes in Computer Science, vol 12672. Springer, Cham. https://doi.org/10.1007/978-3-030-73280-6_12

  • DOI: https://doi.org/10.1007/978-3-030-73280-6_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-73279-0

  • Online ISBN: 978-3-030-73280-6

  • eBook Packages: Computer Science (R0)
