Mixture-Based Unsupervised Learning for Positively Correlated Count Data

  • Conference paper
  • First Online:
Intelligent Information and Database Systems (ACIIDS 2021)

Part of the book series: Lecture Notes in Computer Science (LNAI, volume 12672)

Abstract

The Multinomial distribution has been widely used to model count data. However, its underlying independence (Naive Bayes) assumption often degrades clustering performance, especially when features are strongly correlated, as in text documents. In this paper, we use the Negative Multinomial distribution to perform clustering based on finite mixture models, whose parameters are estimated with a novel minorization-maximization algorithm that remains effective in high-dimensional optimization settings. Furthermore, we integrate a model-based feature selection approach to determine the optimal number of components in the mixture. To evaluate the clustering performance of the proposed model, three real-world applications are considered: COVID-19 analysis, Web page clustering, and facial expression recognition.
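
To make the modelling idea concrete, the sketch below shows, in Python, how a finite Negative Multinomial mixture could score count vectors and assign them softly to components. It is an illustrative sketch only, not the authors' implementation: the parameterisation (a per-component dispersion beta and event probabilities p with p0 = 1 - sum(p)), the function names, and the toy data are assumptions made for this example, and the paper's minorization-maximization updates and model selection step are not reproduced here.

```python
# Illustrative sketch only (not the paper's algorithm): scoring count vectors
# under a finite Negative Multinomial mixture and computing soft assignments.
# Parameter names (beta, p) and the toy data below are assumptions.

import numpy as np
from scipy.special import gammaln


def neg_multinomial_logpmf(X, beta, p):
    """Log-pmf of a Negative Multinomial for each row of the count matrix X.

    X    : (n, D) array of non-negative integer counts
    beta : dispersion parameter (> 0)
    p    : (D,) event probabilities with 0 < p.sum() < 1
    """
    p0 = 1.0 - p.sum()                     # probability of the stopping event
    totals = X.sum(axis=1)
    return (gammaln(beta + totals) - gammaln(beta)
            - gammaln(X + 1.0).sum(axis=1)
            + beta * np.log(p0)
            + (X * np.log(p)).sum(axis=1))


def responsibilities(X, weights, betas, ps):
    """Posterior probability that each row of X belongs to each mixture component."""
    log_r = np.stack([np.log(w) + neg_multinomial_logpmf(X, b, p)
                      for w, b, p in zip(weights, betas, ps)], axis=1)
    log_r -= log_r.max(axis=1, keepdims=True)   # stabilise before exponentiating
    r = np.exp(log_r)
    return r / r.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.poisson(3.0, size=(5, 4))          # toy count data (e.g. word counts)
    weights = np.array([0.6, 0.4])             # mixing proportions
    betas = [2.0, 5.0]                         # per-component dispersions
    ps = [np.full(4, 0.2),                     # per-component event probabilities
          np.array([0.1, 0.3, 0.2, 0.1])]
    print(responsibilities(X, weights, betas, ps))
```

In the proposed approach, the component weights, dispersions, and event probabilities would be re-estimated at each iteration by the minorization-maximization scheme rather than fixed by hand as in this toy example.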

Notes

  1. https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge.
  2. http://www.cs.cmu.edu/afs/cs.cmu.edu/project/theo-20/www/data/webkb-data.
  3. https://mmifacedb.eu/.

Author information

Corresponding author

Correspondence to Ornela Bregu.

Copyright information

© 2021 Springer Nature Switzerland AG

About this paper

Cite this paper

Bregu, O., Zamzami, N., Bouguila, N. (2021). Mixture-Based Unsupervised Learning for Positively Correlated Count Data. In: Nguyen, N.T., Chittayasothorn, S., Niyato, D., Trawiński, B. (eds) Intelligent Information and Database Systems. ACIIDS 2021. Lecture Notes in Computer Science, vol 12672. Springer, Cham. https://doi.org/10.1007/978-3-030-73280-6_12

  • DOI: https://doi.org/10.1007/978-3-030-73280-6_12

  • Published:

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-030-73279-0

  • Online ISBN: 978-3-030-73280-6

  • eBook Packages: Computer Science (R0)
