Model selection and application to high-dimensional count data clustering via finite EDCM mixture models

Abstract

EDCM, the exponential-family approximation to the Dirichlet Compound Multinomial (DCM) proposed by Elkan (2006), is an efficient statistical model for high-dimensional and sparse count data. EDCM models capture the burstiness phenomenon correctly while being many times faster to fit than the DCM. This work proposes the use of the Minimum Message Length (MML) criterion for determining the number of components that best describes the data with a finite EDCM mixture model. Parameter estimation is based on the previously proposed Deterministic Annealing Expectation-Maximization (DAEM) approach. The proposed unsupervised algorithm is validated on several real applications: text document modeling, topic novelty detection, and hierarchical image clustering. A comparison with results obtained using other information-theoretic selection criteria is provided.
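
As a rough illustration of the selection procedure summarized above (not part of the original article), the sketch below shows how an MML criterion is typically used to choose the number of mixture components. The callables `fit_mixture` and `message_length` are hypothetical placeholders standing in for the paper's DAEM estimation step and its MML message-length expression.

```python
def select_num_components(X, candidate_M, fit_mixture, message_length):
    """Fit a finite mixture for each candidate number of components M and keep
    the model whose two-part message length is shortest. `fit_mixture` and
    `message_length` are hypothetical stand-ins for the DAEM estimation step
    and the MML criterion derived in the paper."""
    best_M, best_length, best_params = None, float("inf"), None
    for M in candidate_M:
        params = fit_mixture(X, M)           # e.g. DAEM estimation of an M-component EDCM mixture
        length = message_length(X, params)   # code length of parameters + data given the model
        if length < best_length:
            best_M, best_length, best_params = M, length, params
    return best_M, best_params
```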

Notes

  1. http://kdd.ics.uci.edu/databases/reuters21578

  2. https://cs.nyu.edu/~roweis/data.html

  3. http://www.cs.cmu.edu/~webkb

  4. Both data sets are available at: http://www.cad.zju.edu.cn/home/dengcai/Data/TextData

References

  1. Aggarwal CC, Zhai C (2012) An introduction to text mining. In: Mining text data. Springer, pp 1–10

  2. Akaike H (1974) A new look at the statistical model identification. IEEE Trans Autom Control 19(6):716–723

  3. Amayri O, Bouguila N (2013) Online news topic detection and tracking via localized feature selection. In: Proceedings of the international joint conference on neural networks (IJCNN). IEEE, pp 1–8

  4. Banerjee A, Basu S (2007) Topic models over text streams: a study of batch and online unsupervised learning. In: Proceedings of the 2007 SIAM international conference on data mining. SIAM, pp 431–436

  5. Banerjee A, Merugu S, Dhillon IS, Ghosh J (2005) Clustering with Bregman divergences. J Mach Learn Res 6:1705–1749

  6. Baxter RA, Oliver JJ (2000) Finding overlapping components with MML. Stat Comput 10(1):5–16

  7. Bhatia SK, Deogun JS (1998) Conceptual clustering in information retrieval. IEEE Trans Syst Man Cybern B Cybern 28(3):427–436

  8. Bishop CM, Tipping ME (1998) A hierarchical latent variable model for data visualization. IEEE Trans Pattern Anal Mach Intell 20(3):281–293

  9. Blashfield RK (1976) Mixture model tests of cluster analysis: accuracy of four agglomerative hierarchical methods. Psychol Bull 83(3):377

  10. Bouguila N (2011) Count data modeling and classification using finite mixtures of distributions. IEEE Trans Neural Netw 22(2):186–198

  11. Bouguila N, Ziou D (2006) Online clustering via finite mixtures of Dirichlet and minimum message length. Eng Appl Artif Intell 19(4):371–379

  12. Bouguila N, Ziou D (2006) Unsupervised selection of a finite Dirichlet mixture model: an MML-based approach. IEEE Trans Knowl Data Eng 18(8):993–1009

  13. Bouguila N, Ziou D (2007) High-dimensional unsupervised selection and estimation of a finite generalized Dirichlet mixture model based on minimum message length. IEEE Trans Pattern Anal Mach Intell 29(10):1716–1731

  14. Bouguila N, Ziou D (2007) Unsupervised learning of a finite discrete mixture: applications to texture modeling and image databases summarization. J Vis Commun Image Represent 18(4):295–309

  15. Boutemedjet S, Ziou D (2012) Predictive approach for user long-term needs in content-based image suggestion. IEEE Trans Neural Netw Learn Syst 23(8):1242–1253

  16. Brown LD (1986) Fundamentals of statistical exponential families: with applications in statistical decision theory. In: Lecture notes-monograph series, vol 9

  17. Cha SH (2007) Comprehensive survey on distance/similarity measures between probability density functions. City 1(2):1

  18. Church KW, Gale WA (1995) Poisson mixtures. Nat Lang Eng 1(2):163–190

  19. Conway JH, Sloane NJA (2013) Sphere packings, lattices and groups, vol 290. Springer Science & Business Media, Berlin

  20. Cover TM, Thomas JA (2012) Elements of information theory. Wiley, Hoboken

  21. Csurka G, Dance C, Fan L, Willamowski J, Bray C (2004) Visual categorization with bags of keypoints. In: Proceedings of the workshop on statistical learning in computer vision (ECCV), vol 1. Prague, pp 1–2

  22. DasGupta A (2011) The exponential family and statistical applications. In: Probability for statistics and machine learning. Springer, pp 583–612

  23. Dhillon IS, Modha DS (2001) Concept decompositions for large sparse text data using clustering. Mach Learn 42(1-2):143–175

  24. Duda RO, Hart PE, Stork DG (2012) Pattern classification. Wiley, Hoboken

  25. Edelbrock C, McLaughlin B (1980) Hierarchical cluster analysis using intraclass correlations: a mixture model study. Multivar Behav Res 15(3):299–318

  26. Elkan C (2003) Using the triangle inequality to accelerate k-means. In: Proceedings of the 20th international conference on machine learning (ICML), pp 147–153

  27. Elkan C (2006) Clustering documents with an exponential-family approximation of the Dirichlet compound multinomial distribution. In: Proceedings of the 23rd international conference on machine learning (ICML). ACM, pp 289–296

  28. Everitt BS (1996) An introduction to finite mixture distributions

  29. Fei-Fei L, Perona P (2005) A Bayesian hierarchical model for learning natural scene categories. In: Proceedings of the IEEE computer society conference on computer vision and pattern recognition (CVPR), vol 2. IEEE, pp 524–531

  30. Figueiredo MAT, Jain AK (2002) Unsupervised learning of finite mixture models. IEEE Trans Pattern Anal Mach Intell 24(3):381–396

  31. Frigui H, Krishnapuram R (1999) A robust competitive clustering algorithm with applications in computer vision. IEEE Trans Pattern Anal Mach Intell 21(5):450–465

  32. Goldwater S, Griffiths T, Johnson M (2006) Interpolating between types and tokens by estimating power-law generators. Adv Neural Inf Proces Syst 18:459–467

  33. Graybill FA (1983) Matrices with applications in statistics. Wadsworth

  34. Guha S, Rastogi R, Shim K (1998) CURE: an efficient clustering algorithm for large databases. In: Proceedings of the ACM SIGMOD record, vol 27. ACM, pp 73–84

  35. Handcock MS, Raftery AE, Tantrum JM (2007) Model-based clustering for social networks. J R Stat Soc A Stat Soc 170(2):301–354

  36. Hatzivassiloglou V, Gravano L, Maganti A (2000) An investigation of linguistic features and clustering algorithms for topical document clustering. In: Proceedings of the 23rd annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 224–231

  37. Heller KA, Ghahramani Z (2005) Bayesian hierarchical clustering. In: Proceedings of the 22nd international conference on machine learning. ACM, pp 297–304

  38. Hershey JR, Olsen PA (2007) Approximating the Kullback-Leibler divergence between Gaussian mixture models. In: Proceedings of the IEEE international conference on acoustics, speech and signal processing (ICASSP), vol 4. IEEE, pp IV–317

  39. Iwayama M, Tokunaga T (1995) Cluster-based text categorization: a comparison of category search strategies. In: Proceedings of the 18th annual international ACM SIGIR conference on research and development in information retrieval. ACM, pp 273–280

  40. Jain AK (2010) Data clustering: 50 years beyond k-means. Pattern Recogn Lett 31(8):651–666

  41. Jain AK, Flynn PJ (1996) Image segmentation using clustering. IEEE Press, Piscataway

  42. Jefferys WH, Berger JO (1992) Ockham’s razor and Bayesian analysis. Am Sci 80(1):64–72

  43. Jensen JH, Ellis DP, Christensen MG, Jensen SH (2007) Evaluation of distance measures between Gaussian mixture models of MFCCs. In: ISMIR, pp 107–108

  44. Katz SM (1996) Distribution of content words and phrases in text and language modelling. Nat Lang Eng 2(1):15–59

  45. Kingman JFC (1993) Poisson processes. Wiley Online Library

  46. Krizhevsky A, Hinton G (2009) Learning multiple layers of features from tiny images. Tech. rep., University of Toronto

  47. Kullback S (1997) Information theory and statistics. Courier Corporation

  48. Kullback S, Leibler RA (1951) On information and sufficiency. Ann Math Stat 22(1):79–86

  49. Lin YS, Jiang JY, Lee SJ (2014) A similarity measure for text classification and clustering. IEEE Trans Knowl Data Eng 26(7):1575–1590

  50. Lowe DG (2004) Distinctive image features from scale-invariant keypoints. Int J Comput Vis 60(2):91–110

  51. Madsen RE, Kauchak D, Elkan C (2005) Modeling word burstiness using the Dirichlet distribution. In: Proceedings of the 22nd international conference on machine learning (ICML). ACM, pp 545–552

  52. Margaritis D, Thrun S (2001) A Bayesian multiresolution independence test for continuous variables. In: Proceedings of the seventeenth conference on uncertainty in artificial intelligence. Morgan Kaufmann Publishers Inc., pp 346–353

  53. McCallum A, Nigam K, et al. (1998) A comparison of event models for naive Bayes text classification. In: AAAI-98 workshop on learning for text categorization, vol 752. Citeseer, pp 41–48

  54. McCallum AK (1996) Bow: A toolkit for statistical language modeling, text retrieval, classification and clustering. http://www.cs.cmu.edu/~mccallum/bow

  55. McLachlan G, Krishnan T (2007) The EM algorithm and extensions, vol 382. Wiley, Hoboken

  56. McLachlan G, Peel D (2004) Finite mixture models. Wiley, Hoboken

  57. Minka TP (2003) Estimating a Dirichlet distribution. Technical report

  58. Mosimann JE (1962) On the compound multinomial distribution, the multivariate β-distribution, and correlations among proportions. Biometrika 49(1/2):65–82

  59. Olech ŁP, Paradowski M (2016) Hierarchical Gaussian mixture model with objects attached to terminal and non-terminal dendrogram nodes. In: Proceedings of the 9th international conference on computer recognition systems (CORES). Springer, pp 191–201

  60. Pimentel MA, Clifton DA, Clifton L, Tarassenko L (2014) A review of novelty detection. Signal Process 99:215–249

  61. Rennie JDM, Shih L, Teevan J, Karger DR (2003) Tackling the poor assumptions of naive Bayes text classifiers. In: Proceedings of the twentieth international conference on machine learning (ICML), vol 3, pp 616–623

  62. Rissanen J (1978) Modeling by shortest data description. Automatica 14(5):465–471

  63. Sahami M, Koller D (1998) Using machine learning to improve information access. Ph.D. thesis, Stanford University Department of Computer Science

  64. Sandler M (2007) Hierarchical mixture models: a probabilistic analysis. In: Proceedings of the 13th international conference on knowledge discovery and data mining (SIGKDD). ACM, pp 580–589

  65. Singh S, Markou M (2004) An approach to novelty detection applied to the classification of image regions. IEEE Trans Knowl Data Eng 16(4):396–407

  66. Sneath PH, Sokal RR, et al. (1973) Numerical taxonomy. The principles and practice of numerical classification. WH Freeman and Company, San Francisco

  67. Timande N, Chandak M, Kamble M (2014) Document clustering with feature selection using dirichlet process mixture model and dirichlet multinomial allocation model. International Journal of Engineering Research and Applications 10–16. https://pdfs.semanticscholar.org/6a0f/2d593688780a73e12e171ee157bb94b037cc.pdf

  68. Ueda N, Nakano R (1998) Deterministic annealing EM algorithm. Neural Netw 11(2):271–282

  69. Wallace CS (1990) Classification by minimum-message-length inference. In: Proceedings of the international conference on computing and information. Springer, pp 72–81

  70. Wallace CS (2005) Statistical and inductive inference by minimum message length. Springer Science & Business Media

  71. Wallace CS, Boulton DM (1968) An information measure for classification. Comput J 11(2):185–194

  72. Wallace CS, Dowe DL (2000) MML clustering of multi-state, Poisson, von Mises circular and Gaussian distributions. Stat Comput 10(1):73–83

  73. Wallace CS, Freeman PR (1987) Estimation and inference by compact coding. J R Stat Soc Ser B Methodol 49(3):240–265

  74. Yang MH, Ahuja N (1998) Gaussian mixture model for human skin color and its applications in image and video databases. In: Storage and retrieval for image and video databases VII, vol 3656. International Society for Optics and Photonics, pp 458–467

  75. Yao B, Fei-Fei L (2010) Grouplet: a structured image representation for recognizing human and object interactions. In: Proceedings of the IEEE conference on computer vision and pattern recognition (CVPR). IEEE, pp 9–16

  76. Yao JF (2000) On recursive estimation in incomplete data models. Stat A J Theor Appl Stat 34(1):27–51

  77. Zamzami N, Bouguila N (2018) MML-based approach for determining the number of topics in EDCM mixture models. In: Proceedings of the 31st Canadian conference on artificial intelligence (CAI). Springer

  78. Zhang J, Ghahramani Z, Yang Y (2005) A probabilistic model for online document clustering with application to novelty detection. In: Advances in neural information processing systems, pp 1617–1624

  79. Zhao Y, Karypis G, Fayyad U (2005) Hierarchical clustering algorithms for document datasets. Data Min Knowl Disc 10(2):141–168

Author information

Corresponding author

Correspondence to Nuha Zamzami.

Appendices

Appendix A: Proof of (16)

We write the negative log-likelihood function as:

$$\begin{array}{@{}rcl@{}} -\mathcal{L}(\mathcal{X}_{j}|\boldsymbol{\varphi}_{j})&=&-\log \left( \prod\limits_{d=l}^{l+\eta_{j}-1} \mathcal{EDCM}(\mathbf{X}_{d}|\boldsymbol{\varphi}_{j}) \right) \\ &=&\eta_{j}(-\log {\Gamma}(s_{j}))+\sum\limits_{d=l}^{l+\eta_{j}-1} \log {\Gamma}(s_{j}+n_{d}) \\ &&-\sum\limits_{d=l}^{l+\eta_{j}-1} \sum\limits_{w:x_{dw}\geq1} \left( \log \varphi_{jw} - \log x_{dw} \right) \end{array} $$
(30)

Then, the first-order derivative of the negative log-likelihood with respect to φjw (i.e., the negative of the Fisher score function) is:

$$\begin{array}{@{}rcl@{}} -\frac{\partial \mathcal{L}(\mathcal{X}_{j}|\varphi_{j})}{\partial\varphi_{jw}}&=&\eta_{j}(-{\Psi}(s_{j}))+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}(s_{j}+n_{d}) \\ && -\sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{dw} \geq 1) \frac{1}{\varphi_{jw}} \end{array} $$
(31)

where Ψ is the digamma function. Then,

$$\begin{array}{@{}rcl@{}} -\frac{\partial^{2} \mathcal{L}(\mathcal{X}_{j}|\varphi_{j})}{\partial\varphi^{2}_{jw}}&=&\eta_{j}(-{\Psi}^{\prime}(s_{j}))+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}^{\prime}(s_{j}+n_{d}) \\ &&+\sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{dw} \geq 1) \frac{1}{\varphi_{jw}^{2}} \end{array} $$
(32)

and:

$$\begin{array}{@{}rcl@{}} -\frac{\partial^{2} \mathcal{L}(\mathcal{X}_{j}|\varphi_{j})}{\partial\varphi_{jw_{1}} \partial\varphi_{jw_{2}}}&=& \eta_{j}(-{\Psi}^{\prime}(s_{j}))\\ &&+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}^{\prime}(s_{j}+n_{d}) , \quad w_{1} \neq w_{2} \end{array} $$
(33)

where Ψ′ is the trigamma function. We remark that the Fisher information matrix F(φj) can be written as:

$$ F(\boldsymbol{\varphi}_{j})=D_{j} +\gamma_{j} \mathbf{AA}^{T} $$
(34)

where \(D_{j}=\mathrm{diag}\left[\sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{d1} \geq 1) \frac{1}{\varphi_{j1}^{2}}, \dots, \sum\limits_{d=l}^{l+\eta_{j}-1} I(x_{dW} \geq 1) \frac{1}{\varphi_{jW}^{2}}\right]\), \(\gamma_{j}=\eta_{j}(-{\Psi}^{\prime}(s_{j}))+\sum\limits_{d=l}^{l+\eta_{j}-1} {\Psi}^{\prime}(s_{j}+n_{d})\), and \(\mathbf{A}^{T}=\mathbf{1}\), the W-dimensional vector of ones. Then, according to Theorem 8.4.3 of Graybill [33], the determinant of the Fisher information matrix \(F(\boldsymbol{\varphi}_{j})\) is:

$$ |F(\boldsymbol{\varphi}_{j})|=\left( 1+\gamma_{j} \sum\limits_{w = 1}^{W} \frac{a_{jw}^{2}}{D_{jw}} \right) \prod\limits_{w = 1}^{W} D_{jw} $$
(35)

By substituting (35) and (15) into (14), we obtain:

$$ |F({\Theta})|\simeq\frac{N}{\prod\limits_{j = 1}^{M} \mu_{j}} \prod\limits_{j = 1}^{M} \left[ \left( 1+\gamma_{j} \sum\limits_{w = 1}^{W} \frac{a_{jw}^{2}}{D_{jw}} \right) \prod\limits_{w = 1}^{W} D_{jw} \right] $$
(36)

Then, taking the log gives (16).
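
The determinant identity used in (35) is the matrix determinant lemma for a diagonal matrix plus a rank-one term. The following sketch is not part of the original article; it uses synthetic placeholder data and parameter values to build F(φj) = Dj + γj AA^T as defined above and checks the directly computed determinant against the closed form of (35).

```python
import numpy as np
from scipy.special import polygamma

rng = np.random.default_rng(0)
W, eta = 20, 50                                   # vocabulary size W, number of documents eta_j
phi = rng.uniform(0.1, 2.0, size=W)               # placeholder EDCM parameters phi_jw > 0
s = phi.sum()
X = rng.poisson(0.3, size=(eta, W))               # synthetic sparse count data
n = X.sum(axis=1)                                 # document lengths n_d

I_dw = (X >= 1).astype(float)                     # indicators I(x_dw >= 1)
D = I_dw.sum(axis=0) / phi**2                     # diagonal entries of D_j
gamma = -eta * polygamma(1, s) + polygamma(1, s + n).sum()   # gamma_j (trigamma terms)
a = np.ones(W)                                    # A, so that A^T = 1

F = np.diag(D) + gamma * np.outer(a, a)           # Fisher information matrix, Eq. (34)
lhs = np.linalg.det(F)                            # determinant computed directly
rhs = (1.0 + gamma * np.sum(a**2 / D)) * np.prod(D)   # closed form, Eq. (35)
print(np.isclose(lhs, rhs, rtol=1e-6))            # expected: True
```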

Appendix B: Proof of (29)

The KL divergence between two distributions that belong to the exponential family is defined as [16]:

$$ KL \left( P(X|\theta_{j1}),P(X|\theta_{j2})\right)={\Phi}(\theta_{j1})-{\Phi}(\theta_{j2})+\left( G(\theta_{j1})-G(\theta_{j2})\right) E_{\theta_{j1}}[T(X)] $$
(37)

where Eθ denotes the expectation with respect to P(X|θ). For the EDCM, we have:

$$\begin{array}{@{}rcl@{}} {\Phi}(\theta)&=&\log {\Gamma}(s)-\log {\Gamma}(s+n) \end{array} $$
(38)
$$\begin{array}{@{}rcl@{}} G(\theta)&=&\log(\varphi_{jw}) \end{array} $$
(39)
$$\begin{array}{@{}rcl@{}} T(X)&=&I(x_{w} \geq 1) \end{array} $$
(40)

Moreover, we have the following [16]:

$$ E_{\theta}[T(X)]=-{\Phi}^{\prime}(\theta) $$
(41)

Thus, according to (38), (40), and (41), we have:

$$ E_{\theta}\left[I(x_{w} \geq 1)\right]=-\frac{\partial {\Phi}(\theta)}{\partial \varphi_{jw}}= {\Psi}(s+n)-{\Psi}(s) $$
(42)

Finally, substituting (38), (39), and (42) into (37) gives (29).
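
For illustration, the closed-form divergence implied by (37)-(42) can be transcribed directly into code. The sketch below is not from the original article: the function name `edcm_kl` and the example parameter values are hypothetical, and the expressions follow the appendix exactly as stated (with Φ(θ) = log Γ(s) − log Γ(s+n)), rather than a validated implementation of (29).

```python
import numpy as np
from scipy.special import gammaln, digamma

def edcm_kl(phi1, phi2, n):
    """KL divergence between two EDCM components, transcribing (37)-(42):
    Phi(theta)    = log Gamma(s) - log Gamma(s + n), s = sum_w phi_w        (38)
    G_w(theta)    = log phi_w                                               (39)
    T_w(X)        = I(x_w >= 1)                                             (40)
    E_theta1[T_w] = digamma(s1 + n) - digamma(s1)                           (42)
    KL = Phi1 - Phi2 + sum_w (G_w(theta1) - G_w(theta2)) * E_theta1[T_w]    (37)
    """
    phi1 = np.asarray(phi1, dtype=float)
    phi2 = np.asarray(phi2, dtype=float)
    s1, s2 = phi1.sum(), phi2.sum()
    Phi1 = gammaln(s1) - gammaln(s1 + n)
    Phi2 = gammaln(s2) - gammaln(s2 + n)
    expected_T = digamma(s1 + n) - digamma(s1)    # per (42), identical for every word w
    return Phi1 - Phi2 + float(np.sum((np.log(phi1) - np.log(phi2)) * expected_T))

# Hypothetical example: two 5-word components, documents of length n = 30
print(edcm_kl([0.20, 0.10, 0.40, 0.05, 0.25], [0.30, 0.30, 0.10, 0.20, 0.10], n=30))
```

With identical parameter vectors the function returns zero, as expected for a divergence.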

Cite this article

Zamzami, N., Bouguila, N. Model selection and application to high-dimensional count data clustering via finite EDCM mixture models. Appl Intell 49, 1467–1488 (2019). https://doi.org/10.1007/s10489-018-1333-9
