Model selection for Gaussian latent block clustering with the integrated classification likelihood

Abstract

Block clustering aims to reveal homogeneous block structures in a data table. Among the different approaches to block clustering, we consider here a model-based method: the Gaussian latent block model for continuous data, which extends the Gaussian mixture model for one-way clustering. For a given data table, several candidate models are usually examined, differing for example in the number of clusters. Model selection then becomes a critical issue. To this end, we develop a criterion based on an approximation of the integrated classification likelihood for the Gaussian latent block model, and propose a Bayesian information criterion-like variant following the same pattern. We also propose a non-asymptotic exact criterion, thus circumventing the controversial definition of the asymptotic regime arising from the dual nature of the rows and columns in co-clustering. The experimental results show the steady performance of these criteria for medium to large data tables.

References

  • Banerjee A, Dhillon I, Ghosh J, Merugu S (2007) A generalized maximum entropy approach to Bregman co-clustering and matrix approximation. J Mach Learn Res 8:1919–1986

  • Berkhin P (2006) A survey of clustering data mining techniques. Springer, Berlin

  • Biernacki C, Celeux G, Govaert G (1998) Assessing a mixture model for clustering with the integrated classification likelihood. Tech. rep., INRIA

  • Biernacki C, Celeux G, Govaert G (2000) Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Trans Pattern Anal Mach Intell 22(7):719–725

  • Biernacki C, Celeux G, Govaert G (2010) Exact and Monte Carlo calculations of integrated likelihoods for the latent class model. J Stat Plan Infer 140(11):2991–3002

  • Charrad M, Lechevallier Y, Saporta G, Ben Ahmed M (2010) Détermination du nombre de classes dans les méthodes de bipartitionnement [Determining the number of classes in bipartitioning methods]. In: 17ème Rencontres de la Société Francophone de Classification, Saint-Denis de la Réunion, pp 119–122

  • Daudin JJ, Picard F, Robin S (2008) A mixture model for random graphs. Stat Comput 18(2):173–183

  • Fraley C, Raftery AE (1998) How many clusters? Which clustering method? Answers via model-based cluster analysis. Comput J 41(8):578–588

  • Gelman A, Carlin JB, Stern HS, Rubin DB (2004) Bayesian data analysis. CRC, Boca Raton

  • Good IJ (1965) Categorization of classification. In: Mathematics and Computer Science in Biology and Medicine. Her Majesty’s Stationery Office

  • Govaert G (1977) Algorithme de classification d’un tableau de contingence [A clustering algorithm for a contingency table]. In: First international symposium on data analysis and informatics, INRIA, Versailles

  • Govaert G (1995) Simultaneous clustering of rows and columns. Control Cybern 24(4):437–458

  • Govaert G, Nadif M (2003) Clustering with block mixture models. Pattern Recogn 36:463–473

  • Hartigan JA (1972) Direct clustering of a data matrix. J Am Stat Assoc 67:123–129

  • Hartigan JA (2000) Bloc voting in the United States senate. J Classif 17(1):29–49

  • Jagalur M, Pal C, Learned-Miller E, Zoeller RT, Kulp D (2007) Analyzing in situ gene expression in the mouse brain with image registration, feature extraction and block clustering. BMC Bioinforma 8(Suppl 10):S5

  • Kemp C, Griffiths TL, Tenenbaum JB (2004) Discovering latent classes in relational data. Tech. rep., Computer Science and Artificial Intelligence Laboratory

  • Keribin C, Brault V, Celeux G, Govaert G (2012) Model selection for the binary latent block model. In: Colubi A, Fokianos K, Gonzalez-Rodriguez G, Kontoghiorghes EJ (eds) Proceedings of Compstat 2012, 20th international conference on computational statistics, The International Statistical Institute/International Association for Statistical Computing, pp 379–390

  • Keribin C, Brault V, Celeux G, Govaert G (2013) Estimation and selection for the latent block model on categorical data. Tech. rep., INRIA

  • Kluger Y, Basri R, Chang JT, Gerstein M (2003) Spectral biclustering of microarray data: coclustering genes and conditions. Genome Res 13(4):703–716

  • Lomet A, Govaert G, Grandvalet Y (2012a) Design of artificial data tables for co-clustering analysis. Tech. rep., Université de Technologie de Compiègne

  • Lomet A, Govaert G, Grandvalet Y (2012b) Model selection in block clustering by the integrated classification likelihood. In: Colubi A, Fokianos K, Gonzalez-Rodriguez G, Kontoghiorghes EJ (eds) Proceedings of Compstat 2012, 20th international conference on computational statistics, The International Statistical Institute/International Association for Statistical Computing, pp 519–530

  • Mariadassou M, Matias C (2012) Convergence of the groups posterior distribution in latent or stochastic block models. Tech. rep., arXiv

  • McLachlan GJ, Peel D (2000) Finite mixture models. Wiley, New York

  • Nadif M, Govaert G (2008) Algorithms for model-based block Gaussian clustering. In: DMIN’08, the 2008 international conference on data mining, Las Vegas, Nevada, USA

  • Richardson S, Green PJ (1997) On Bayesian analysis of mixtures with an unknown number of components (with discussion). J R Stat Soc Ser B Stat Methodol 59(4):731–792

  • Robert C (2001) The Bayesian choice. Springer, Berlin

  • Rocci R, Vichi M (2008) Two-mode multi-partitioning. Comput Stat Data Anal 52(4):1984–2003

  • Schepers J, Ceulemans E, Van Mechelen I (2008) Selecting among multi-mode partitioning models of different complexities: a comparison of four model selection criteria. J Classif 25(1):67–85

  • Seldin Y, Tishby N (2010) PAC-Bayesian analysis of co-clustering and beyond. J Mach Learn Res 11:3595–3646

  • Shan H, Banerjee A (2008) Bayesian co-clustering. In: 8th IEEE international conference on data mining (ICDM’08), pp 530–539

  • Van Dijk B, Van Rosmalen J, Paap R (2009) A Bayesian approach to two-mode clustering. Tech. Rep. 2009–06, Econometric Institute. http://hdl.handle.net/1765/15112

  • Wyse J, Friel N (2012) Block clustering with collapsed latent block models. Stat Comput 22(1):415–428

Acknowledgments

We thank the reviewers and associate editor for their valuable inputs. This work, carried out in the framework of the Labex MS2T (ANR-11-IDEX-0004-02), was partially funded by the French National Agency for Research under grant ClasSel ANR-08-EMER-002 and the European ICT FP7 under grant No 247022-MASH.

Author information

Correspondence to Aurore Lomet.

Appendices

Appendix A: Derivation of the approximation of \(\textit{ICL}\)

The first term of the expansion (2) (\(\log p(\mathbf {X}|\mathbf {z},\mathbf {w},M)\)) can be approximated in a BIC-like fashion, since the table entries are independent conditionally on the row/column partitions:

$$\begin{aligned} \log p(\mathbf {X}|\mathbf {z},\mathbf {w},M)&\approx \max _{\varvec{\alpha }} \log p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) - \frac{\lambda }{2} \log (nd) , \end{aligned}$$

where \(\lambda \) is the dimensionality of the parameter vector \(\varvec{\alpha }\) (that is, of the parameter space \(\mathcal {A}\)).
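
To make the penalty concrete, \(\lambda \) can be spelled out for the two Gaussian parameterisations used in Appendix B: with different variances, each block has a mean and a variance, so \(\lambda = 2gm\); with equal variances, \(\lambda = gm + 1\). A minimal sketch in Python (the helper name bic_like_penalty is ours):

    import numpy as np

    def bic_like_penalty(n, d, g, m, equal_variances=False):
        # lambda = 2*g*m for the different-variances model (one mean and one
        # variance per block); lambda = g*m + 1 when the variance is shared
        lam = g * m + 1 if equal_variances else 2 * g * m
        return 0.5 * lam * np.log(n * d)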

The two terms \(\log p(\mathbf {z}|M)\) and \(\log p(\mathbf {w}|M)\) can be computed exactly by placing conjugate prior distributions on \(\varvec{\pi }\) and \(\varvec{\rho }\) when the proportion parameters are free. Indeed, a symmetric Dirichlet distribution \(\mathcal {D}(\delta ,\ldots ,\delta )\) yields:

$$\begin{aligned} p(\mathbf {z}|M)&= \int _{\mathcal {P}} \pi _1^{n_1} \cdots \pi _g^{n_g} \frac{\varGamma (g \delta )}{\varGamma ( \delta )\cdots \varGamma (\delta )} \mathbf {1}_{\sum _k \pi _k=1} d\varvec{\pi }, \\&= \frac{\varGamma ( g\delta )}{\varGamma ( \delta )^{g}} \frac{\varGamma ( \delta + n_1)\cdots \varGamma (\delta + n_g)}{\varGamma (n + g \delta )} \end{aligned}$$

where \(n_k\) is the number of rows in cluster \(k\). The details of the calculation are given by Robert (2001).

Using non-informative Jeffreys prior distributions for the proportion parameters (\(\delta =1/2\)), the log-priors are:

$$\begin{aligned} \log p(\mathbf {z}|M)&= \log \varGamma \left( \frac{g}{2}\right) + \sum _{k=1}^g \log \varGamma (n_k +\frac{1}{2}) - g \log \varGamma \left( \frac{1}{2}\right) - \log \varGamma (n+\frac{g}{2}),\\ \log p(\mathbf {w}|M)&= \log \varGamma \left( \frac{m}{2}\right) + \sum _{\ell =1}^m \log \varGamma (d_\ell +\frac{1}{2}) - m \log \varGamma \left( \frac{1}{2}\right) - \log \varGamma (d+\frac{m}{2}). \end{aligned}$$
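
These exact log-priors are straightforward to evaluate with log-gamma functions. A minimal sketch (the helper name is ours); it applies to rows with counts \(n_k\) and to columns with counts \(d_\ell \) alike, and \(\delta = 1/2\) recovers the Jeffreys case:

    import numpy as np
    from scipy.special import gammaln

    def log_prior_partition(counts, delta=0.5):
        # Exact log p(z|M) for cluster sizes `counts` under a symmetric
        # Dirichlet(delta, ..., delta) prior on the proportions
        counts = np.asarray(counts, dtype=float)
        g, n = counts.size, counts.sum()
        return (gammaln(g * delta) - g * gammaln(delta)
                + gammaln(delta + counts).sum()
                - gammaln(n + g * delta))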

Because \((\mathbf {z}, \mathbf {w})\) are unknown, we replace them by the estimates \((\hat{\mathbf {z}}, \hat{\mathbf {w}})\) obtained by the VEM algorithm. When \(\hat{n}_k\) and \(\hat{d}_\ell \) are large enough, the Gamma function can be approximated using Stirling’s formula \( \varGamma (t+1) \approx t^{t+1/2} \exp (-t) (2\pi )^{1/2}\). Neglecting terms of order \(O(1)\), the log-prior distributions are then approximated as follows:

$$\begin{aligned} \log p(\hat{\mathbf {z}}|M)&\approx \sum _{k=1}^g \hat{n}_k \log \hat{n}_k - n \log n - \frac{1}{2}(g-1) \log n, \\ \log p(\hat{\mathbf {w}}|M)&\approx \sum _{\ell =1}^m \hat{d}_\ell \log \hat{d}_\ell - d \log d - \frac{1}{2}(m-1) \log d. \end{aligned}$$

In addition, \(\sum _{k=1}^g \hat{n}_k \log \frac{\hat{n}_k}{n}=\max _{\varvec{\pi }} \log p(\hat{\mathbf {z}}| \varvec{\pi }, M)\) and \(\sum _{\ell =1}^m \hat{d}_\ell \log \frac{\hat{d}_\ell }{d}=\max _{\varvec{\rho }} \log p(\hat{\mathbf {w}}| \varvec{\rho }, M)\) (see Robert 2001; Biernacki et al. 2010). For \(\delta =1/2\), we obtain:

$$\begin{aligned} \log p(\hat{\mathbf {z}}|M)&\approx \max _{\varvec{\pi }}\log p(\hat{\mathbf {z}}| \varvec{\pi }, M) - \frac{g-1}{2} \log n ,\\ \log p(\hat{\mathbf {w}}|M)&\approx \max _{\varvec{\rho }} \log p(\hat{\mathbf {w}}| \varvec{\rho }, M) - \frac{m-1}{2} \log d. \end{aligned}$$
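
This approximation is easy to check numerically; a small sketch reusing numpy and log_prior_partition from the block above (the toy cluster sizes are ours):

    counts = np.array([400., 350., 250.])   # toy cluster sizes n_k
    n, g = counts.sum(), counts.size
    exact = log_prior_partition(counts, delta=0.5)
    approx = (counts * np.log(counts / n)).sum() - 0.5 * (g - 1) * np.log(n)
    print(exact, approx)   # the two values differ by an O(1) term only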

The ICL criterion can then be approximated by:

$$\begin{aligned} ICL(M)&\approx \max _{\varvec{\alpha }} \log p(\mathbf {X}|\hat{\mathbf {z}},\hat{\mathbf {w}},\varvec{\alpha },M) - \frac{\lambda }{2} \log (nd) + \max _{\varvec{\pi }}\log p(\hat{\mathbf {z}}| \varvec{\pi }, M) \\&\quad - \frac{g-1}{2} \log n + \max _{\varvec{\rho }} \log p(\hat{\mathbf {w}}| \varvec{\rho }, M) - \frac{m-1}{2} \log d \\&\approx \max _{\varvec{\theta }} \log p(\mathbf {X},\hat{\mathbf {z}},\hat{\mathbf {w}}|\varvec{\theta },M) - \frac{\lambda }{2} \log (nd)- \frac{g-1}{2} \log n - \frac{m-1}{2} \log d . \end{aligned}$$
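
Put together, the criterion is a one-line computation once the VEM fit is available. A minimal sketch (argument names are ours; max_complete_loglik stands for \(\max _{\varvec{\theta }} \log p(\mathbf {X},\hat{\mathbf {z}},\hat{\mathbf {w}}|\varvec{\theta },M)\)):

    import numpy as np

    def icl_bic(max_complete_loglik, n, d, g, m, lam):
        # BIC-like approximation of ICL for a candidate model with g row
        # clusters, m column clusters and lam free parameters in alpha
        return (max_complete_loglik
                - 0.5 * lam * np.log(n * d)
                - 0.5 * (g - 1) * np.log(n)
                - 0.5 * (m - 1) * np.log(d))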

Appendix B: Derivation of exact \(\textit{ICL}\)

The criterion \(\textit{ICL}\) can be broken down into three terms:

$$\begin{aligned} \textit{ICL} (M)&= \log p(\mathbf {X}|\mathbf {z},\mathbf {w},M) + \log p(\mathbf {z}|M) + \log p(\mathbf {w}|M). \end{aligned}$$

Then, the first term of the expansion (2) is rewritten using the following decomposition:

$$\begin{aligned} p(\mathbf {X}|\mathbf {z},\mathbf {w},M)&= \frac{p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) p(\varvec{\alpha }|M) }{p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M)} , \end{aligned}$$

where \(p(\varvec{\alpha }|M)\) and \(p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M)\) are respectively the prior and posterior distributions of \(\varvec{\alpha }\).

For the latent block model with different variances, given the row and column labels, the entries \(x_{ij}\) of each block are independent and identically distributed. We thus apply the standard results for Gaussian samples (Gelman et al. 2004), where the distributions are defined by:

$$\begin{aligned}&p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) = \prod _{i,j,k, \ell } \left\{ N(x_{ij}; \mu _{k \ell }, \sigma ^2_{k \ell }) \right\} ^{z_{ik}w_{j \ell }}, \\&p(\varvec{\alpha }|M) = \prod _{k, \ell } \left\{ N\left( \mu _{k \ell }; \mu _0,\frac{\sigma ^2_{k \ell }}{\kappa _0}\right) \times \text {Inv}-\chi ^{2} (\sigma ^2_{k \ell }; \nu _0 ,\sigma ^2_0)\right\} , \\&p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M) = \prod _{k, \ell } \left\{ N \left( \mu _{k \ell }; \frac{\kappa _0 \mu _0 + n_k d_{\ell } \bar{x}_{k \ell }}{\kappa _0+n_k d_{\ell }}, \frac{\sigma ^2_{k \ell }}{\kappa _0+n_k d_{ \ell }}\right) \right. \\&\quad \times \left. \text {Inv}-\chi ^2 \left( \sigma ^2_{k \ell }; \nu _0+n_k d_{ \ell } , \frac{\nu _0 \sigma ^2_0 + (n_k d_{ \ell } -1)s^{2\star }_{k \ell } + \frac{\kappa _0 n_k d_{\ell }}{\kappa _0 + n_k d_{ \ell }} (\bar{x}_{k \ell }-\mu _0)^2}{\nu _0+n_k d_{\ell }} \right) \right\} . \end{aligned}$$
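
In code, the posterior hyperparameter updates for a single block read as follows; a minimal sketch (names are ours), assuming x_block is a NumPy array gathering the \(n_k d_\ell \) entries of block \((k,\ell )\):

    import numpy as np

    def block_posterior(x_block, mu0, kappa0, nu0, sigma0_sq):
        # Normal-Inv-chi^2 updates matching the posterior displayed above
        N = x_block.size                       # plays the role of n_k * d_l
        xbar = x_block.mean()
        ss = ((x_block - xbar) ** 2).sum()     # equals (N - 1) * s^{2*}_{kl}
        kappa_n = kappa0 + N
        mu_n = (kappa0 * mu0 + N * xbar) / kappa_n
        nu_n = nu0 + N
        sigma_n_sq = (nu0 * sigma0_sq + ss
                      + kappa0 * N / kappa_n * (xbar - mu0) ** 2) / nu_n
        return mu_n, kappa_n, nu_n, sigma_n_sq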

Using the definitions of these distributions, the first term of the expansion (2),

$$\begin{aligned} \log p(\mathbf {X}|\mathbf {z},\mathbf {w},M) = \log p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M) + \log p(\varvec{\alpha }|M) - \log p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M) , \end{aligned}$$

is identified, after some calculations, as (3).
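
Equation (3) is not reproduced here, but the right-hand side above reduces, block by block, to the standard closed-form marginal likelihood of a Gaussian sample under a Normal-Inv-\(\chi ^2\) prior. A sketch assuming that standard form (the helper is ours; summing it over the \(g \times m\) blocks gives \(\log p(\mathbf {X}|\mathbf {z},\mathbf {w},M)\)):

    import numpy as np
    from scipy.special import gammaln

    def log_marginal_block(x, mu0, kappa0, nu0, sigma0_sq):
        # Closed-form log marginal likelihood of one block under the
        # Normal-Inv-chi^2 prior (standard conjugate result)
        N = x.size
        xbar = x.mean()
        kappa_n, nu_n = kappa0 + N, nu0 + N
        sigma_n_sq = (nu0 * sigma0_sq + ((x - xbar) ** 2).sum()
                      + kappa0 * N / kappa_n * (xbar - mu0) ** 2) / nu_n
        return (gammaln(nu_n / 2) - gammaln(nu0 / 2)
                + 0.5 * np.log(kappa0 / kappa_n)
                + 0.5 * nu0 * np.log(nu0 * sigma0_sq)
                - 0.5 * nu_n * np.log(nu_n * sigma_n_sq)
                - 0.5 * N * np.log(np.pi))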

For the latent block model with equal variances, the standard results need to be adapted to account for the shared parameter \(\sigma ^2\). The likelihood and prior distributions are now defined as follows:

$$\begin{aligned} p(\mathbf {X}|\mathbf {z},\mathbf {w},\varvec{\alpha },M)&= \prod _{i,j,k,\ell } \left\{ N(x_{ij}; \mu _{k \ell }, \sigma ^2) \right\} ^{z_{ik}w_{j \ell }}, \\ p(\varvec{\alpha }|M)&= \prod _{k,\ell } \left\{ N( \mu _{k \ell }; \mu _0,\frac{\sigma ^2}{\kappa _0}) \right\} \times \text {Inv}-\chi ^2 (\sigma ^2; \nu _0 ,\sigma ^2_0). \end{aligned}$$

The posterior distribution is then obtained from Bayes’ formula:

$$\begin{aligned}&p(\varvec{\mu },\sigma ^2 | \mathbf {X}) \propto p(\varvec{\mu },\sigma ^2) p(\mathbf {X}| \varvec{\mu },\sigma ^2) \\&\quad \propto (\sigma ^2)^{-(\frac{\nu _0}{2}+1)} \exp (- \frac{1}{2\sigma ^2}\nu _0 \sigma ^2_0 ) \prod _{k,\ell } \left\{ \sigma ^{-1} \exp \left( -\frac{1}{2\sigma ^2} \kappa _0(\mu _{k\ell }-\mu _0)^2\right) \right\} \\&\qquad \times (\sigma ^2)^{-\frac{nd}{2}} \exp \bigg (-\frac{1}{2\sigma ^2} \underbrace{\sum _{i,j,k,\ell } z_{ik} w_{j\ell } (x_{ij}-\mu _{k\ell })^2}_{\displaystyle (nd-gm) s^{2\star }_{w} + \sum _{k,\ell } n_k d_{\ell } (\mu _{k\ell }-\bar{x}_{k\ell })^2} \bigg ) , \\&\quad = (\sigma ^2)^{-(\frac{\nu _0+nd}{2}+1)} \exp \left( - \frac{1}{2\sigma ^2} (\nu _0 \sigma ^2_0 +(nd-gm) s^{2\star }_{w})\right) \\&\qquad \times \prod _{k,\ell } \left\{ \sigma ^{-1} \exp \left( -\frac{1}{2\sigma ^2} \left( \kappa _0 (\mu _{k\ell } - \mu _0)^2+ n_k d_\ell (\mu _{k\ell }-\bar{x}_{k\ell })^2\right) \right) \right\} \\&\quad = (\sigma ^2)^{-(\frac{\nu _0+nd}{2}+1)} \exp \left( -\frac{\nu _0 \sigma ^2_0 +(nd-gm) s^{2\star }_{w}}{2\sigma ^2} \right) \\&\qquad \times \prod _{k,\ell } \sigma ^{-1} \exp \left( -\frac{\kappa _0+n_k d_\ell }{2\sigma ^2} \left( \mu _{k\ell }-\frac{\kappa _0 \mu _0+ n_k d_\ell \bar{x}_{k\ell }}{\kappa _0+n_k d_\ell } \right) ^2\!-\! \frac{(\kappa _0 n_k d_\ell )(\bar{x}_{k\ell }-\mu _0)^2}{2\sigma ^2(\kappa _0 + n_k d_\ell )} \right) \\&\quad = \prod _{k,\ell } \sigma ^{-1} \exp \left( -\frac{\kappa _0+n_k d_\ell }{2\sigma ^2} \left( \mu _{k\ell }-\frac{\kappa _0 \mu _0+ n_k d_\ell \bar{x}_{k\ell }}{\kappa _0+n_k d_\ell } \right) ^2 \right) \\&\qquad \times (\sigma ^2)^{-(\frac{\nu _0+nd}{2}+1)} \exp \left( - \frac{\nu _0 \sigma ^2_0+(nd-gm) s^{2\star }_{w} + \sum _{k,\ell }\frac{\kappa _0 n_k d_\ell }{\kappa _0 + n_k d_\ell } (\bar{x}_{k\ell }-\mu _0)^2}{2\sigma ^2} \right) \end{aligned}$$

This posterior density factorizes as:

$$\begin{aligned} p(\varvec{\mu },\sigma ^2| \mathbf {X})=p(\varvec{\mu }| \sigma ^2, \mathbf {X}) p(\sigma ^2| \mathbf {X}) \end{aligned}$$

Thus, the posterior distribution follows (using the conditional independence of the \(\mu _{k \ell }\) given \(\sigma ^2\), apparent in the factorization above):

$$\begin{aligned}&p(\varvec{\alpha }| \mathbf {X},\mathbf {z},\mathbf {w},M) = \prod _{k,\ell } \left\{ N \left( \mu _{k \ell }; \frac{\kappa _0 \mu _0 + n_k d_{\ell } \bar{x}_{k\ell }}{\kappa _0+n_k d_{\ell }}, \frac{\sigma ^2}{\kappa _0+n_k d_{\ell }} \right) \right\} \\&\quad \times \, \text {Inv}\!-\!\chi ^2 \left( \sigma ^2; \nu _0+n d , \frac{\nu _0 \sigma ^2_0+(nd-gm)s^{2\star }_{w}+\sum _{k,\ell }\frac{n_k d_{\ell } \kappa _0}{n_k d_{\ell } +\kappa _0}(\bar{x}_{k \ell }-\mu _0)^2}{\nu _0+nd} \right) \! . \end{aligned}$$
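
The corresponding updates for the shared variance pool the within-block sums of squares across all blocks; a minimal sketch (names are ours), assuming blocks is the list of the \(g \times m\) NumPy arrays of block entries:

    def shared_variance_posterior(blocks, mu0, kappa0, nu0, sigma0_sq):
        # Posterior scale of the shared sigma^2, following the Inv-chi^2
        # factor displayed above
        N_tot = sum(b.size for b in blocks)   # n * d
        # pooled within-block sum of squares, i.e. (nd - gm) * s^{2*}_w
        ss_within = sum(((b - b.mean()) ** 2).sum() for b in blocks)
        shrink = sum(b.size * kappa0 / (kappa0 + b.size)
                     * (b.mean() - mu0) ** 2 for b in blocks)
        nu_n = nu0 + N_tot
        sigma_n_sq = (nu0 * sigma0_sq + ss_within + shrink) / nu_n
        return nu_n, sigma_n_sq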

For the terms related to the proportions, when the proportions are free, we assume a symmetric Dirichlet prior distribution with parameters \((\delta _0,\ldots ,\delta _0)\) for the row and column parameters \((\varvec{\pi },\varvec{\rho })\), so that:

$$\begin{aligned} p(\mathbf {z}|M)&= \int _{\mathcal {P}} \pi _1^{n_1} \cdots \pi _g^{n_g} \frac{\varGamma (g \delta _0)}{\varGamma ( \delta _0)\cdots \varGamma (\delta _0)} \mathbf {1}_{\sum _k \pi _k=1} d\varvec{\pi }, \\&= \frac{\varGamma ( g\delta _0)}{\varGamma ( \delta _0)^{g}} \frac{\varGamma ( \delta _0 + n_1)\cdots \varGamma (\delta _0+ n_g)}{\varGamma (n + g \delta _0)}. \end{aligned}$$

More details are given by Biernacki et al. (1998).

Cite this article

Lomet, A., Govaert, G. & Grandvalet, Y. Model selection for Gaussian latent block clustering with the integrated classification likelihood. Adv Data Anal Classif 12, 489–508 (2018). https://doi.org/10.1007/s11634-013-0161-3