Bayesian nonparametric clustering for large data sets

Zuanetti, Daiane Aparecida; Müller, Peter; Zhu, Yitan; Yang, Shengjie; Ji, Yuan

doi:10.1007/s11222-018-9803-9

Published: 12 February 2018

Volume 29, pages 203–215 (2019)

Statistics and Computing

Daiane Aparecida Zuanetti (ORCID: orcid.org/0000-0003-1591-959X)¹,
Peter Müller²,
Yitan Zhu³,
Shengjie Yang³ &
Yuan Ji⁴

Abstract

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. Under simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”
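
The second scheme depends on local clusters being summarized by sufficient statistics, so that the global step never revisits the raw observations. The sketch below illustrates only that additivity property. It is not the authors' implementation: it assumes univariate Gaussian clusters (the abstract does not state the likelihood), and the `ClusterStats` class, its fields, and the toy values are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ClusterStats:
    """Sufficient statistics of a univariate Gaussian cluster (hypothetical helper)."""
    n: int      # number of observations
    s1: float   # sum of observations
    s2: float   # sum of squared observations

    @classmethod
    def from_data(cls, xs: Sequence[float]) -> "ClusterStats":
        return cls(len(xs), sum(xs), sum(x * x for x in xs))

    def merge(self, other: "ClusterStats") -> "ClusterStats":
        # Sufficient statistics are additive, so merging two local clusters
        # needs only their summaries, not the underlying data.
        return ClusterStats(self.n + other.n, self.s1 + other.s1, self.s2 + other.s2)

    @property
    def mean(self) -> float:
        return self.s1 / self.n

    @property
    def variance(self) -> float:
        # Maximum-likelihood (population) variance recovered from the summaries.
        return self.s2 / self.n - self.mean ** 2


# Two local clusters found on different subsamples (toy values, assumed).
shard_a = [1.0, 2.0, 3.0]
shard_b = [2.0, 4.0]

merged = ClusterStats.from_data(shard_a).merge(ClusterStats.from_data(shard_b))
pooled = ClusterStats.from_data(shard_a + shard_b)
```

Here `merged` and `pooled` carry the same summaries, which is why a global clustering step could in principle operate on per-cluster statistics computed in parallel. Deciding which local clusters to merge is the modeling question the paper addresses and is not attempted in this sketch.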

Acknowledgements

D. Zuanetti was supported by CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil. Peter Müller and Yuan Ji are supported in part by NIH/NCI CA 132891-07.

Author information

Authors and Affiliations

Departamento de Estatística, Universidade Federal de São Carlos, São Carlos, SP, Brazil
Daiane Aparecida Zuanetti
Department of Mathematics, UT Austin, Austin, TX, USA
Peter Müller
NorthShore University HealthSystem, Evanston, IL, USA
Yitan Zhu & Shengjie Yang
NorthShore University HealthSystem, Evanston and University of Chicago, Evanston, IL, USA
Yuan Ji

Corresponding author

Correspondence to Daiane Aparecida Zuanetti.

About this article

Cite this article

Zuanetti, D.A., Müller, P., Zhu, Y. et al. Bayesian nonparametric clustering for large data sets. Stat Comput 29, 203–215 (2019). https://doi.org/10.1007/s11222-018-9803-9

Received: 21 April 2017
Accepted: 31 January 2018
Published: 12 February 2018
Issue Date: 15 March 2019
DOI: https://doi.org/10.1007/s11222-018-9803-9

Electronic supplementary material

Supplementary material 1 (R 9 KB)

Supplementary material 2 (R 16 KB)

Supplementary material 3 (pdf 119 KB)
