Bayesian nonparametric clustering for large data sets

Statistics and Computing

Abstract

We propose two nonparametric Bayesian methods to cluster big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion which requires a single cycle (or a few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. On simulated and benchmark data sets the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”
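
The idea that global clusters can be derived from the sufficient statistics of local clusters alone can be illustrated with a small sketch. It is not the authors' algorithm: it assumes Gaussian clusters, takes the local partitions as given, and replaces the model-based second step with a hypothetical distance threshold (`combine_local_clusters` and `threshold` are illustrative names). It shows only that counts, sums and sums of outer products are additive, so merged cluster summaries match those of the pooled data without revisiting the raw observations.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class ClusterStats:
    """Sufficient statistics of a Gaussian cluster."""
    n: int            # number of observations
    s: np.ndarray     # sum of observations, shape (d,)
    ss: np.ndarray    # sum of outer products x x^T, shape (d, d)

    @classmethod
    def from_points(cls, x: np.ndarray) -> "ClusterStats":
        return cls(n=x.shape[0], s=x.sum(axis=0), ss=x.T @ x)

    def merge(self, other: "ClusterStats") -> "ClusterStats":
        # The statistics are additive, so no raw data is needed here.
        return ClusterStats(self.n + other.n, self.s + other.s, self.ss + other.ss)

    @property
    def mean(self) -> np.ndarray:
        return self.s / self.n

    @property
    def cov(self) -> np.ndarray:
        # Maximum-likelihood covariance (divides by n, not n - 1).
        m = self.mean
        return self.ss / self.n - np.outer(m, m)


def combine_local_clusters(local, threshold):
    """Greedily merge local clusters whose means lie within `threshold`.

    Hypothetical stand-in for a model-based global clustering step.
    """
    merged = []
    for c in local:
        for i, g in enumerate(merged):
            if np.linalg.norm(c.mean - g.mean) < threshold:
                merged[i] = g.merge(c)
                break
        else:
            merged.append(c)
    return merged


# Simulated data: two well-separated groups split across two shards.
rng = np.random.default_rng(0)
a = rng.normal(loc=[0.0, 0.0], scale=0.5, size=(200, 2))
b = rng.normal(loc=[5.0, 5.0], scale=0.5, size=(200, 2))

# Local partitions are assumed known; each shard would be processed in parallel.
local = [ClusterStats.from_points(p) for p in (a[:100], b[:100], a[100:], b[100:])]
global_clusters = combine_local_clusters(local, threshold=2.0)
```

In this toy setting the two halves of each group are merged back together, and the mean and covariance of each merged cluster agree with those computed from the pooled points up to floating-point error.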

References

  • Arbel, J., Lijoi, A., Nipoti, B.: Bayesian survival model based on moment characterization. In: Frühwirth-Schnatter, S., Bitto, A., Kastner, G., Posekany, A. (eds.) Bayesian Statistics from Methods to Models and Applications, pp. 3–14. Springer, Cham (2015)

  • Blackwell, D., MacQueen, J.B.: Ferguson distributions via Pólya urn schemes. Ann. Stat. 1, 353–355 (1973)

  • Bouchard-Côté, A., Vollmer, S.J., Doucet, A.: The bouncy particle sampler: a non-reversible rejection-free Markov chain Monte Carlo method. J. Am. Stat. Assoc. (2017). https://doi.org/10.1080/01621459.2017.1294075

  • Caliński, T., Harabasz, J.: A dendrite method for cluster analysis. Commun. Stat. Theory Methods 3(1), 1–27 (1974)

  • Cortez, P., Cerdeira, A., Almeida, F., Matos, T., Reis, J.: Modeling wine preferences by data mining from physicochemical properties. Decision Support Syst. 47(4), 547–553 (2009)

  • Dahl, D.B.: Model-based clustering for expression data via a Dirichlet process mixture model. In: Vannucci, M., Do, K.A., Müller, P. (eds.) Bayesian Inference for Gene Expression and Proteomics, pp. 201–218. Cambridge University Press, Cambridge (2006)

  • Ester, M., Kriegel, H.P., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. Knowl. Discov. Databases 96, 226–231 (1996)

  • Fisher, R.A.: The use of multiple measurements in taxonomic problems. Ann. Eugen. 7(2), 179–188 (1936)

  • Fraley, C., Raftery, A.E.: Model-based clustering, discriminant analysis, and density estimation. J. Am. Stat. Assoc. 97(458), 611–631 (2002)

  • Fraley, C., Raftery, A.E.: Bayesian regularization for normal mixture estimation and model-based clustering. J. Classif. 24(2), 155–181 (2007)

  • Ge, H., Chen, Y., Wan, M., Ghahramani, Z.: Distributed inference for Dirichlet process mixture models. In: Bach, F., Blei, D. (eds.) Proceedings of the 32nd International Conference on Machine Learning, Proceedings of Machine Learning Research, vol. 37, pp. 2276–2284. PMLR, Lille, France (2015)

  • Gelfand, A.E., Dey, D.K.: Bayesian model choice: asymptotics and exact calculations. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 56, 501–514 (1994)

  • Ghoshal, S.: The Dirichlet process, related priors and posterior asymptotics. In: Hjort, N.L., Holmes, C., Müller, P., Walker, S.G. (eds.) Bayesian Nonparametrics, pp. 22–34. Cambridge University Press, Cambridge (2010)

  • Guha, S., Meyerson, A., Mishra, N., Motwani, R., O’Callaghan, L.: Clustering data streams: theory and practice. IEEE Trans. Knowl. Data Eng. 15(3), 515–528 (2003)

  • Hennig, C.: Methods for merging Gaussian mixture components. Adv. Data Anal. Classif. 4(1), 3–34 (2010)

  • Huang, Z., Gelman, A.: Sampling for Bayesian computation with large datasets. Available at SSRN 1010107 (2005)

  • Jain, A.K.: Data clustering: 50 years beyond k-means. Pattern Recognit. Lett. 31(8), 651–666 (2010)

  • Kulis, B., Jordan, M.I.: Revisiting k-means: new algorithms via Bayesian nonparametrics. In: Langford, J., Pineau, J. (eds.) Proceedings of the 29th International Conference on Machine Learning (ICML-12), pp. 513–520. ACM, New York, NY, USA (2012)

  • Lin, D.: Online learning of nonparametric mixture models via sequential variational approximation. In: Proceedings of the 26th International Conference on Neural Information Processing Systems, NIPS’13, pp. 395–403. Curran Associates Inc., USA (2013)

  • MacEachern, S.N., Clyde, M., Liu, J.S.: Sequential importance sampling for nonparametric Bayes models: the next generation. Can. J. Stat. 27(2), 251–267 (1999)

  • Mitra, R., Müller, P., Liang, S., Yue, L., Ji, Y.: A Bayesian graphical model for ChIP-seq data on histone modifications. J. Am. Stat. Assoc. 108(501), 69–80 (2013)

  • Newton, M.A., Quintana, F.A., Zhang, Y.: Nonparametric Bayes methods using predictive updating. In: Dey, D., Müller, P., Sinha, D. (eds.) Practical Nonparametric and Semiparametric Bayesian Statistics, pp. 45–61. Springer, New York (1998)

  • Pennell, M.L., Dunson, D.B.: Fitting semiparametric random effects models to large data sets. Biostatistics 8(4), 821–834 (2007)

  • Pettit, L.: The conditional predictive ordinate for the normal distribution. J. R. Stat. Soc. Ser. B (Stat. Methodol.) 52, 175–184 (1990)

  • Scott, S.L., Blocker, A.W., Bonassi, F.V., Chipman, H.A., George, E.I., McCulloch, R.E.: Bayes and big data: the consensus Monte Carlo algorithm. Int. J. Manag. Sci. Eng. Manag. 11(2), 78–88 (2016)

  • Tank, A., Foti, N., Fox, E.: Streaming variational inference for Bayesian nonparametric mixture models. In: Lebanon, G., Vishwanathan, S.V.N. (eds.) Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, Proceedings of Machine Learning Research, vol. 38, pp. 968–976. PMLR, San Diego, California, USA (2015)

  • Thorndike, R.L.: Who belongs in the family? Psychometrika 18(4), 267–276 (1953)

  • Wang, L., Dunson, D.B.: Fast Bayesian inference in Dirichlet process mixture models. J. Comput. Graph. Stat. 20(1), 196–216 (2011)

  • Williamson, S.A., Dubey, A., Xing, E.P.: Parallel Markov chain Monte Carlo for nonparametric mixture models. In: Proceedings of the 30th International Conference on Machine Learning, ICML’13, vol. 28, pp. I-98–I-106. JMLR.org (2013)

  • Xu, R., Wunsch, D., et al.: Survey of clustering algorithms. IEEE Trans. Neural Netw. 16(3), 645–678 (2005)

  • Zhao, W., Ma, H., He, Q.: Parallel k-means clustering based on MapReduce. In: Jaatun, M.G., Zhao, G., Rong, C. (eds.) Cloud Computing, pp. 674–679. Springer, Berlin (2009)

  • Zhu, Y., Xu, Y., Helseth, D.L., Gulukota, K., Yang, S., Pesce, L.L., Mitra, R., Müller, P., Sengupta, S., Guo, W., et al.: Zodiac: A comprehensive depiction of genetic interactions in cancer by integrating TCGA data. J. Natl. Cancer Inst. 107(8), 1–9 (2015)

Acknowledgements

D. Zuanetti was supported by CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil. Peter Müller and Yuan Ji are supported in part by NIH/NCI CA 132891-07.

Author information

Corresponding author

Correspondence to Daiane Aparecida Zuanetti.

Cite this article

Zuanetti, D.A., Müller, P., Zhu, Y. et al. Bayesian nonparametric clustering for large data sets. Stat Comput 29, 203–215 (2019). https://doi.org/10.1007/s11222-018-9803-9

  • DOI: https://doi.org/10.1007/s11222-018-9803-9
