Abstract
We propose two nonparametric Bayesian methods for clustering big data and apply them to cluster genes by patterns of gene–gene interaction. Both approaches define model-based clustering with nonparametric Bayesian priors and include an implementation that remains feasible for big data. The first method is based on a predictive recursion that requires a single cycle (or a few cycles) of simple deterministic calculations for each observation under study. The second scheme is an exact method that divides the data into smaller subsamples and involves local partitions that can be determined in parallel. In a second step, the method requires only the sufficient statistics of each of these local clusters to derive global clusters. On simulated and benchmark data sets, the proposed methods compare favorably with other clustering algorithms, including k-means, DP-means, DBSCAN, SUGS, streaming variational Bayes and an EM algorithm. We apply the proposed approaches to cluster a large data set of gene–gene interactions extracted from the online search tool “Zodiac.”
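The role of sufficient statistics in the second step can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a univariate Gaussian sampling model, and the class `SuffStats` and the two shards are hypothetical names and toy values chosen for illustration. It shows only that local clusters found on separate subsamples can be combined without revisiting the raw observations, because the count, sum and sum of squares add up.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class SuffStats:
    """Sufficient statistics of one cluster under a univariate Gaussian
    model (an assumption made for this sketch only)."""
    n: int            # number of observations
    total: float      # sum of x
    total_sq: float   # sum of x^2

    @classmethod
    def from_data(cls, xs: Iterable[float]) -> "SuffStats":
        xs = list(xs)
        return cls(len(xs), sum(xs), sum(x * x for x in xs))

    def merge(self, other: "SuffStats") -> "SuffStats":
        # Merging two clusters only needs their statistics, not their data.
        return SuffStats(self.n + other.n,
                         self.total + other.total,
                         self.total_sq + other.total_sq)

    @property
    def mean(self) -> float:
        return self.total / self.n

    @property
    def variance(self) -> float:
        # Population variance (divides by n, not n - 1).
        return self.total_sq / self.n - self.mean ** 2


# Two local clusters, each summarised on a different subsample (toy values).
shard_a = [1.0, 2.0, 3.0]
shard_b = [10.0, 12.0]

merged = SuffStats.from_data(shard_a).merge(SuffStats.from_data(shard_b))
pooled = SuffStats.from_data(shard_a + shard_b)  # same result from raw data
```

In this sketch `merged` and `pooled` carry the same count, mean and variance, which is why a global step can work from per-cluster summaries alone. Whether two local clusters should in fact be merged is a separate modelling decision that the sketch does not address.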
Acknowledgements
D. Zuanetti was supported by CAPES - Coordenação de Aperfeiçoamento de Pessoal de Nível Superior - Brazil. Peter Müller and Yuan Ji are supported in part by NIH/NCI CA 132891-07.
About this article
Cite this article
Zuanetti, D.A., Müller, P., Zhu, Y. et al. Bayesian nonparametric clustering for large data sets. Stat Comput 29, 203–215 (2019). https://doi.org/10.1007/s11222-018-9803-9