Large-scale correlation mining for biomolecular network discovery

Alfred Hero; Bala Rajaratnam

doi:10.1017/CBO9781316162750.016

15 - Large-scale correlation mining for biomolecular network discovery

from Part IV - Big data over biological networks

Published online by Cambridge University Press: 18 December 2015

Edited by Shuguang Cui, Alfred O. Hero, III, Zhi-Quan Luo and José M. F. Moura

Alfred Hero: University of Michigan, USA
Bala Rajaratnam: Stanford University, USA
Shuguang Cui: Texas A&M University
Alfred O. Hero, III: University of Michigan, Ann Arbor
Zhi-Quan Luo: University of Minnesota
José M. F. Moura: Carnegie Mellon University, Pennsylvania

Summary

Continuing advances in high-throughput mRNA probing, gene sequencing, and microscopic imaging technology are producing a wealth of biomarker data on many different living organisms and conditions. Scientists hope that increasing amounts of relevant data will eventually lead to better understanding of the network of interactions between the thousands of molecules that regulate these organisms. Thus progress in understanding the biological science has become increasingly dependent on progress in understanding the data science. Data-mining tools have been of particular relevance since they can sometimes be used to effectively separate the “wheat” from the “chaff”, winnowing the massive amount of data down to a few important data dimensions. Correlation mining is a data-mining tool that is particularly useful for probing statistical correlations between biomarkers and recovering properties of their correlation networks. However, since the number of correlations between biomarkers grows quadratically with the number of biomarkers, the scalability of correlation mining in the big data setting becomes an issue. Furthermore, phase transitions govern the correlation mining discoveries, and these must be understood for the discoveries to be reliable and of high confidence. This is especially important at big data scales where the number of samples is fixed and the number of biomarkers becomes unbounded, a sampling regime referred to as the “purely high-dimensional setting”. In this chapter, we will discuss some of the main advances and challenges in correlation mining in the context of large-scale biomolecular networks, with a focus on medicine. A new correlation mining application will be introduced: discovery of correlation sign flips between edges in a pair of correlation or partial correlation networks. The pair of networks could respectively correspond to a disease (or treatment) group and a control group.
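To make the sign-flip idea concrete, the following is a minimal sketch, not the chapter's method: it thresholds the sample correlation matrices of two groups and reports the biomarker pairs that are strongly correlated in both groups but with opposite sign. The function name `sign_flip_edges`, the threshold `rho`, and the toy data are all assumptions introduced here for illustration. The sketch also shows the quadratic scaling, since p biomarkers give p(p-1)/2 candidate pairs.

```python
import numpy as np


def sign_flip_edges(x_a, x_b, rho):
    """Return pairs (i, j), i < j, whose sample correlation has magnitude
    at least rho in both groups but opposite sign between the groups.

    x_a, x_b: arrays of shape (n_samples, p) with the same p biomarkers.
    rho: hypothetical screening threshold in (0, 1).
    """
    # Rows are samples and columns are biomarkers, hence rowvar=False.
    r_a = np.corrcoef(x_a, rowvar=False)
    r_b = np.corrcoef(x_b, rowvar=False)

    strong = (np.abs(r_a) >= rho) & (np.abs(r_b) >= rho)
    flipped = np.sign(r_a) != np.sign(r_b)

    # Strict upper triangle: each of the p(p-1)/2 pairs is visited once.
    upper = np.triu(np.ones_like(strong, dtype=bool), k=1)
    i, j = np.nonzero(strong & flipped & upper)
    return list(zip(i.tolist(), j.tolist()))


# Toy data (assumed, not from the chapter): few samples, more biomarkers.
rng = np.random.default_rng(0)
n, p = 30, 50
x_control = rng.standard_normal((n, p))
x_disease = rng.standard_normal((n, p))

# Biomarkers 0 and 1 are positively coupled in the control group ...
z = rng.standard_normal(n)
x_control[:, 0] = z
x_control[:, 1] = z + 0.1 * rng.standard_normal(n)

# ... and negatively coupled in the disease group.
z = rng.standard_normal(n)
x_disease[:, 0] = z
x_disease[:, 1] = -z + 0.1 * rng.standard_normal(n)

n_pairs = p * (p - 1) // 2  # number of candidate correlations
flips = sign_flip_edges(x_control, x_disease, rho=0.9)
print(n_pairs, flips)
```

A fixed threshold is used here only for simplicity; choosing it so that discoveries are reliable when the number of samples is small relative to the number of biomarkers is the kind of question the chapter's phase-transition discussion addresses.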

Introduction

Data mining at a large scale has matured over the past 50 years to a point where, every minute, millions of searches over billions of data dimensions are routinely handled by search engines at Google, Yahoo, LinkedIn, Facebook, Twitter, and other media. Similarly, large ontological databases like GO [1] and DAVID [2] have enabled large-scale text data mining for researchers in the life sciences [3].

Type: Chapter
Information: Big Data over Networks, pp. 409–436

DOI: https://doi.org/10.1017/CBO9781316162750.016

Publisher: Cambridge University Press

Print publication year: 2016

References

[1] Gene Ontology Consortium, “The Gene Ontology (GO) database and informatics resource,” Nucleic Acids Research, vol. 32, no. suppl 1, pp. D258–D261, 2004.

[2] G. Dennis Jr., B. T. Sherman, D. A. Hosack, et al., “DAVID: database for annotation, visualization, and integrated discovery,” Genome Biol., vol. 4, no. 5, p. P3, 2003.

[3] M. Ashburner, C. A. Ball, J. A. Blake, et al., “Gene ontology: tool for the unification of biology,” Nature Genetics, vol. 25, no. 1, pp. 25–29, 2000.

[4] E. W. Sayers, T. Barrett, D. A. Benson, et al., “Database resources of the National Center for Biotechnology Information,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D38–D51, 2011.

[5] C. F. Schaefer, K. Anthony, S. Krupa, et al., “PID: the pathway interaction database,” Nucleic Acids Research, vol. 37, no. suppl 1, pp. D674–D679, 2009.

[6] E. G. Cerami, B. E. Gross, E. Demir, et al., “Pathway commons, a web resource for biological pathway data,” Nucleic Acids Research, vol. 39, no. suppl 1, pp. D685–D690, 2011. [Online]. Available: http://www.pathwaycommons.org/about/

[7] J. D. Allen, Y. Xie, M. Chen, L. Girard, and G. Xiao, “Comparing statistical methods for constructing large scale gene networks,” PLoS ONE, vol. 7, no. 1, p. e29348, 2012.

[8] C. Jiang, F. Coenen, and M. Zito, “A survey of frequent subgraph mining algorithms,” The Knowledge Engineering Review, vol. 28, no. 01, pp. 75–105, 2013.

[9] A. Hero and B. Rajaratnam, “Large-scale correlation screening,” Journal of the American Statistical Association, vol. 106, no. 496, pp. 1540–1552, 2011.

[10] A. Hero and B. Rajaratnam, “Hub discovery in partial correlation models,” IEEE Transactions on Information Theory, vol. 58, no. 9, pp. 6064–6078, 2012, available as arXiv preprint arXiv:1109.6846.

[11] P. Bühlmann and S. van de Geer, Statistics for High-Dimensional Data: Methods, Theory and Applications, Springer, 2011.

[12] R. Fisher, “On the mathematical foundations of theoretical statistics,” Philosophical Transactions of the Royal Society of London, Series A, vol. 222, pp. 309–368, 1922.

[13] R. Fisher, “Theory of statistical estimation,” Proceedings of the Cambridge Philosophical Society, vol. 22, pp. 700–725, 1925.

[14] C. Rao, “Large sample tests of statistical hypotheses concerning several parameters with applications to problems of estimation,” Mathematical Proceedings of the Cambridge Philosophical Society, vol. 44, pp. 50–57, 1947.

[15] C. Rao, “Criteria of estimation in large samples,” Sankhyā: The Indian Journal of Statistics, Series A, vol. 25, pp. 189–206, 1963.

[16] J. Neyman and E. Pearson, “On the problem of the most efficient tests of statistical hypotheses,” Philosophical Transactions of the Royal Society of London, Series A, vol. 231, pp. 289–337, 1933.

[17] S. Wilks, “The large-sample distribution of the likelihood ratio for testing composite hypotheses,” Annals of Mathematical Statistics, vol. 9, pp. 60–62, 1938.

[18] A. Wald, “Asymptotically most powerful tests of statistical hypotheses,” Annals of Mathematical Statistics, vol. 12, pp. 1–19, 1941.

[19] A. Wald, “Some examples of asymptotically most powerful tests,” Annals of Mathematical Statistics, vol. 12, pp. 396–408, 1941.

[20] A. Wald, “Tests of statistical hypotheses concerning several parameters when the number of observations is large,” Transactions of the American Mathematical Society, vol. 54, pp. 426–482, 1943.

[21] A. Wald, “Note on the consistency of the maximum likelihood estimate,” Annals of Mathematical Statistics, vol. 20, pp. 595–601, 1949.

[22] H. Cramér, Mathematical Methods of Statistics, Princeton, NJ: Princeton University Press, 1946.

[23] H. Cramér, “A contribution to the theory of statistical estimation,” Scandinavian Actuarial Journal, vol. 29, pp. 85–94, 1946.

[24] L. Le Cam, “On some asymptotic properties of maximum likelihood estimates and related Bayes’ estimates,” University of California Publications in Statistics, vol. 1, pp. 277–330, 1953.

[25] L. Le Cam, Asymptotic Methods in Statistical Decision Theory, New York: Springer-Verlag, 1986.

[26] H. Chernoff, “Large-sample theory: parametric case,” Annals of Mathematical Statistics, vol. 27, pp. 1–22, 1956.

[27] J. Kiefer and J. Wolfowitz, “Consistency of the maximum likelihood estimator in the presence of infinitely many incidental parameters,” Annals of Mathematical Statistics, vol. 27, pp. 887–906, 1956.

[28] R. Bahadur, “Rates of convergence of estimates and test statistics,” Annals of Mathematical Statistics, vol. 38, pp. 303–324, 1967.

[29] B. Efron, “Maximum likelihood and decision theory,” Annals of Statistics, vol. 10, pp. 340–356, 1982.

[30] D. Donoho, “For most large underdetermined systems of linear equations the minimal l1-norm solution is also the sparsest solution,” Communications on Pure and Applied Mathematics, vol. 59, pp. 797–829, 2006.

[31] P. Zhao and B. Yu, “On model selection consistency of Lasso,” Journal of Machine Learning Research, vol. 7, pp. 2541–2563, 2006.

[32] N. Meinshausen and P. Bühlmann, “High-dimensional graphs and variable selection with the lasso,” Annals of Statistics, vol. 34, no. 3, pp. 1436–1462, June 2006.

[33] E. Candès and T. Tao, “The Dantzig selector: statistical estimation when p is much larger than n,” Annals of Statistics, vol. 35, pp. 2313–2351, 2007.

[34] P. Bickel, Y. Ritov, and A. Tsybakov, “Simultaneous analysis of Lasso and Dantzig selector,” Annals of Statistics, vol. 37, pp. 1705–1732, 2009.

[35] J. Peng, P. Wang, N. Zhou, and J. Zhu, “Partial correlation estimation by joint sparse regression models,” Journal of the American Statistical Association, vol. 104, no. 486, 2009.

[36] M. Wainwright, “Information-theoretic limitations on sparsity recovery in the high-dimensional and noisy setting,” IEEE Transactions on Information Theory, vol. 55, pp. 5728–5741, 2009.

[37] M. Wainwright, “Sharp thresholds for high-dimensional and noisy sparsity recovery using l1-constrained quadratic programming (Lasso),” IEEE Transactions on Information Theory, vol. 55, pp. 2183–2202, 2009.

[38] K. Khare, S. Oh, and B. Rajaratnam, “A convex pseudo-likelihood framework for high dimensional partial correlation estimation with convergence guarantees,” Journal of the Royal Statistical Society: Series B (Statistical Methodology), to appear, 2014. [Online]. Available: http://arxiv.org/abs/1307.5381

[39] H. Firouzi, A. Hero, and B. Rajaratnam, “Variable selection for ultra high dimensional regression,” Technical Report, University of Michigan and Stanford University, 2014.

[40] B. Mole, “The gene sequencing future is here,” Science News, February 6, 2014. [Online]. Available: https://www.sciencenews.org/article/gene-sequencing-future-here

[41] W., KA, “DNA sequencing costs: data from the NHGRI Genome Sequencing Program (GSP),” August 22, 2014. [Online]. Available: https://www.sciencenews.org/article/gene-sequencing-future-here

[42] A. Zaas, M. Chen, J. Varkey, et al., “Gene expression signatures diagnose influenza and other symptomatic respiratory viral infections in humans,” Cell Host & Microbe, vol. 6, no. 3, pp. 207–217, 2009.

[43] Y. Huang, A. Zaas, A. Rao, et al., “Temporal dynamics of host molecular responses differentiate symptomatic and asymptomatic influenza A infection,” PLoS Genet, vol. 7, no. 8, p. e1002234, 2011.

[44] P. J. Bickel and K. A. Doksum, Mathematical Statistics: Basic Ideas and Selected Topics, Holden-Day, San Francisco, 1977.

[45] H. Jeong, S. P. Mason, A.-L. Barabasi, and Z. N. Oltvai, “Lethality and centrality in protein networks,” Nature, vol. 411, no. 6833, pp. 41–42, May 2001. [Online]. Available: http://dx.doi.org/10.1038/35075138 and http://www.nature.com/nature/journal/v411/n6833/abs/411041a0.html

[46] M. C. Oldham, S. Horvath, and D. H. Geschwind, “Conservation and evolution of gene coexpression networks in human and chimpanzee brains,” Proceedings of the National Academy of Sciences, vol. 103, no. 47, pp. 17973–17978, November 2006. [Online]. Available: http://www.pnas.org/content/103/47/17973.abstract

[47] P. Langfelder and S. Horvath, “WGCNA: an R package for weighted correlation network analysis,” BMC Bioinformatics, vol. 9, no. 1, p. 559, January 2008. [Online]. Available: http://www.biomedcentral.com/1471-2105/9/559

[48] A. Li and S. Horvath, “Network neighborhood analysis with the multi-node topological overlap measure,” Bioinformatics (Oxford, England), vol. 23, no. 2, pp. 222–31, January 2007. [Online]. Available: http://bioinformatics.oxfordjournals.org/cgi/content/abstract/23/2/222

[49] L. Wu, C. Zhang, and J. Zhang, “HMBOX1 negatively regulates NK cell functions by suppressing the NKG2D/DAP10 signaling pathway,” Cellular & Molecular Immunology, vol. 8, no. 5, pp. 433–440, 2011.

[50] A. Y. Istomin and A. Godzik, “Understanding diversity of human innate immunity receptors: analysis of surface features of leucine-rich repeat domains in NLRs and TLRs,” BMC Immunology, vol. 10, no. 1, p. 48, 2009.

[51] S. L. Lauritzen, Graphical Models, Oxford University Press, 1996.

[52] A. Dempster, “Covariance selection,” Biometrics, vol. 28, no. 1, pp. 157–175, 1972.

[53] J. Friedman, T. Hastie, and R. Tibshirani, “Sparse inverse covariance estimation with the graphical lasso,” Biostatistics, vol. 9, no. 3, pp. 432–441, 2008.

[54] B. Rajaratnam, H. Massam, and C. Carvalho, “Flexible covariance estimation in graphical Gaussian models,” Annals of Statistics, vol. 36, pp. 2818–2849, 2008.

[55] K. Khare and B. Rajaratnam, “Wishart distributions for decomposable covariance graph models,” The Annals of Statistics, vol. 39, no. 1, pp. 514–555, Mar. 2011. [Online]. Available: http://projecteuclid.org/euclid.aos/1297779855

[56] A. J. Rothman, P. Bickel, E. Levina, and J. Zhu, “Sparse permutation invariant covariance estimation,” Electronic Journal of Statistics, vol. 2, pp. 494–515, 2008.

[57] P. Bickel and E. Levina, “Covariance regularization via thresholding,” Annals of Statistics, vol. 36, no. 6, pp. 2577–2604, 2008.

[58] O. Banerjee, L. E. Ghaoui, and A. d'Aspremont, “Model selection through sparse maximum likelihood estimation for multivariate Gaussian or binary data,” Journal of Machine Learning Research, vol. 9, pp. 485–516, March 2008.

[59] C.-J. Hsieh, M. A. Sustik, I. Dhillon, P. Ravikumar, and R. Poldrack, “Big & quick: sparse inverse covariance estimation for a million variables,” in Advances in Neural Information Processing Systems, 2013, pp. 3165–3173.

[60] D. Guillot, B. Rajaratnam, B. T. Rolfs, A. Maleki, and I. Wong, “Iterative thresholding algorithm for sparse inverse covariance estimation,” in Advances in Neural Information Processing Systems 25, 2012. [Online]. Available: http://arxiv.org/abs/1211.2532

[61] O. Dalal and B. Rajaratnam, “G-AMA: sparse Gaussian graphical model estimation via alternating minimization,” Technical Report, Department of Statistics, Stanford University (in revision), 2014. [Online]. Available: http://arxiv.org/abs/1405.3034

[62] G. Rocha, P. Zhao, and B. Yu, “A path following algorithm for Sparse Pseudo-Likelihood Inverse Covariance Estimation (SPLICE),” Statistics Department, UC Berkeley, Berkeley, CA, Tech. Rep., 2008. [Online]. Available: http://www.stat.berkeley.edu/~binyu/ps/rocha.pseudo.pdf

[63] S. Oh, O. Dalal, K. Khare, and B. Rajaratnam, “Optimization methods for sparse pseudolikelihood graphical model selection,” in Advances in Neural Information Processing Systems 27, 2014.

[64] G. Marjanovic and A. O. Hero III, “On lq estimation of sparse inverse covariance,” in Proceedings of IEEE Conference on Acoustics, Speech and Signal Processing (ICASSP), Florence, May 2014.

[65] G. Marjanovic and A. O. Hero III, “l0 sparse inverse covariance estimation,” arXiv preprint arXiv:1408.0850, 2014.

[66] T. Tsiligkaridis, A. Hero, and S. Zhou, “Convergence properties of Kronecker Graphical Lasso algorithms,” IEEE Transactions on Signal Processing (also available as arXiv:1204.0585), vol. 61, no. 7, pp. 1743–1755, 2013.

[67] R. Gill, S. Datta, and S. Datta, “A statistical framework for differential network analysis from microarray data,” BMC Bioinformatics, vol. 11, no. 1, p. 95, 2010.

[68] N. Kramer, J. Schafer, and A.-L. Boulesteix, “Regularized estimation of large-scale gene association networks using graphical Gaussian models,” BMC Bioinformatics, vol. 10, no. 384, pp. 1–24, 2009.

[69] V. Pihur, S. Datta, and S. Datta, “Reconstruction of genetic association networks from microarray data: a partial least squares approach,” Bioinformatics, vol. 24, no. 4, p. 561, 2008.

[70] D. Mount and S. Arya, “Approximate nearest neighbor code,” http://www.cs.umd.edu/~mount/ANN.

[71] J. Schäfer and K. Strimmer, “An empirical Bayes approach to inferring large-scale gene association networks,” Bioinformatics, vol. 21, no. 6, pp. 754–764, 2005.

[72] J. Friedman, T. Hastie, and R. Tibshirani, “Applications of the lasso and grouped lasso to the estimation of sparse graphical models,” 2010. [Online]. Available: http://www-stat.stanford.edu/~tibs/research.html

[73] J. Lee and T. Hastie, “Learning the structure of mixed graphical models,” Journal of Computational and Graphical Statistics, vol. 24, pp. 230–253, 2014.

[74] K. Sricharan, A. Hero, and B. Rajaratnam, “A local dependence measure and its application to screening for high correlations in large data sets,” in Information Fusion (FUSION), 2011 Proceedings of the 14th International Conference on, IEEE, 2011, pp. 1–8.

[75] H. Firouzi, D. Wei, and A. Hero, “Spectral correlation hub screening of multivariate time series,” in Excursions in Harmonic Analysis: The February Fourier Talks at the Norbert Wiener Center, R. Balan, M. Begué, J. J. Benedetto, W. Czaja, and K. Okoudjou, Eds., Springer, 2014.

[76] B. He, R. Baird, R. Butera, A. Datta, et al., “Grand challenges in interfacing engineering with life sciences and medicine,” IEEE Transactions on Bio-Medical Engineering (BME), vol. 4, no. 4, 2013.

[77] R. Chen, G. I. Mias, J. Li-Pook-Than, et al., “Personal omics profiling reveals dynamic molecular and medical phenotypes,” Cell, vol. 148, no. 6, pp. 1293–1307, 2012.

[78] J. J. McCarthy, H. L. McLeod, and G. S. Ginsburg, “Genomic medicine: a decade of successes, challenges, and opportunities,” Science Translational Medicine, vol. 5, no. 189, p. 189sr4, 2013.

[79] J. T. Erler and R. Linding, “Network medicine strikes a blow against breast cancer,” Cell, vol. 149, no. 4, pp. 731–733, 2012.

[80] D. B. Boivin, F. O. James, A. Wu, et al., “Circadian clock genes oscillate in human peripheral blood mononuclear cells,” Blood, vol. 102, no. 12, pp. 4143–4145, 2003.

[81] H. Firouzi, D. Wei, and A. Hero, “Spatio-temporal analysis of Gaussian WSS processes via complex correlation and partial correlation screening,” in Proceedings of IEEE GlobalSIP Conference, also available as arXiv:1303.2378, 2013.

[82] J. J. Eady, G. M. Wortley, Y. M. Wormstone, et al., “Variation in gene expression profiles of peripheral blood mononuclear cells from healthy volunteers,” Physiological Genomics, vol. 22, no. 3, pp. 402–411, 2005.

[83] A. R. Whitney, M. Diehn, S. J. Popper, et al., “Individuality and variation in gene expression patterns in human blood,” Proceedings of the National Academy of Sciences, vol. 100, no. 4, pp. 1896–1901, 2003.

[84] H. Firouzi, A. Hero, and B. Rajaratnam, “Predictive correlation screening: application to two-stage predictor design in high dimension,” in Proceedings of AISTATS, also available as arXiv:1303.2378, 2013.

[85] N. Katenka, E. D. Kolaczyk, et al., “Inference and characterization of multi-attribute networks with application to computational biology,” The Annals of Applied Statistics, vol. 6, no. 3, pp. 1068–1094, 2012.

[86] S. Zhou, “Gemini: graph estimation with matrix variate normal instances,” The Annals of Statistics, vol. 42, no. 2, pp. 532–562, 2014.

[87] P. Langfelder and S. Horvath, “WGCNA: an R package for weighted correlation network analysis,” BMC Bioinformatics, vol. 9, no. 1, p. 559, 2008.

[88] D. Zhu, A. Hero, H. Cheng, R. Kanna, and A. Swaroop, “Network constrained clustering for gene microarray data,” Bioinformatics, vol. 21, no. 21, pp. 4014–4021, 2005.

[89] A. Rao and A. O. Hero, “Biological pathway inference using manifold embedding,” in Acoustics, Speech and Signal Processing (ICASSP), 2011 IEEE International Conference on, IEEE, 2011, pp. 5992–5995.
