Abstract
Genome-wide association studies have revolutionized the search for genetic influences on common genetic diseases such as diabetes, obesity, asthma, cardiovascular diseases and some cancers. Population aging and rising health care costs call for further work on scalable and efficient tools. The high dimensionality and complexity of genetic data hinder the detection of genetic associations. To decrease the risks of missing the causal factor and of discovering spurious associations, machine learning offers an attractive alternative framework to classical statistical approaches. A novel class of probabilistic graphical models (PGMs), the forest of latent tree models (FLTMs), has recently been proposed to reach a trade-off between faithful modeling of data dependences and tractability. In this chapter, we assess the potential of this model to detect genotype-phenotype associations. The FLTM-based contribution is first put into the perspective of PGM-based works meant to model the dependences in genetic data; then the contribution is considered from the technical viewpoint of LTM learning, with scalability as the key objective. We then present the systematic and comprehensive evaluation conducted to assess the ability of the FLTM model to detect genetic associations through latent variables. Realistic simulations were performed under various controlled conditions. In this context, we present a procedure tailored to correct for multiple testing. We also show and discuss results obtained on real data. Besides guaranteeing data dimension reduction through latent variables, the FLTM model is shown empirically to capture indirect genetic associations with the disease: strong associations are observed between the disease and the ancestor nodes of the causal genetic marker node in the forest, whereas very weak associations are obtained for the other latent variables.
Finally, we discuss the prospects of the model for association detection at genome scale.
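As an illustration of testing latent variables against the disease status while correcting for multiple testing, the following is a minimal sketch only. The abstract does not describe the chapter's tailored procedure, so the code assumes a generic max-statistic permutation scheme, which is a swapped-in standard technique. The function names, the toy data and the variable names `H_linked` and `H_unlinked` are hypothetical.

```python
import random
from collections import Counter


def chi2_stat(values, labels):
    """Pearson chi-square statistic for the contingency table of two categorical sequences."""
    n = len(values)
    joint = Counter(zip(values, labels))
    row = Counter(values)
    col = Counter(labels)
    stat = 0.0
    for v in row:
        for c in col:
            expected = row[v] * col[c] / n  # never zero: v and c both occur
            observed = joint.get((v, c), 0)
            stat += (observed - expected) ** 2 / expected
    return stat


def max_stat_permutation_pvalues(variables, phenotype, n_perm=200, seed=0):
    """Family-wise adjusted p-values from a max-statistic permutation scheme.

    ASSUMPTION: generic correction, not the procedure described in the chapter.
    variables: dict mapping a (hypothetical) latent variable name to its list
               of categorical states, one per individual.
    phenotype: list of case/control labels, one per individual.
    """
    rng = random.Random(seed)
    observed = {name: chi2_stat(vals, phenotype) for name, vals in variables.items()}
    exceed = {name: 0 for name in variables}
    shuffled = list(phenotype)
    for _ in range(n_perm):
        # Shuffling the phenotype breaks any association while keeping the
        # dependences among the tested variables intact.
        rng.shuffle(shuffled)
        max_null = max(chi2_stat(vals, shuffled) for vals in variables.values())
        for name, stat in observed.items():
            if max_null >= stat:
                exceed[name] += 1
    return {name: (exceed[name] + 1) / (n_perm + 1) for name in variables}


# Toy data (hypothetical): 50 cases followed by 50 controls.
phenotype = [1] * 50 + [0] * 50
toy = {
    "H_linked": [1] * 45 + [0] * 5 + [0] * 45 + [1] * 5,  # tracks the phenotype
    "H_unlinked": [0, 1] * 50,                            # independent of the phenotype
}
adjusted = max_stat_permutation_pvalues(toy, phenotype)
```

In this toy setting the variable that tracks the phenotype receives a small adjusted p-value and the unrelated one does not, which mirrors the contrast the abstract reports between ancestors of the causal marker and the other latent variables.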
© 2013 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Sinoquet, C., Mourad, R., Leray, P. (2013). Forests of Latent Tree Models to Decipher Genotype-Phenotype Associations. In: Gabriel, J., et al. Biomedical Engineering Systems and Technologies. BIOSTEC 2012. Communications in Computer and Information Science, vol 357. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-38256-7_8
DOI: https://doi.org/10.1007/978-3-642-38256-7_8
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-38255-0
Online ISBN: 978-3-642-38256-7
eBook Packages: Computer Science, Computer Science (R0)