An Adaptive and Memory Efficient Algorithm for Genotype Imputation

Kang, Hyun Min; Zaitlen, Noah A.; Han, Buhm; Eskin, Eleazar

doi:10.1007/978-3-642-02008-7_34


Part of the book series: Lecture Notes in Computer Science (LNBI, volume 5541)

Included in the following conference series:

Annual International Conference on Research in Computational Molecular Biology


Abstract

Genome-wide association studies have proven to be a highly successful method for identifying genetic loci for complex phenotypes in both humans and model organisms. These large-scale studies rely on the collection of hundreds of thousands of single nucleotide polymorphisms (SNPs) across the genome, yet standard high-throughput genotyping technologies capture only a fraction of the total genetic variation. Recent efforts have shown that it is possible to “impute” with high accuracy the genotypes of SNPs that are not collected in the study, provided that they are present in a reference data set that contains both the SNPs collected in the study and other SNPs. We introduce a novel HMM-based technique for the imputation problem that addresses several shortcomings of existing methods. First, our method is adaptive: it estimates population genetic parameters from the data, so it can be applied to model organisms that have very different evolutionary histories. Compared to traditional methods, our method is up to ten times more accurate on model organisms such as mouse. Second, the memory usage of our algorithm scales with the number of collected markers rather than the number of known SNPs. This matters because of the size of the reference data sets currently being generated. We compare our method with existing methods on mouse and human data sets and show that it has comparable or better performance and much lower memory usage. The method is available for download at http://genetics.cs.ucla.edu/eminim.
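
To make the idea of HMM-based imputation concrete, here is a minimal sketch of a generic haplotype-copying HMM (in the style of the Li and Stephens model), not the algorithm of this paper. It assumes haploid 0/1 data, and the function name `impute_haplotype` and the fixed `switch` and `error` rates are hypothetical choices for illustration; the paper's method estimates its population genetic parameters from the data. The sketch also stores forward and backward values for every SNP, so it does not reproduce the memory behavior described in the abstract.

```python
def impute_haplotype(reference, target, switch=0.05, error=0.01):
    """Posterior probability of allele 1 at every SNP of one target haplotype.

    reference: list of K haplotypes, each a list of 0/1 alleles over M SNPs.
    target:    list of length M; 0/1 at typed SNPs, None at untyped SNPs.
    switch, error: assumed fixed rates for switching the copied haplotype
                   between adjacent SNPs and for allele mismatch.
    """
    K, M = len(reference), len(target)

    def emit(k, m):
        # An untyped SNP carries no information about the hidden state.
        if target[m] is None:
            return 1.0
        return 1.0 - error if reference[k][m] == target[m] else error

    def normalize(values):
        total = sum(values)
        return [v / total for v in values]

    # Forward pass: hidden state k means "copying reference haplotype k".
    fwd = [normalize([emit(k, 0) / K for k in range(K)])]
    for m in range(1, M):
        prev = fwd[-1]  # sums to 1, so a switch adds switch / K to each state
        fwd.append(normalize(
            [((1.0 - switch) * prev[k] + switch / K) * emit(k, m)
             for k in range(K)]))

    # Backward pass, rescaled at each SNP to avoid underflow.
    bwd = [[1.0] * K for _ in range(M)]
    for m in range(M - 2, -1, -1):
        nxt = [bwd[m + 1][k] * emit(k, m + 1) for k in range(K)]
        jump = switch * sum(nxt) / K
        bwd[m] = normalize([(1.0 - switch) * nxt[k] + jump for k in range(K)])

    # Combine both passes; sum the posterior mass of states carrying allele 1.
    dosages = []
    for m in range(M):
        post = normalize([fwd[m][k] * bwd[m][k] for k in range(K)])
        dosages.append(sum(p for k, p in enumerate(post)
                           if reference[k][m] == 1))
    return dosages


if __name__ == "__main__":
    reference_panel = [[0, 0, 0, 0], [1, 1, 1, 1]]  # toy reference haplotypes
    study_haplotype = [1, None, 1, None]            # SNPs 1 and 3 are untyped
    print(impute_haplotype(reference_panel, study_haplotype))
```

In this toy input the typed alleles match the second reference haplotype, so the posterior at the untyped SNPs leans toward allele 1. A real analysis would work with diploid genotypes and position-dependent switch rates.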



Author information

Authors and Affiliations

Computer Science and Engineering, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0404, USA
Hyun Min Kang & Buhm Han
Bioinformatics Program, University of California, San Diego, 9500 Gilman Drive, La Jolla, CA 92093-0419, USA
Noah A. Zaitlen
Department of Computer Science and Department of Human Genetics, University of California, Los Angeles, 3532-J Boelter Hall, Los Angeles, CA 90095-1596, USA
Eleazar Eskin


Editor information

Editors and Affiliations

Computer Science Department, James H. Clark Center, 318 Campus Drive, RM S266, Stanford, CA 94305-5428, USA
Serafim Batzoglou


Cite this paper

Kang, H.M., Zaitlen, N.A., Han, B., Eskin, E. (2009). An Adaptive and Memory Efficient Algorithm for Genotype Imputation. In: Batzoglou, S. (eds) Research in Computational Molecular Biology. RECOMB 2009. Lecture Notes in Computer Science, vol 5541. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-02008-7_34


DOI: https://doi.org/10.1007/978-3-642-02008-7_34
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-02007-0
Online ISBN: 978-3-642-02008-7
eBook Packages: Computer Science (R0)
