A fast haplotype inference method for large population genotype data

https://doi.org/10.1016/j.csda.2008.04.004

Abstract

With the rapid progress of genotyping techniques, many large-scale, genome-wide disease studies are now under way. One challenge of large disease-association studies is developing a fast and accurate computational method for haplotype inference from genotype data. In this paper, a new computational method for the population-based haplotype inference problem is proposed. The method does not assume haplotype blocks in the population and allows each individual haplotype to have its own structure; it can therefore accommodate recombination and adapt better to the genotype data, particularly for long marker maps. The method uses a dynamic programming algorithm that is theoretically guaranteed to find exact maximum likelihood solutions of the variable order Markov chain model for the haplotype inference problem in linear running time. Hence, it is fast and practicable for large genotype datasets. Extensive computational experiments on large-scale real genotype data show that the proposed method is fast and efficient.

Introduction

Haplotypes play an important role in many genetic studies, especially association (or linkage disequilibrium, LD) studies of common complex diseases. They capture information about regions descended from ancestral chromosomes and give higher power for assigning a phenotype to a genetic region in association studies. Recent works such as Akey et al. (2001) and Stephens et al. (2001) indicate that haplotypes generally carry more information than individual SNP (single nucleotide polymorphism) markers in disease-association studies. Although haplotypes can be determined with existing experimental techniques (Patil et al., 2001), such approaches are expensive and time consuming. Current practical laboratory techniques usually provide unphased genotype data (i.e., an unordered pair of alleles for each marker) rather than haplotype data for diploid organisms. Therefore, the reconstruction of haplotypes from genotype data is a decisive step in genetic studies.

There are currently two main ways of solving the problem. One is haplotyping genetically related individuals. By exploiting the additional pedigree data, one can get a better estimate of haplotypes, since the haplotype pair of a child is constrained by its inheritance from its parents. This involves significant additional genotyping costs and potential recruiting problems, and on average up to one eighth of the alleles can still remain ambiguous. The second approach is haplotyping a population without pedigree information. This fast and cheap population-based alternative applies computational or statistical methods to find the most likely haplotype configurations from the observed genotype data. Many approaches to population-based haplotyping have been presented recently: Clark’s parsimony method (Clark, 1990; Gusfield, 2002), the expectation-maximization (EM) algorithm (Excoffier and Slatkin, 1995) and its Partition Ligation (PL) variant (Niu et al., 2002), PHASE (Stephens et al., 2001), Haplotyper (Niu et al., 2002), and the phylogenetic approach (Eskin et al., 2003; Gusfield, 2002; Halperin and Eskin, 2004). We refer interested readers to the review paper (Halldórsson et al., 2004) and the references therein for more details about these methods. Most of these existing methods assume that each haplotype is inherited as a unit from generation to generation and cannot accommodate recombination very well.

Recently, several researchers have explicitly modeled recombination (Eronen et al., 2004; Greenspan and Geiger, 2003; Kimmel and Shamir, 2004; Stephens and Scheet, 2005). The models in Greenspan and Geiger (2003) and Kimmel and Shamir (2004) address recombination by solving the haplotype inference problem and the haplotype block partition problem simultaneously. Stephens and Scheet (2005) handle recombination by treating recombination processes as “nuisance parameters”. Their algorithm is the most accurate among the methods described above, but also one of the slowest (Marchini et al., 2006). Eronen et al. (2004, 2006) use a variable order Markov chain method to handle recombination. The method estimates and uses frequencies of local haplotype fragments instead of those of full haplotypes. These fragments are shorter regions potentially conserved for several generations and thus more likely to be reliably identified in a population sample. This method is aimed at long marker maps, where LD between markers may be relatively weak. The experiments in Eronen et al. (2004, 2006) show that their algorithms outperform most of the existing haplotype inference methods (such as Snphap and PL-EM (Niu et al., 2002)), especially on genetically long marker maps. However, the “Partition Ligation” algorithm used in Eronen et al. (2004, 2006) is only a greedy heuristic that produces near-optimal haplotype reconstructions in reasonable time, without any guarantee of solution quality.

In this paper, we give a more rigorous mathematical definition of the variable order Markov chain model, which was first proposed by Eronen et al. (2004). Based on the new formulation of the model, an exact dynamic programming algorithm is proposed to solve the haplotype inference problem, replacing the heuristic method used by Eronen et al. (2004, 2006). The dynamic programming algorithm theoretically guarantees global optimality of the maximum likelihood solution and has low (linear) time complexity. Thus, it is fast and practicable for large genotype datasets. This paper also improves the fragment frequency estimate by exploiting partially missing markers, which give more information about possible haplotype configurations than is used in Eronen et al. (2004). Finally, extensive computational experiments were carried out to evaluate the proposed model and algorithm.

The rest of the paper is organized as follows. The next section introduces the notation and the haplotype inference problem. In Section 4, a dynamic programming method is developed for the haplotype inference problem. Section 5 reports the computational experiments on several real genotype datasets. A discussion and conclusions are provided in the last section.

Section snippets

Notation

Each diploid individual has two nearly identical copies of each chromosome and hence of each region of interest. A description of the alleles of markers on a single copy is called a haplotype, while a description of the conflated alleles (i.e., the unordered pair of alleles for each marker) on the two homologous copies of a chromosome is called a genotype. We denote a set (map) of the $n$ markers by $M=\{1,2,\ldots,n\}$ and the set of alleles of marker $i$ by $A_i$. Then the haplotypes are vectors in $\prod_{i=1}^{n} A_i$ and
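The relation between haplotypes and genotypes described above can be sketched in a few lines of Python. This is only an illustration, not code from HMC: the function names and the encoding of alleles as integers are assumptions made here. A genotype is stored as one sorted (hence unordered) allele pair per marker.

```python
from typing import List, Tuple

Haplotype = Tuple[int, ...]             # one allele per marker
Genotype = Tuple[Tuple[int, int], ...]  # one unordered allele pair per marker


def genotype_of(h1: Haplotype, h2: Haplotype) -> Genotype:
    """Conflate two haplotypes into the genotype they produce.

    Sorting each pair makes it unordered, so the result does not
    depend on which haplotype is listed first.
    """
    if len(h1) != len(h2):
        raise ValueError("haplotypes must cover the same markers")
    return tuple(tuple(sorted((a, b))) for a, b in zip(h1, h2))


def heterozygous_markers(g: Genotype) -> List[int]:
    """Return the 1-based positions of markers with two different alleles."""
    return [i for i, (a, b) in enumerate(g, start=1) if a != b]


def num_compatible_pairs(g: Genotype) -> int:
    """Count the unordered haplotype pairs that produce genotype g.

    With k heterozygous markers there are 2**(k-1) such pairs for
    k >= 1, and exactly one pair when k == 0.
    """
    k = len(heterozygous_markers(g))
    return 1 if k == 0 else 2 ** (k - 1)


if __name__ == "__main__":
    g = genotype_of((0, 1, 1), (1, 1, 0))
    print(g, heterozygous_markers(g), num_compatible_pairs(g))
```

The count of compatible pairs grows exponentially with the number of heterozygous markers, which is why phase cannot simply be enumerated on long marker maps.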

Haplotype fragment probabilities

Given a set of genotypes $\mathcal{G}$, the probabilities of haplotype fragments are estimated by their frequencies computed from the genotype data $\mathcal{G}$. In detail, the frequency of a haplotype fragment is calculated from the number of matching genotypes and the number of possible haplotype configurations for each genotype. For genotypes without missing data, the haplotype fragment probabilities can be estimated as follows. $$\Pr(H(i,j)\mid\mathcal{G}) \approx \mathrm{fr}(H(i,j)) = \frac{1}{|\mathcal{G}|}\sum_{G\in\mathcal{G},\; H(i,j)\sim G(i,j)} \frac{1}{2^{k_G(i,j)}},$$ where $k_G(i,j)$ is
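A minimal sketch of this frequency estimate follows. Because the definition of k_G(i,j) is cut off in the text above, the sketch assumes that k_G(i,j) is the number of heterozygous markers of genotype G between markers i and j, and that a fragment matches a genotype when each of its alleles is one of the two alleles observed at the corresponding marker. Both are assumptions made here, as are all function names; missing data is not handled.

```python
from typing import Sequence, Tuple

Genotype = Tuple[Tuple[int, int], ...]  # one unordered allele pair per marker


def matches(fragment: Sequence[int], genotype: Genotype, i: int) -> bool:
    """True if every fragment allele occurs in the genotype's pair.

    The fragment is assumed to start at 1-based marker i.
    """
    return all(a in genotype[i - 1 + t] for t, a in enumerate(fragment))


def het_count(genotype: Genotype, i: int, j: int) -> int:
    """Assumed meaning of k_G(i,j): heterozygous markers in i..j."""
    return sum(1 for a, b in genotype[i - 1:j] if a != b)


def fragment_frequency(fragment: Sequence[int], i: int,
                       genotypes: Sequence[Genotype]) -> float:
    """Estimate fr(H(i,j)) for a fragment covering markers i..j.

    Each matching genotype contributes 1 / 2**k_G(i,j), and the sum
    is divided by the number of genotypes.
    """
    j = i + len(fragment) - 1
    total = 0.0
    for g in genotypes:
        if matches(fragment, g, i):
            total += 1.0 / 2 ** het_count(g, i, j)
    return total / len(genotypes)


if __name__ == "__main__":
    data = [((0, 1), (0, 0), (0, 1)), ((0, 0), (0, 0), (1, 1))]
    print(fragment_frequency((0, 0), 1, data))
    print(fragment_frequency((1, 0), 1, data))
```

Under these assumptions, each genotype spreads a total weight of one evenly over the fragments it matches, so the estimated frequencies of all fragments on a fixed marker range sum to one.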

Markov chain MC(F)

For the selected haplotype fragment set $F=F_v(\mathcal{G},d,\delta)$, let $F=F_0\cup\left(\bigcup_{k=\tau,\ldots,n}F_k\right)$, where $\tau=\min\{j \mid H(1,j)\in F\}$ and $F_k$, $k=0,\tau,\ldots,n$, are disjoint haplotype fragment sets defined as follows: $$F_0=\{H(i,j)\mid H(i,j)\in F,\ 1<i\le j\le\tau\},\qquad F_k=\{H(i,k)\mid H(i,k)\in F\setminus F_0\},\quad k=\tau,\ldots,n.$$ That is, $F_0$ is the set of haplotype fragments of $F$ between the second and $\tau$th marker, $F_\tau$ is the set of haplotype fragments of $F$ from the first to the $\tau$th marker, and $F_k$ is the set of haplotype fragments of $F$ terminated at the $k$th marker, $k=\tau+1,\ldots,n$. We also call $F_k$ a
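The partition of the fragment set can be sketched as follows. The representation of a fragment as a tuple (i, j, alleles) and the function name are assumptions made for this illustration; how the fragment set itself is selected is not shown.

```python
from collections import defaultdict
from typing import Dict, Set, Tuple

Fragment = Tuple[int, int, str]  # (first marker i, last marker j, alleles)


def partition_fragments(
    fragments: Set[Fragment],
) -> Tuple[int, Set[Fragment], Dict[int, Set[Fragment]]]:
    """Split a fragment set into F_0 and the sets F_k, k = tau..n.

    tau is the smallest end marker among fragments starting at marker 1.
    F_0 holds fragments lying between marker 2 and marker tau.
    Every other fragment goes to F_k, where k is its last marker.
    """
    ends_of_first = [j for (i, j, _) in fragments if i == 1]
    if not ends_of_first:
        raise ValueError("no fragment starts at the first marker")
    tau = min(ends_of_first)

    f0 = {f for f in fragments if 1 < f[0] <= f[1] <= tau}

    by_end: Dict[int, Set[Fragment]] = defaultdict(set)
    for f in fragments:
        if f not in f0:
            by_end[f[1]].add(f)
    return tau, f0, dict(by_end)


if __name__ == "__main__":
    frags = {(1, 3, "010"), (2, 3, "10"), (2, 2, "1"),
             (2, 4, "101"), (3, 5, "011"), (1, 4, "0101")}
    tau, f0, fk = partition_fragments(frags)
    print(tau, sorted(f0), {k: sorted(v) for k, v in sorted(fk.items())})
```

Because tau is the minimum end marker of the fragments starting at marker 1, every fragment outside F_0 ends at marker tau or later, so the sets F_k are only needed for k from tau to n.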

Computational experiments

To evaluate the haplotype inference method presented in this paper and test the performance of the dynamic programming algorithm, extensive computational experiments were carried out on two real genotype datasets. The proposed algorithm was implemented in C++ as part of HMC version 0.8 (Haplotype inference tool based on Markov Chain models). The experimental results are compared to the VMM model of HaploRec 2.1 (Eronen et al., 2004; Eronen et al., 2006) since they are based

Conclusions

In this paper, a fast inference method for population-based haplotype reconstruction is proposed. The method does not assume haplotype blocks in the population and allows each individual haplotype to have its own structure, which enables it to better accommodate recombination and obtain higher adaptivity to the genotype data, specifically in the case of long marker maps.

The proposed method improves the variable order Markov chain model which was first proposed by Eronen et al., 2004, Eronen

Acknowledgments

The authors thank Lauri Eronen and Matthew Stephens for kindly making their programs and data available. The HMC software is available upon request from the authors.

The first two authors contributed equally. This work was supported by the National Natural Science Foundation of China under Grant Nos 60503004, 10631070, 70771013, and program for NCET-07-0105. The authors gratefully acknowledge the support of K. C. Wong Education Foundation, Hong Kong.
