Shotgun Sequence Assembly
Introduction
In 1982 Fred Sanger developed a new technique called shotgun sequencing and proved its worth by sequencing the complete genome of the bacteriophage Lambda [1]. Sequencing techniques were only able to “read” several hundred nucleotides at a time, and the new technique attempted to overcome this limitation by breaking up the DNA at random into pieces short enough to be read. The resulting pieces were then assembled based on the similarity between pieces derived from the same section of the original DNA molecule. The large amount of data produced by shotgun sequencing made it necessary to use computer programs to assist the assembly [2], [3]. Despite continued improvements in sequencing technology and the development of specialized assembly programs, it was unclear whether shotgun sequencing could be used to sequence genomes larger than those of viruses (typically 5000–100,000 nucleotides). For larger genomes it was thought that the complexity of the task would pose an insurmountable challenge to any computer program.
In 1995, however, researchers at The Institute for Genomic Research (TIGR) successfully used the shotgun sequencing technique to decipher the complete genome of the bacterium Haemophilus influenzae [4]. The sequencing of this 1.83 million base pair genome required the development of a specialized assembly program [5] as well as painstaking laboratory efforts to complete those regions that could not be correctly assembled by the software. The success of the Haemophilus project started a genomics revolution, with the number of genomes sequenced every year increasing at an exponential rate. At the moment the genomes of more than 1000 viruses, 100 bacteria, and several eukaryotes have been completed, while multiple other projects are well on the way to completion. In parallel with the large amounts of genomic data becoming available, the genomic revolution led to the birth of a new field, bioinformatics, bringing together an eclectic mix of scientific fields such as computer science and engineering, mathematics, physics, chemistry, and biology.
Critics of the shotgun sequencing approach continued to question its applicability to large genomes despite obvious successes in sequencing bacterial genomes. They argued the technique would be impractical in the case of large eukaryotic genomes because repeats (stretches of DNA that occur in two or more copies within the genome) would hopelessly confuse any assembler [6]. The standard procedure for handling large genomes was a hierarchical approach that involved breaking up the DNA into large (50–150 kbp) pieces cloned in bacterial artificial chromosomes (BACs), and then sequencing each BAC through the shotgun method. Most such criticism was silenced in 2000 by the successful assembly at Celera of the genome of Drosophila melanogaster [7] from whole-genome shotgun (WGS) data. The assembly was performed with a new assembler [8] designed to handle the specific problems involved in assembling large complex genomes. The researchers from Celera went on to assemble the human genome using the same whole-genome shotgun sequencing technique [9]. Their results were published simultaneously with those from the International Human Genome Sequencing Consortium, who used the traditional hierarchical method [10]. Independent studies [11], [12] later showed that the two assemblies produced similar results and that many of the differences between them could be explained by the draft-level quality of the data. The applicability of the WGS method to large genomes was thus demonstrated, though some continue to dispute the validity of Celera's results (some opinions on this topic are presented in [13], [14], [15], [16], [17]).
Celera's success, combined with the cost advantages of the WGS technique (Celera sequenced and assembled the human genome in a little over a year, while the international consortium's efforts had been going on for more than 10 years), renewed interest in the WGS method and led to the development of several WGS assembly programs: Arachne [18], [19] at the Whitehead Institute, Phusion [20] at the Sanger Institute, Atlas [21] at the Baylor Human Genome Sequencing Center, and Jazz [22] at the DOE Joint Genome Institute. Most current sequencing projects have opted for a WGS approach instead of the hierarchical approach. For example, the sequencing projects for the mouse [23], rat [24], dog [25], puffer fish [22], and sea squirt [26] all follow the WGS strategy.
The current issue of debate is the suitability of whole-genome shotgun sequencing as the starting point in the efforts to obtain the complete sequence for a genome. All sequencing strategies start by building a backbone, or rough draft, of the genome whose gaps need to be filled in through further laboratory experiments. It is still not clear which sequencing strategy will ultimately be the most efficient in obtaining the complete sequence of an organism, especially as none of the large eukaryotic projects have yet been finished, except for the 100 Mbp genome of the nematode Caenorhabditis elegans, finished in October 2002. (The genomes of Drosophila melanogaster and human are expected to be mostly finished before the end of 2003.)
Despite significant differences in the overall structuring of the sequencing process, all sequencing strategies rely on shotgun sequencing as a basic component. The reader is referred to [27], [28] for an in-depth discussion of current approaches to sequencing. The following sections describe the shotgun sequencing technique, with an emphasis on the algorithmic challenges imposed by this technique.
Shotgun Sequencing Overview
The process of shotgun sequencing starts by physically breaking up the DNA molecule into millions of random fragments. The fragments are then inserted into cloning vectors in order to amplify the DNA to levels needed by the sequencing reactions. Commonly used cloning vectors are plasmids (circular pieces
Assembly Paradigms
In its most general form the sequence assembly problem involves reconstructing the genome from the shotgun reads based on sequence similarity alone. This problem can be further decomposed into two problems: the mapping or layout problem, in which all reads need to be positioned correctly in the genome, and the consensus problem, in which the contiguous DNA sequence of the genome is computed. It can be easily seen that in this formulation the general problem is impossible to solve. For example,
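The split between layout and consensus can be illustrated with a toy example. The following is a minimal sketch, not the method of any particular assembler: it assumes the layout problem has already been solved (the read offsets are hypothetical values chosen for illustration), that reads are placed without gaps, and that a simple majority vote per position is an acceptable consensus rule. The function name `consensus_from_layout` is an illustrative choice.

```python
from collections import Counter


def consensus_from_layout(layout):
    """Compute a majority-vote consensus from an already-known layout.

    layout: list of (offset, read) pairs, where offset is the assumed
    position of the read's first base in the genome (gap-free placement).
    Positions covered by no read are reported as 'N'.
    """
    length = max(offset + len(read) for offset, read in layout)
    # One counter of observed bases per genome position.
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


# Toy layout: three reads covering a 12-base region. The second read
# carries one simulated sequencing error (T instead of G at position 5),
# which the other two reads outvote.
layout = [(0, "ACGTAGGC"), (2, "GTATGCTT"), (4, "AGGCTTAC")]
print(consensus_from_layout(layout))  # ACGTAGGCTTAC
```

In this sketch the consensus step is easy once the offsets are known; the difficulty of assembly lies in the layout step, which here is simply given as input.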
Overlap Detection
The basic assumption of shotgun sequencing is that sequence similarity between two reads is an indication that the reads originate from the same section of the genome. All assembly algorithms must therefore identify similarities between reads. The specific algorithmic approaches to the task have evolved throughout the years, as increasingly more complex sequencing projects were tackled through the shotgun method. The earliest algorithms involved either iteratively aligning each read to an
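The idea of identifying similarities between reads can be sketched in its simplest form as a search for suffix-prefix overlaps. The code below is a simplified illustration under stated assumptions: it requires exact matches (real reads contain sequencing errors, so practical assemblers rely on error-tolerant alignment), it compares all pairs of reads (quadratic in the number of reads), and the names `suffix_prefix_overlap`, `all_overlaps`, and the `min_len` threshold are hypothetical choices for this example.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a that exactly equals a prefix of b.

    Returns 0 if no overlap of at least min_len bases exists.
    min_len is assumed to be at least 1.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads, min_len=3):
    """Return {(i, j): overlap_length} for ordered pairs of distinct reads."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                k = suffix_prefix_overlap(a, b, min_len)
                if k:
                    overlaps[(i, j)] = k
    return overlaps


# Toy reads sampled from one short region; read 0 overlaps read 1,
# and read 1 overlaps read 2.
reads = ["ACGTAGGC", "AGGCTTAC", "TTACGGA"]
print(all_overlaps(reads))  # {(0, 1): 4, (1, 2): 4}
```

The `min_len` threshold stands in for the requirement that an overlap be long enough to be unlikely by chance; with a threshold that is too low, short spurious matches (such as the two-base match between the end of read 1 and the start of read 0) would be reported as overlaps.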
Exotic Assembly
Up to this point we have presented solutions to the most common problems related to shotgun sequence assembly. These algorithms contributed to the current genomic revolution leading to an exponentially increasing number of genomes being sequenced. This increase in the numbers and types of genomes that are analyzed is uncovering new problems to be solved by assembly programs. In this section we will briefly discuss a few of the current assembly challenges.
Conclusions
The assembly problem was repeatedly considered solved: first when efficient approximation algorithms for the shortest superstring problem became available, again when assembly software was able to routinely assemble entire bacterial genomes, and most recently when software became available that can assemble entire mammalian genomes in a relatively short time. Continued reductions in sequencing costs have led to a dramatic increase in the numbers of genomes being sequenced. A direct effect of this genomic
Acknowledgements
I would like to thank Art Delcher, Adam Phillippy, and Steven Salzberg for their useful comments and continued support. This work was supported in part by the National Institutes of Health under grant R01-LM06845.
References (137)
Nucleotide sequence of bacteriophage lambda DNA
J. Mol. Biol.
(1982)
Genomic mapping by fingerprinting random clones: A mathematical analysis
Genomics
(1988)
Genomic mapping by anchoring random clones: A mathematical analysis
Genomics
(1991)
Pairwise end sequencing: A unified approach to genomic mapping and sequencing
Genomics
(1995)
Genomic mapping by end-characterized random clones: A mathematical analysis
Genomics
(1995)
Optimized multiplex PCR: efficiently closing a whole-genome shotgun sequencing project
Genomics
(1999)
Identification of common molecular subsequences
J. Mol. Biol.
(1981)
Sequence and analysis of the Arabidopsis genome
Curr. Opin. Plant. Biol.
(2001)
A general method applicable to the search for similarities in the amino acid sequence of two proteins
J. Mol. Biol.
(1970)
Basic local alignment search tool
J. Mol. Biol.
(1990)
An improved sequence assembly program
Genomics
Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing
Nucleic Acids Res.
Computer programs for the assembly of DNA sequences
Nucleic Acids Res.
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd
Science
TIGR assembler: A new tool for assembling large shotgun sequencing projects
Genome Science and Technology
Against a whole-genome shotgun
Genome Res.
The genome sequence of Drosophila melanogaster
Science
A whole-genome assembly of Drosophila
Science
The sequence of the human genome
Science
Initial sequencing and analysis of the human genome
Nature
Computational comparison of human genomic sequence assemblies for a region of chromosome 4
Genome Res.
Computational comparison of two draft sequences of the human genome
Nature
The independence of our genome assemblies
Proc. Natl. Acad. Sci. USA
More on the sequencing of the human genome
Proc. Natl. Acad. Sci. USA
On the sequencing of the human genome
Proc. Natl. Acad. Sci. USA
Whole-genome disassembly
Proc. Natl. Acad. Sci. USA
On the sequencing and assembly of the human genome
Proc. Natl. Acad. Sci. USA
ARACHNE: a whole-genome shotgun assembler
Genome Res.
Whole-genome sequence assembly for Mammalian genomes: arachne 2
Genome Res.
The phusion assembler
Genome Res.
The Atlas whole-genome assembler
Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes
Science
Initial sequencing and comparative analysis of the mouse genome
Nature
Rat Genome Project
The dog genome: survey sequencing and comparative analysis
Science
The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins
Science
Strategies for the systematic sequencing of complex genomes
Nat. Rev. Genet.
A clone-array pooled shotgun strategy for sequencing large genomes
Genome Res.
Algorithms for optimizing production DNA sequencing
Sequencing a genome by walking with clone-end sequences: A mathematical analysis
Genome Res.
Estimating the repeat structure and length of DNA sequences using ell-tuples
Genome Res.
Predicting Progress in Shotgun Sequencing with Paired Ends
Representation of cloned genomic sequences in two sequencing vectors: correlation of DNA sequence and subclone distribution
Nucleic Acids Res.
Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction
Cold Spring Harb. Symp. Quant. Biol.
Multiplex polymerase chain reaction
Mod. Pathol.
An optimal procedure for gap closing in whole genome shotgun sequencing
Learning a hidden matching
Sequence assembly and finishing methods
Methods Biochem. Anal.
Automated finishing with autofinish
Genome Res.
Consed: A graphical tool for sequence finishing
Genome Res.