Shotgun Sequence Assembly

doi:10.1016/S0065-2458(03)60006-9

Advances in Computers

Volume 60, 2004, Pages 193-248

https://doi.org/10.1016/S0065-2458(03)60006-9

Abstract

Shotgun sequencing is the most widely used technique for determining the DNA sequence of organisms. It involves breaking up the DNA into many small pieces that can be read by automated sequencing machines, then piecing together the original genome using specialized software programs called assemblers. Due to the large amounts of data being generated and to the complex structure of most organisms' genomes, successful assembly programs rely on sophisticated algorithms based on knowledge from such diverse fields as statistics, graph theory, computer science, and computer engineering. Throughout this chapter we will describe the main computational challenges imposed by the shotgun sequencing method, and survey the most widely used assembly algorithms.

Introduction

In 1982 Fred Sanger developed a new technique called shotgun sequencing and proved its worth by sequencing the complete genome of the bacteriophage Lambda [1]. Sequencing techniques of the time were only able to “read” several hundred nucleotides at a time, and shotgun sequencing attempted to overcome this limitation by breaking up the DNA at random into pieces short enough to be read. The resulting pieces were then assembled based on the similarity between pieces derived from the same section of the original DNA molecule. The large amount of data produced by shotgun sequencing made it necessary to use computer programs to assist the assembly [2], [3]. Despite continued improvements in sequencing technology and the development of specialized assembly programs, it was unclear whether shotgun sequencing could be used to sequence genomes larger than those of viruses (typically 5000–100,000 nucleotides). For larger genomes it was thought that the complexity of the task would pose an insurmountable challenge to any computer program.

In 1995, however, researchers at The Institute for Genomic Research (TIGR) successfully used the shotgun sequencing technique to decipher the complete genome of the bacterium Haemophilus influenzae [4]. The sequencing of this 1.83 million base pair genome required the development of a specialized assembly program [5] as well as painstaking laboratory efforts to complete those regions that could not be correctly assembled by the software. The success of the Haemophilus project started a genomics revolution, with the number of genomes sequenced every year increasing at an exponential rate. At the moment the genomes of more than 1000 viruses, 100 bacteria, and several eukaryotes have been completed, while multiple other projects are well on their way to completion. In parallel with the large amounts of genomic data becoming available, the genomic revolution led to the birth of a new field, bioinformatics, which brings together an eclectic mix of scientific fields such as computer science and engineering, mathematics, physics, chemistry, and biology.

Critics of the shotgun sequencing approach continued to question its applicability to large genomes despite obvious successes in sequencing bacterial genomes. They argued the technique would be impractical in the case of large eukaryotic genomes because repeats (stretches of DNA that occur in two or more copies within the genome) would hopelessly confuse any assembler [6]. The standard procedure for handling large genomes was a hierarchical approach that involved breaking up the DNA into large (50–150 kbp) pieces cloned in bacterial artificial chromosomes (BACs), and then sequencing each BAC through the shotgun method. Most such criticism was silenced in 2000 by the successful assembly at Celera of the genome of Drosophila melanogaster [7] from whole-genome shotgun (WGS) data. The assembly was performed with a new assembler [8] designed to handle the specific problems involved in assembling large complex genomes. The researchers from Celera went on to assemble the human genome using the same whole-genome shotgun sequencing technique [9]. Their results were published simultaneously with those from the International Human Genome Sequencing Consortium, who used the traditional hierarchical method [10]. Independent studies [11], [12] later showed that the two assemblies produced similar results and that many of the differences between them could be explained by the draft-level quality of the data. The applicability of the WGS method to large genomes was thus proven, though some continue to dispute the validity of Celera's results (some opinions on this topic are presented in [13], [14], [15], [16], [17]).

Celera's success, combined with the cost advantages of the WGS technique (Celera sequenced and assembled the human genome in a little over a year, while the international consortium's efforts had been going on for more than 10 years), renewed interest in the WGS method and led to the development of several WGS assembly programs: Arachne [18], [19] at the Whitehead Institute, Phusion [20] at the Sanger Institute, Atlas [21] at the Baylor Human Genome Sequencing Center, and Jazz [22] at the DOE Joint Genome Institute. Most current sequencing projects have opted for a WGS approach instead of the hierarchical approach. For example, the sequencing projects for the mouse [23], rat [24], dog [25], puffer fish [22], and sea squirt [26] all follow the WGS strategy.

The current issue of debate is the suitability of whole-genome shotgun sequencing as the starting point in the efforts to obtain the complete sequence for a genome. All sequencing strategies start by building a backbone, or rough draft, of the genome whose gaps need to be filled in through further laboratory experiments. It is still not clear which sequencing strategy will ultimately be the most efficient in obtaining the complete sequence of an organism, especially as none of the large eukaryotic projects have yet been finished, except for the 100 Mbp genome of the nematode Caenorhabditis elegans, finished in October 2002. (The genomes of Drosophila melanogaster and human are expected to be mostly finished before the end of 2003.)

Despite significant differences in the overall structuring of the sequencing process, all sequencing strategies rely on shotgun sequencing as a basic component. The reader is referred to [27], [28] for an in-depth discussion of current approaches to sequencing. The following sections describe the shotgun sequencing technique, with an emphasis on the algorithmic challenges imposed by this technique.

Section snippets

Shotgun Sequencing Overview

The process of shotgun sequencing starts by physically breaking up the DNA molecule into millions of random fragments. The fragments are then inserted into cloning vectors¹ in order to amplify the DNA to levels needed by the sequencing reactions. Commonly used cloning vectors are plasmids (circular pieces
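The random fragmentation step described in this snippet can be imitated in a few lines of code. The following is only a minimal sketch under simplifying assumptions that are not taken from the chapter: every read has the same fixed length, start positions are drawn uniformly, and cloning, sequencing errors, and strand orientation are ignored. The function names shotgun_reads and coverage are hypothetical.

```python
import random


def shotgun_reads(genome, n_reads, read_len, seed=0):
    """Sample fixed-length reads from uniformly random positions of a genome.

    Simplified illustration: real reads vary in length, contain errors,
    and may come from either strand.
    """
    rng = random.Random(seed)  # seeded so that runs are reproducible
    reads = []
    for _ in range(n_reads):
        start = rng.randrange(len(genome) - read_len + 1)
        reads.append(genome[start:start + read_len])
    return reads


def coverage(genome_len, n_reads, read_len):
    """Average number of reads covering each base of the genome."""
    return n_reads * read_len / genome_len


if __name__ == "__main__":
    genome = "ACGTACGGATTCAGGCTA"  # toy sequence, made up for the example
    for read in shotgun_reads(genome, n_reads=10, read_len=6):
        print(read)
    print("coverage:", coverage(len(genome), 10, 6))
```

Because the positions are random, some parts of the genome end up covered by several reads while others may not be covered at all.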

Assembly Paradigms

In its most general form the sequence assembly problem involves reconstructing the genome from the shotgun reads based on sequence similarity alone. This problem can be further decomposed into two problems: the mapping or layout problem, in which all reads need to be positioned correctly in the genome, and the consensus problem, in which the contiguous DNA sequence of the genome is computed. It can be easily seen that in this formulation the general problem is impossible to solve. For example,
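To make the split between the layout problem and the consensus problem concrete, the sketch below assumes that the layout is already solved, that is, each read comes with its offset in the genome, and computes a consensus by a simple per-column majority vote. This is an illustrative simplification rather than the method of any particular assembler: it ignores insertions and deletions, and the function name consensus and the placeholder "N" for uncovered columns are choices made for this example.

```python
from collections import Counter


def consensus(layout):
    """Majority-vote consensus from a layout of (offset, read) pairs.

    Assumes reads are placed without gaps (substitution errors only).
    Columns covered by no read are reported as "N".
    """
    length = max(offset + len(read) for offset, read in layout)
    columns = [Counter() for _ in range(length)]
    for offset, read in layout:
        for i, base in enumerate(read):
            columns[offset + i][base] += 1
    return "".join(
        col.most_common(1)[0][0] if col else "N" for col in columns
    )


if __name__ == "__main__":
    # Three overlapping toy reads placed at offsets 0, 3 and 6.
    layout = [(0, "ACGTAC"), (3, "TACGGA"), (6, "GGATTC")]
    print(consensus(layout))
```

With the layout given, the consensus step is straightforward; the difficulty of assembly lies in finding the layout in the first place.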

Overlap Detection

The basic assumption of shotgun sequencing is that sequence similarity between two reads is an indication that the reads originate from the same section of the genome. All assembly algorithms must therefore identify similarities between reads. The specific algorithmic approaches to the task have evolved throughout the years, as increasingly more complex sequencing projects were tackled through the shotgun method. The earliest algorithms involved either iteratively aligning each read to an
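As a toy illustration of what identifying similarities between reads means, the sketch below finds, for every ordered pair of reads, the longest suffix of the first read that exactly matches a prefix of the second. It is a deliberately naive all-against-all comparison under the assumption of error-free reads. Real reads contain sequencing errors, so an actual overlapper has to tolerate mismatches and gaps, and this sketch should not be read as the approach of any specific assembler. The names suffix_prefix_overlap, all_overlaps, and min_len are hypothetical.

```python
def suffix_prefix_overlap(a, b, min_len=3):
    """Length of the longest suffix of a equal to a prefix of b.

    Returns 0 if no exact overlap of at least min_len bases exists.
    """
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a[-k:] == b[:k]:
            return k
    return 0


def all_overlaps(reads, min_len=3):
    """Compare every ordered pair of reads; quadratic in the number of reads."""
    overlaps = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i == j:
                continue
            k = suffix_prefix_overlap(a, b, min_len)
            if k > 0:
                overlaps[(i, j)] = k
    return overlaps


if __name__ == "__main__":
    reads = ["ACGTAC", "TACGGA", "GGATTC"]  # toy reads, made up
    print(all_overlaps(reads))
```

The min_len threshold guards against very short matches that occur by chance. The quadratic number of comparisons is acceptable for a handful of reads but becomes a bottleneck as the number of reads grows.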

Exotic Assembly

Up to this point we have presented solutions to the most common problems related to shotgun sequence assembly. These algorithms contributed to the current genomic revolution leading to an exponentially increasing number of genomes being sequenced. This increase in the numbers and types of genomes that are analyzed is uncovering new problems to be solved by assembly programs. In this section we will briefly discuss a few of the current assembly challenges.

Conclusions

The assembly problem has repeatedly been considered solved: first when efficient approximation algorithms for the shortest superstring problem became available, again when assembly software was able to routinely assemble entire bacterial genomes, and most recently when software became available that can assemble entire mammalian genomes in a relatively short time. Continued reductions in sequencing costs have led to a dramatic increase in the numbers of genomes being sequenced. A direct effect of this genomic

Acknowledgements

I would like to thank Art Delcher, Adam Phillippy, and Steven Salzberg for their useful comments and continued support. This work was supported in part by the National Institutes of Health under grant R01-LM06845.

References (137)

F. Sanger, Nucleotide sequence of bacteriophage lambda DNA, J. Mol. Biol. (1982)
E.S. Lander et al., Genomic mapping by fingerprinting random clones: A mathematical analysis, Genomics (1988)
R. Arratia, Genomic mapping by anchoring random clones: A mathematical analysis, Genomics (1991)
J.C. Roach, Pairwise end sequencing: A unified approach to genomic mapping and sequencing, Genomics (1995)
E. Port, Genomic mapping by end-characterized random clones: A mathematical analysis, Genomics (1995)
H. Tettelin, Optimized multiplex PCR: efficiently closing a whole-genome shotgun sequencing project, Genomics (1999)
T.F. Smith et al., Identification of common molecular subsequences, J. Mol. Biol. (1981)
M. Bevan, Sequence and analysis of the Arabidopsis genome, Curr. Opin. Plant. Biol. (2001)
S.B. Needleman et al., A general method applicable to the search for similarities in the amino acid sequence of two proteins, J. Mol. Biol. (1970)
S.F. Altschul, Basic local alignment search tool, J. Mol. Biol. (1990)
X. Huang, An improved sequence assembly program, Genomics (1996)
R. Staden, Automation of the computer handling of gel reading data produced by the shotgun method of DNA sequencing, Nucleic Acids Res. (1982)
T.R. Gingeras, Computer programs for the assembly of DNA sequences, Nucleic Acids Res. (1979)
R.D. Fleischmann, Whole-genome random sequencing and assembly of Haemophilus influenzae Rd, Science (1995)
G.G. Sutton, TIGR assembler: A new tool for assembling large shotgun sequencing projects, Genome Science and Technology (1995)
P. Green, Against a whole-genome shotgun, Genome Res. (1997)
M.D. Adams, The genome sequence of Drosophila melanogaster, Science (2000)
E.W. Myers, A whole-genome assembly of Drosophila, Science (2000)
J.C. Venter, The sequence of the human genome, Science (2001)
I.H.G.S. Consortium, Initial sequencing and analysis of the human genome, Nature (2001)
C.A. Semple, Computational comparison of human genomic sequence assemblies for a region of chromosome 4, Genome Res. (2002)
J. Aach, Computational comparison of two draft sequences of the human genome, Nature (2001)
M.D. Adams, The independence of our genome assemblies, Proc. Natl. Acad. Sci. USA (2003)
R.H. Waterston et al., More on the sequencing of the human genome, Proc. Natl. Acad. Sci. USA (2003)
R.H. Waterston et al., On the sequencing of the human genome, Proc. Natl. Acad. Sci. USA (2002)
P. Green, Whole-genome disassembly, Proc. Natl. Acad. Sci. USA (2002)
E.W. Myers, On the sequencing and assembly of the human genome, Proc. Natl. Acad. Sci. USA (2002)
S. Batzoglou, ARACHNE: a whole-genome shotgun assembler, Genome Res. (2002)
D.B. Jaffe, Whole-genome sequence assembly for Mammalian genomes: arachne 2, Genome Res. (2003)
J.C. Mullikin et al., The phusion assembler, Genome Res. (2003)
P. Havlak, The Atlas whole-genome assembler
S. Aparicio, Whole-genome shotgun assembly and analysis of the genome of Fugu rubripes, Science (2002)
R.H. Waterston, Initial sequencing and comparative analysis of the mouse genome, Nature (2002)
Consortium R.g.s, Rat genome project
E.F. Kirkness, The dog genome: survey sequencing and comparative analysis, Science (2003)
P. Dehal, The draft genome of Ciona intestinalis: insights into chordate and vertebrate origins, Science (2002)
E.D. Green, Strategies for the systematic sequencing of complex genomes, Nat. Rev. Genet. (2001)
W.W. Cai, A clone-array pooled shotgun strategy for sequencing large genomes, Genome Res. (2001)
E. Czabarka, Algorithms for optimizing production DNA sequencing
S. Batzoglou, Sequencing a genome by walking with clone-end sequences: A mathematical analysis, Genome Res. (1999)
X. Li et al., Estimating the repeat structure and length of DNA sequences using ell-tuples, Genome Res. (2003)
R.F. Yeh, Predicting Progress in Shotgun Sequencing with Paired Ends (2002)
S.L. Chissoe, Representation of cloned genomic sequences in two sequencing vectors: correlation of DNA sequence and subclone distribution, Nucleic Acids Res. (1997)
K. Mullis, Specific enzymatic amplification of DNA in vitro: the polymerase chain reaction, Cold Spring Harb. Symp. Quant. Biol. (1986)
L.J. Burgart, Multiplex polymerase chain reaction, Mod. Pathol. (1992)
R. Beigel, An optimal procedure for gap closing in whole genome shotgun sequencing
N. Alon, Learning a hidden matching
R. Staden et al., Sequence assembly and finishing methods, Methods Biochem. Anal. (2001)
D. Gordon et al., Automated finishing with autofinish, Genome Res. (2001)
D. Gordon et al., Consed: A graphical tool for sequence finishing, Genome Res. (1998)

