ABSTRACT
With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing these massive data sets. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems, such as susceptibility to data skew, failure to scale to genome-scale sequences, and the absence of suffix links, which are crucial in various suffix-tree-based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called TRELLIS, which scales effectively to genome-scale sequences. Specifically, it can index the entire human genome in about 4 hours using 2GB of memory, and can recover all of its suffix links within 2 hours. We compared TRELLIS with various state-of-the-art persistent disk-based suffix tree construction algorithms and found that it outperforms the best previous methods in both indexing time and querying time.
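For readers unfamiliar with the data structure, the following is a minimal in-memory sketch of what a suffix index over a DNA string does. It is not TRELLIS and not any algorithm from the paper: it builds a naive, uncompressed suffix trie in quadratic time, with no suffix links and no disk storage. The function names `build_suffix_trie` and `find_occurrences`, the `"$"` terminator, and the `"pos"` leaf key are all assumptions made for illustration.

```python
def build_suffix_trie(text):
    """Build a naive suffix trie as nested dicts (quadratic time and space).

    Illustration only: real suffix tree builders compress unary paths and
    run in linear time; this sketch does neither.
    """
    text += "$"  # assumed terminator, must not occur in the input
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["pos"] = start  # leaf records where this suffix begins
    return root


def find_occurrences(root, pattern):
    """Return sorted start positions of pattern in the indexed text."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # pattern is not a substring
        node = node[ch]
    # Every leaf below this node is a suffix that starts with the pattern.
    hits, stack = [], [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "pos":
                hits.append(child)
            else:
                stack.append(child)
    return sorted(hits)


trie = build_suffix_trie("GATTACA")
print(find_occurrences(trie, "A"))   # positions where "A" occurs
print(find_occurrences(trie, "TA"))  # positions where "TA" occurs
```

A query walks at most one edge per pattern character, which is why such an index is attractive for sequence search. The quadratic footprint of this sketch is also a reminder of why genome-scale inputs call for compressed, disk-based constructions like the one the abstract describes.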
Index Terms
- Genome-scale disk-based suffix tree indexing