Genome-scale disk-based suffix tree indexing

Authors:
Benjarath Phoophakdee, Rensselaer Polytechnic Institute, Troy, NY
Mohammed J. Zaki, Rensselaer Polytechnic Institute, Troy, NY

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data, June 2007, Pages 833–844
https://doi.org/10.1145/1247480.1247572

Published: 11 June 2007

ABSTRACT

With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing these massive data sets. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems, such as susceptibility to data skew, non-scalability to genome-scale sequences, and the absence of suffix links, which are crucial in various suffix-tree-based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called TRELLIS, which effectively scales up to genome-scale sequences. Specifically, it can index the entire human genome using 2GB of memory in about 4 hours, and can recover all its suffix links within 2 hours. TRELLIS was compared to various state-of-the-art persistent disk-based suffix tree construction algorithms, and was shown to outperform the best previous methods, both in terms of indexing time and querying time.
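
To make the underlying data structure concrete, the following is a minimal illustrative sketch of suffix-based substring search on a short DNA string. It is not the TRELLIS algorithm: it builds a naive, uncompressed, in-memory suffix trie in quadratic time and space, with no suffix links and no disk storage. The function names `build_suffix_trie` and `find_occurrences`, the `$` terminator, and the `#` leaf marker are assumptions made for this example only.

```python
def build_suffix_trie(text):
    """Build a naive suffix trie (nested dicts) for text.

    Assumption for this sketch: text contains neither '$' nor '#'.
    Cost is O(n^2) time and space, unlike the linear-time
    constructions mentioned in the abstract.
    """
    text = text + "$"  # unique terminator so every suffix ends at a leaf
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["#"] = start  # leaf marker: start position of this suffix
    return root


def find_occurrences(root, pattern):
    """Return the sorted start positions of pattern in the indexed text."""
    node = root
    for ch in pattern:
        if ch not in node:
            return []  # pattern is not a substring
        node = node[ch]
    # Every leaf below this node is a suffix that starts with pattern.
    positions = []
    stack = [node]
    while stack:
        current = stack.pop()
        for key, child in current.items():
            if key == "#":
                positions.append(child)
            else:
                stack.append(child)
    return sorted(positions)


if __name__ == "__main__":
    trie = build_suffix_trie("GATTACA")
    print(find_occurrences(trie, "A"))
    print(find_occurrences(trie, "TA"))
```

Once the index exists, a query walks at most one node per pattern character before collecting the matching leaves, which is what makes suffix trees attractive for repeated searches. The quadratic footprint of this sketch also shows why genome-scale inputs call for compressed, disk-based constructions.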

Published in
SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN: 9781595936868
DOI: 10.1145/1247480
General Chairs: Lizhu Zhou (Tsinghua University, China), Tok Wang Ling (National University of Singapore, Singapore)
Program Chair: Beng Chin Ooi (National University of Singapore, Singapore)
Copyright © 2007 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags: disk-based, external memory, genome-scale, sequence indexing, suffix tree
Acceptance Rates
Overall Acceptance Rate: 785 of 4,003 submissions, 20%