DOI: 10.1145/1247480.1247572
Article

Genome-scale disk-based suffix tree indexing

Published: 11 June 2007

ABSTRACT

With the exponential growth of biological sequence databases, it has become critical to develop effective techniques for storing, querying, and analyzing these massive data. Suffix trees are widely used to solve many sequence-based problems, and they can be built in linear time and space, provided the resulting tree fits in main-memory. To index larger sequences, several external suffix tree algorithms have been proposed in recent years. However, they suffer from several problems such as susceptibility to data skew, non-scalability to genome-scale sequences, and non-existence of suffix links, which are crucial in various suffix tree based algorithms. In this paper, we target DNA sequences and propose a novel disk-based suffix tree algorithm called TRELLIS, which effectively scales up to genome-scale sequences. Specifically, it can index the entire human genome using 2GB of memory, in about 4 hours and can recover all its suffix links within 2 hours. TRELLIS was compared to various state-of-the-art persistent disk-based suffix tree construction algorithms, and was shown to outperform the best previous methods, both in terms of indexing time and querying time.
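
To make the terms in the abstract concrete, the following is a minimal in-memory sketch of a suffix structure with suffix links. It is not TRELLIS and makes no attempt at disk-based construction: it builds an uncompressed suffix trie in quadratic time, which is only practical for toy inputs. All names (`Node`, `build_suffix_trie`, `add_suffix_links`, `find`, `contains`) are hypothetical and chosen for illustration. The suffix link of the node spelling a string `aw` (a single character `a` followed by `w`) points to the node spelling `w`.

```python
from collections import deque


class Node:
    """One node of an uncompressed suffix trie (illustrative only)."""

    def __init__(self):
        self.children = {}       # next character -> child Node
        self.suffix_link = None  # node spelling this node's string minus its first character


def build_suffix_trie(text):
    """Insert every suffix of text, one character per edge (naive, O(n^2))."""
    root = Node()
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            if ch not in node.children:
                node.children[ch] = Node()
            node = node.children[ch]
    return root


def add_suffix_links(root):
    """Set suffix links breadth-first.

    If a parent spelling a+u links to the node spelling u, then its child
    reached by character c (spelling a+u+c) links to the child of that node
    reached by c (spelling u+c), which exists because u+c is also a substring.
    """
    queue = deque()
    for child in root.children.values():
        child.suffix_link = root  # single characters link to the empty string
        queue.append(child)
    while queue:
        node = queue.popleft()
        for ch, child in node.children.items():
            child.suffix_link = node.suffix_link.children[ch]
            queue.append(child)


def find(root, pattern):
    """Return the node spelling pattern, or None if pattern is not a substring."""
    node = root
    for ch in pattern:
        node = node.children.get(ch)
        if node is None:
            return None
    return node


def contains(root, pattern):
    """Substring query: time proportional to len(pattern), independent of text length."""
    return find(root, pattern) is not None
```

A query such as `contains(build_suffix_trie("ACGTACGA$"), "GTAC")` walks at most four edges regardless of the sequence length, which is why such indexes are attractive for large sequence databases. Production suffix trees compress unary paths into single edges to reach linear space, a step this sketch omits.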


Published in

SIGMOD '07: Proceedings of the 2007 ACM SIGMOD international conference on Management of data
June 2007
1210 pages
ISBN: 9781595936868
DOI: 10.1145/1247480
General Chairs: Lizhu Zhou, Tok Wang Ling
Program Chair: Beng Chin Ooi

Copyright © 2007 ACM


Publisher: Association for Computing Machinery, New York, NY, United States


Qualifiers: Article

Overall Acceptance Rate: 785 of 4,003 submissions, 20%
