Indexing Similar DNA Sequences

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6124)

Abstract

To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. Building an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + N log N) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m + m log N + m log(rk) psc(P) + occ log n) time, where psc(P) is the number of prefixes of P that appear as a suffix of some common segment and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution and evaluated it using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by a BWT built on the concatenation of all sequences. Compared to the other existing solution (RLCSA), we use less memory and achieve faster search times.
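
To make the quantity psc(P) from the abstract concrete, here is a minimal brute-force sketch in Python. It is not the paper's index and does not reflect how the authors compute anything. The function names `psc` and `total_length` are hypothetical, the example strings are made up, and counting the full pattern P as one of its own prefixes is an assumption, since the abstract does not say whether only proper prefixes count.

```python
from typing import List


def psc(pattern: str, common_segments: List[str]) -> int:
    """Count the non-empty prefixes of `pattern` that occur as a suffix
    of at least one common segment (naive check, for illustration only).

    Assumption: the full pattern counts as one of its own prefixes.
    """
    count = 0
    for length in range(1, len(pattern) + 1):
        prefix = pattern[:length]
        # A prefix qualifies if some common segment ends with it.
        if any(segment.endswith(prefix) for segment in common_segments):
            count += 1
    return count


def total_length(r: int, n: int, N: int) -> int:
    """Total length of all sequences, rn + N, as stated in the abstract."""
    return r * n + N


# Made-up example: only the prefix "TAC" of "TACGA" ends a common segment.
print(psc("TACGA", ["ACGTAC", "GGTAC"]))  # 1
# Prefixes "C" and "CC" both end some segment; "CCA" does not.
print(psc("CCA", ["ACC", "TTC"]))  # 2
# Four sequences sharing 1000 characters of common segments, 40 distinct characters overall.
print(total_length(r=4, n=1000, N=40))  # 4040
```

The naive loop is quadratic in the pattern length and is meant only to pin down the definition. The search-time bound depends on psc(P), which the abstract says is usually a small constant in practice.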




Copyright information

© 2010 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M. (2010). Indexing Similar DNA Sequences. In: Chen, B. (eds) Algorithmic Aspects in Information and Management. AAIM 2010. Lecture Notes in Computer Science, vol 6124. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14355-7_19

  • DOI: https://doi.org/10.1007/978-3-642-14355-7_19

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-14354-0

  • Online ISBN: 978-3-642-14355-7

  • eBook Packages: Computer Science, Computer Science (R0)
