Indexing Similar DNA Sequences

Huang, Songbo; Lam, T. W.; Sung, W. K.; Tam, S. L.; Yiu, S. M.

doi:10.1007/978-3-642-14355-7_19

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 6124)

Included in the following conference series:

International Conference on Algorithmic Applications in Management

Abstract

To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. Building an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme that indexes highly similar sequences by taking advantage of the similarity among them. To store r sequences with k common segments, our index requires only O(n + N log N) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all sequences. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m + m log N + m log(rk) psc(P) + occ log n) time, where psc(P) is the number of prefixes of P that appear as a suffix of some common segment and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution and evaluated it on real DNA sequences. The experiments show that it needs much less memory than a BWT index built on the concatenation of all sequences. Compared with the other existing solution (RLCSA), it uses less memory and searches faster.
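
As a rough illustration of the quantities in these bounds, the Python sketch below computes psc(P) by brute force and compares the total sequence length rn + N with the growth term n + N log N. It is not the authors' index or search algorithm. The function names (`psc`, `total_length`, `index_bits_growth`) are hypothetical, the sketch assumes that psc(P) counts non-empty prefixes including P itself, and the growth term ignores the constant factors hidden by the O-notation.

```python
import math
from typing import Iterable


def psc(pattern: str, common_segments: Iterable[str]) -> int:
    """Count the non-empty prefixes of `pattern` that are a suffix of at
    least one common segment (brute force; assumes P itself is included)."""
    segments = list(common_segments)
    count = 0
    for length in range(1, len(pattern) + 1):
        prefix = pattern[:length]
        if any(segment.endswith(prefix) for segment in segments):
            count += 1
    return count


def total_length(r: int, n: int, N: int) -> int:
    """Total length of all r sequences, rn + N, as stated in the abstract."""
    return r * n + N


def index_bits_growth(n: int, N: int) -> float:
    """Growth term n + N log2(N) of the index size in bits.

    Constant factors are ignored, so this is only useful for seeing
    that the term does not depend on the number of sequences r.
    """
    if N < 2:
        return float(n)
    return n + N * math.log2(N)
```

For example, `psc("ACGA", ["ACGT", "GGAC"])` counts only the prefix `AC`, which is a suffix of `GGAC`. Increasing `r` raises `total_length(r, n, N)` but leaves `index_bits_growth(n, N)` unchanged, which is the saving the abstract describes for highly similar sequences.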

Author information

Authors and Affiliations

Department of Computer Science, The University of Hong Kong, Hong Kong
Songbo Huang, T. W. Lam, S. L. Tam & S. M. Yiu
Department of Computer Science, National University of Singapore, Singapore
W. K. Sung

Editor information

Editors and Affiliations

Warwick Business School / DIMAP (Centre for Discrete Mathematics and its Applications), University of Warwick, Coventry, CV4 7AL, UK
Bo Chen

About this paper

Cite this paper

Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M. (2010). Indexing Similar DNA Sequences. In: Chen, B. (ed.) Algorithmic Aspects in Information and Management. AAIM 2010. Lecture Notes in Computer Science, vol 6124. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14355-7_19

DOI: https://doi.org/10.1007/978-3-642-14355-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14354-0
Online ISBN: 978-3-642-14355-7
eBook Packages: Computer Science, Computer Science (R0)
