Abstract
To study the genetic variations of a species, one basic operation is to search for occurrences of patterns in a large number of very similar genomic sequences. To build an indexing data structure on the concatenation of all sequences may require a lot of memory. In this paper, we propose a new scheme to index highly similar sequences by taking advantage of the similarity among the sequences. To store r sequences with k common segments, our index requires only O(n + NlogN) bits of memory, where n is the total length of the common segments and N is the total length of the distinct regions in all texts. The total length of all sequences is rn + N, and any scheme to store these sequences requires Ω(n + N) bits. Searching for a pattern P of length m takes O(m + m logN + m log(rk)psc(P) + occlogn), where psc(P) is the number of prefixes of P that appear as a suffix of some common segments and occ is the number of occurrences of P in all sequences. In practice, rk ≤ N, and psc(P) is usually a small constant. We have implemented our solution and evaluated our solution using real DNA sequences. The experiments show that the memory requirement of our solution is much less than that required by BWT built on the concatenation of all sequences. When compared to the other existing solution (RLCSA), we use less memory with faster searching time.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Preview
Unable to display preview. Download preview PDF.
Similar content being viewed by others
References
Briniza, D., He, J., Zelikovsky, A.: Combinatorial search methods for multi-SNP disease association. In: EMBS, pp. 5802–5805 (2006)
Burrows, M., Wheeler, D.J.: A block-sorting lossless data compression algorithm. Technical Report 124, Digital Equipment Corporation, California (1994)
Emahazion, T., Feuk, L., Jobs, M., Sawyer, S.L., Fredman, D., Clair, D.S., Prince, J.A., Brookes, A.J.: SNP association studies in Alzheimer’s disease highlight problems for complex disease analysis. Trends in Genetics 17(7), 407–413 (2001)
Ferragina, P., Manzini, G.: Opportunistic data structures with applications. In: FOCS, pp. 390–398 (2000)
Ferragina, P., Manzini, G.: An experimental study of an opportunistic index. In: SODA, pp. 269–278 (2001)
Grossi, R., Vitter, J.S.: Compressed suffix arrays and suffix trees with applications to text indexing and string matching. In: STOC, pp. 397–406 (2000)
Gusfield, D.: Algorithms on strings, trees, and sequences. Cambridge University Press, Cambridge (1997)
Kao, M.-Y. (ed.): Encyclopedia of Algorithms. Springer, Heidelberg (2008)
Lam, T.W., Sung, W.K., Tam, S.L., Wong, C.K., Yiu, S.M.: Compressed indexing and local alignment of DNA. Bioinformatics 24(6), 791–797 (2008)
Lippert, R.A.: Space-efficient whole genome comparisons with Burrows-Wheeler transforms. Journal of Computational Biology 12(4), 407–415 (2005)
Mäkinen, V., Navarro, G.: Succinct suffix arrays based on run-length encoding. Nordic Journal of Computing 12(1), 40–66 (2005)
Mäkinen, V., Navarro, G., Sirén, J., Välimäki, N.: Storage and retrieval of individual genomes. In: Batzoglou, S. (ed.) RECOMB 2009. LNCS, vol. 5541, pp. 121–137. Springer, Heidelberg (2009)
Nekrich, Y.: Orthogonal range searching in linear and almost-linear space. Computational Geometry: Theory and Applications 42(4), 342–351 (2009)
Szpankowski, W.: Probabilistic analysis of generalized suffix trees. In: CPM, pp. 1–14 (1992)
Author information
Authors and Affiliations
Editor information
Editors and Affiliations
Rights and permissions
Copyright information
© 2010 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Huang, S., Lam, T.W., Sung, W.K., Tam, S.L., Yiu, S.M. (2010). Indexing Similar DNA Sequences. In: Chen, B. (eds) Algorithmic Aspects in Information and Management. AAIM 2010. Lecture Notes in Computer Science, vol 6124. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-14355-7_19
Download citation
DOI: https://doi.org/10.1007/978-3-642-14355-7_19
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-14354-0
Online ISBN: 978-3-642-14355-7
eBook Packages: Computer ScienceComputer Science (R0)