Abstract
Indexing of factors or substrings is a widely used and useful technique in stringology and can be seen as a tool in solving diverse text algorithmic problems. A gapped-factor is a concatenation of a factor of length k, a gap of length d and another factor of length k′. Such a gapped factor is called a (k−d−k′)-gapped-factor. The problem of indexing the gapped-factors was considered recently by Peterlongo et al. (In: Stringology, pp. 182–196, 2006). In particular, Peterlongo et al. devised a data structure, namely a gapped factor tree (GFT) to index the gapped-factors. Given a text \(\mathcal{T}\) of length n over the alphabet Σ and the values of the parameters k, d and k′, the construction of GFT requires O(n|Σ|) time. Once GFT is constructed, a given (k−d−k′)-gapped-factor can be reported in O(k+k′+Occ) time, where Occ is the number of occurrences of that factor in \(\mathcal{T}\) . In this paper, we present a new improved indexing scheme for the gapped-factors. The improvements we achieve come from two aspects. Firstly, we generalize the indexing data structure in the sense that, unlike GFT, it is independent of the parameters k and k′. Secondly, our data structure can be constructed in O(nlog 1+ε n) time and space, where 0<ε<1. The only price we pay is a slight increase, i.e. an additional log log n term, in the query time.
Similar content being viewed by others
References
Agarwal, P.K., Govindarajan, S., Muthukrishnan, S.: Range searching in categorical data: Colored range searching on grid. In: Möhring, R.H., Raman, R. (eds.) ESA. Lecture Notes in Computer Science, vol. 2461, pp. 17–28. Springer, New York (2002)
Allali, J., Sagot, M.-F.: The at most k-deep factor tree. Report 2004-03, Institut Gaspard Monge, Université de Marne-la-Vallée (2004)
Alstrup, S., Brodal, G.S., Rauhe, T.: New data structures for orthogonal range searching. In: FOCS, pp. 198–207 (2000)
Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words. NATO ISI Series, pp. 85–96. Springer, Berlin (1985)
Brudno, M., Chapman, M., Göttgens, B., Batzoglou, S., Morgenstern, B.: Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinf. 4, 66 (2003)
Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Green, E.D., Sidow, A., Batzoglou, S.: Lagan and multi-lagan: Efficient tools for large-scale multiple alignment of genomic dna. Genome Res. 13(4), 721–731 (2003)
Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Babai, L. (ed.) STOC, pp. 91–100. ACM, Singapore (2004)
Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific, Singapore (2002)
Crochemore, M., Iliopoulos, C.S., Mohamed, M., Sagot, M.-F.: Longest repeats with a block of don’t cares. Theor. Comput. Sci. 362(1–3), 248–254 (2006)
Edgar, R.C.: Muscle: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5), 1792–1797 (2004)
Farach, M.: Optimal suffix tree construction with large alphabets. In: FOCS, pp. 137–143 (1997)
Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Apers, P.M.G., Atzeni, P., Ceri, S., Paraboschi, S., Ramamohanarao, K., Snodgrass, R.T. (eds.) VLDB, pp. 491–500. Morgan Kaufmann, San Mateo (2001)
Gusfield, D.: Algorithms on Strings, Trees, and Sequences—Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)
Höhl, M., Kurtz, S., Ohlebusch, E.: Efficient multiple genome alignment. In: ISMB, pp. 312–320 (2002)
Iliopoulos, C.S., McHugh, J.A.M., Peterlongo, P., Pisanti, N., Rytter, W., Sagot, M.-F.: A first approach to finding common motifs with gaps. Int. J. Found. Comput. Sci. 16(6), 1145–1154 (2005)
Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP. Lecture Notes in Computer Science, vol. 2719, pp. 943–955. Springer, New York (2003)
Kim, D.K., Sim, J.S., Park, H., Park, K.: Constructing suffix arrays in linear time. J. Discrete Algorithms 3(2–4), 126–142 (2005)
Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. J. Discrete Algorithms 3(2–4), 143–156 (2005)
Li, M., Ma, B., Kisman, D., Tromp, J.: Patternhunter ii: Highly sensitive and fast homology search. Genome Inf. 14, 164–175 (2003)
Ma, B., Tromp, J., Li, M.: Patternhunter: faster and more sensitive homology search. Bioinformatics 18(3), 440–445 (2002)
Maaß, M.G., Nowak, J.: Text indexing with errors. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM. Lecture Notes in Computer Science, vol. 3537, pp. 21–32. Springer, New York (2005)
Maaß, M.G., Nowak, J.: Text indexing with errors. J. Discrete Algorithms 5(4), 662–681 (2007). doi:10.1016/j.jda.2006.11.001, selected papers from Combinatorial Pattern Matching (CPM) 2005, December 2007
Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)
McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)
Michael, M., Dieterich, C., Vingron, M.: Siteblast-rapid and sensitive local alignment of genomic sequences employing motif anchors. Bioinformatics 21(9), 2093–2094 (2005)
Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: SODA, pp. 657–666 (2002)
Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Giancarlo, R., Sankoff, D. (eds.) CPM. Lecture Notes in Computer Science, vol. 1848, pp. 350–363. Springer, New York (2000)
Peterlongo, P., Allali, J., Sagot, M.-F.: The gapped-factor tree. In: Holub, J., Zdárek, J. (eds.) Stringology, pp. 182–196. Czech Technical University, Prague (2006)
Rahman, M.S., Iliopoulos, C.S.: Indexing factors with gaps. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plasil, F. (eds.) SOFSEM (1). Lecture Notes in Computer Science, vol. 4362, pp. 465–474. Springer, New York (2007)
Rahman, M.S., Iliopoulos, C.S., Lee, I., Mohamed, M., Smyth, W.F.: Finding patterns with variable length gaps or don’t cares. In: Chen, D.Z., Lee, D.T. (eds.) COCOON. Lecture Notes in Computer Science, vol. 4112, pp. 146–155. Springer, New York (2006)
Sutinen, E., Tarhio, J.: On using q-gram locations in approximate string matching. In: Spirakis, P.G. (ed.) ESA. Lecture Notes in Computer Science, vol. 979, pp. 327–340. Springer, Berlin (1995)
Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)
Author information
Authors and Affiliations
Corresponding author
Additional information
Preliminary version appeared in [29].
C.S. Iliopoulos is supported by EPSRC and Royal Society grants.
M.S. Rahman is supported by the Commonwealth Scholarship Commission in the UK under the Commonwealth Scholarship and Fellowship Plan (CSFP).
M.S. Rahman is on leave from Department of CSE, BUET, Dhaka 1000, Bangladesh.
Rights and permissions
About this article
Cite this article
Iliopoulos, C.S., Rahman, M.S. Indexing Factors with Gaps. Algorithmica 55, 60–70 (2009). https://doi.org/10.1007/s00453-007-9141-3
Received:
Accepted:
Published:
Issue Date:
DOI: https://doi.org/10.1007/s00453-007-9141-3