Indexing Factors with Gaps

Algorithmica

Abstract

Indexing factors (substrings) is a widely used technique in stringology and serves as a tool for solving diverse text-algorithmic problems. A gapped-factor is the concatenation of a factor of length k, a gap of length d, and another factor of length k′; such a gapped-factor is called a (k-d-k′)-gapped-factor. The problem of indexing gapped-factors was considered recently by Peterlongo et al. (In: Stringology, pp. 182–196, 2006), who devised a data structure, the gapped-factor tree (GFT), for this purpose. Given a text \(\mathcal{T}\) of length n over the alphabet Σ and values of the parameters k, d and k′, the GFT can be constructed in O(n|Σ|) time. Once the GFT is constructed, the occurrences of a given (k-d-k′)-gapped-factor can be reported in O(k+k′+Occ) time, where Occ is the number of occurrences of that gapped-factor in \(\mathcal{T}\). In this paper, we present an improved indexing scheme for gapped-factors. The improvement is twofold. Firstly, our data structure is more general: unlike the GFT, it is independent of the parameters k and k′. Secondly, it can be constructed in O(n log^{1+ε} n) time and space, where 0<ε<1. The only price we pay is a slight increase in the query time, namely an additional log log n term.
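To make the definition concrete, the following minimal Python sketch reports the occurrences of a gapped-factor by brute force. It only illustrates what a query asks for. It is not the GFT of Peterlongo et al., nor the data structure proposed in this paper, and it uses no index at all. The function name `gapped_factor_occurrences` and its parameters are hypothetical, chosen for illustration.

```python
def gapped_factor_occurrences(text, left, gap, right):
    """Return the 0-based start positions in `text` where `left` occurs,
    followed by exactly `gap` arbitrary characters, followed by `right`.

    Brute-force baseline for illustration only (assumed helper, not the
    paper's method): it scans every candidate position of the text.
    """
    k, k_prime = len(left), len(right)
    span = k + gap + k_prime  # total length covered by one occurrence
    hits = []
    for i in range(len(text) - span + 1):
        # The first factor must match at i, the second at i + k + gap.
        if text.startswith(left, i) and text.startswith(right, i + k + gap):
            hits.append(i)
    return hits


# Query: "AC", a gap of length 2, then "AC" (k = 2, d = 2, k' = 2).
occurrences = gapped_factor_occurrences("ACGTACGGACGTTCGA", "AC", 2, "AC")
print(occurrences)  # [0, 4]
```

Each such query scans the whole text, which is the cost an index is meant to avoid: once built, both the GFT and the scheme described above answer a query in time that depends on the factor lengths and the number of occurrences rather than on a fresh scan of the text.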

References

  1. Agarwal, P.K., Govindarajan, S., Muthukrishnan, S.: Range searching in categorical data: Colored range searching on grid. In: Möhring, R.H., Raman, R. (eds.) ESA. Lecture Notes in Computer Science, vol. 2461, pp. 17–28. Springer, New York (2002)

  2. Allali, J., Sagot, M.-F.: The at most k-deep factor tree. Report 2004-03, Institut Gaspard Monge, Université de Marne-la-Vallée (2004)

  3. Alstrup, S., Brodal, G.S., Rauhe, T.: New data structures for orthogonal range searching. In: FOCS, pp. 198–207 (2000)

  4. Apostolico, A.: The myriad virtues of subword trees. In: Combinatorial Algorithms on Words. NATO ISI Series, pp. 85–96. Springer, Berlin (1985)

  5. Brudno, M., Chapman, M., Göttgens, B., Batzoglou, S., Morgenstern, B.: Fast and sensitive multiple alignment of large genomic sequences. BMC Bioinf. 4, 66 (2003)

  6. Brudno, M., Do, C.B., Cooper, G.M., Kim, M.F., Davydov, E., Green, E.D., Sidow, A., Batzoglou, S.: LAGAN and Multi-LAGAN: Efficient tools for large-scale multiple alignment of genomic DNA. Genome Res. 13(4), 721–731 (2003)

  7. Cole, R., Gottlieb, L.-A., Lewenstein, M.: Dictionary matching and indexing with errors and don’t cares. In: Babai, L. (ed.) STOC, pp. 91–100. ACM, Singapore (2004)

  8. Crochemore, M., Rytter, W.: Jewels of Stringology. World Scientific, Singapore (2002)

  9. Crochemore, M., Iliopoulos, C.S., Mohamed, M., Sagot, M.-F.: Longest repeats with a block of don’t cares. Theor. Comput. Sci. 362(1–3), 248–254 (2006)

  10. Edgar, R.C.: MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 32(5), 1792–1797 (2004)

  11. Farach, M.: Optimal suffix tree construction with large alphabets. In: FOCS, pp. 137–143 (1997)

  12. Gravano, L., Ipeirotis, P.G., Jagadish, H.V., Koudas, N., Muthukrishnan, S., Srivastava, D.: Approximate string joins in a database (almost) for free. In: Apers, P.M.G., Atzeni, P., Ceri, S., Paraboschi, S., Ramamohanarao, K., Snodgrass, R.T. (eds.) VLDB, pp. 491–500. Morgan Kaufmann, San Mateo (2001)

  13. Gusfield, D.: Algorithms on Strings, Trees, and Sequences—Computer Science and Computational Biology. Cambridge University Press, Cambridge (1997)

  14. Höhl, M., Kurtz, S., Ohlebusch, E.: Efficient multiple genome alignment. In: ISMB, pp. 312–320 (2002)

  15. Iliopoulos, C.S., McHugh, J.A.M., Peterlongo, P., Pisanti, N., Rytter, W., Sagot, M.-F.: A first approach to finding common motifs with gaps. Int. J. Found. Comput. Sci. 16(6), 1145–1154 (2005)

  16. Kärkkäinen, J., Sanders, P.: Simple linear work suffix array construction. In: Baeten, J.C.M., Lenstra, J.K., Parrow, J., Woeginger, G.J. (eds.) ICALP. Lecture Notes in Computer Science, vol. 2719, pp. 943–955. Springer, New York (2003)

  17. Kim, D.K., Sim, J.S., Park, H., Park, K.: Constructing suffix arrays in linear time. J. Discrete Algorithms 3(2–4), 126–142 (2005)

  18. Ko, P., Aluru, S.: Space efficient linear time construction of suffix arrays. J. Discrete Algorithms 3(2–4), 143–156 (2005)

  19. Li, M., Ma, B., Kisman, D., Tromp, J.: PatternHunter II: Highly sensitive and fast homology search. Genome Inf. 14, 164–175 (2003)

  20. Ma, B., Tromp, J., Li, M.: PatternHunter: faster and more sensitive homology search. Bioinformatics 18(3), 440–445 (2002)

  21. Maaß, M.G., Nowak, J.: Text indexing with errors. In: Apostolico, A., Crochemore, M., Park, K. (eds.) CPM. Lecture Notes in Computer Science, vol. 3537, pp. 21–32. Springer, New York (2005)

  22. Maaß, M.G., Nowak, J.: Text indexing with errors. J. Discrete Algorithms 5(4), 662–681 (2007). doi:10.1016/j.jda.2006.11.001, selected papers from Combinatorial Pattern Matching (CPM) 2005, December 2007

  23. Manber, U., Myers, E.W.: Suffix arrays: A new method for on-line string searches. SIAM J. Comput. 22(5), 935–948 (1993)

  24. McCreight, E.M.: A space-economical suffix tree construction algorithm. J. ACM 23(2), 262–272 (1976)

  25. Michael, M., Dieterich, C., Vingron, M.: SITEBLAST: rapid and sensitive local alignment of genomic sequences employing motif anchors. Bioinformatics 21(9), 2093–2094 (2005)

  26. Muthukrishnan, S.: Efficient algorithms for document retrieval problems. In: SODA, pp. 657–666 (2002)

  27. Navarro, G., Sutinen, E., Tanninen, J., Tarhio, J.: Indexing text with approximate q-grams. In: Giancarlo, R., Sankoff, D. (eds.) CPM. Lecture Notes in Computer Science, vol. 1848, pp. 350–363. Springer, New York (2000)

  28. Peterlongo, P., Allali, J., Sagot, M.-F.: The gapped-factor tree. In: Holub, J., Zdárek, J. (eds.) Stringology, pp. 182–196. Czech Technical University, Prague (2006)

  29. Rahman, M.S., Iliopoulos, C.S.: Indexing factors with gaps. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plasil, F. (eds.) SOFSEM (1). Lecture Notes in Computer Science, vol. 4362, pp. 465–474. Springer, New York (2007)

  30. Rahman, M.S., Iliopoulos, C.S., Lee, I., Mohamed, M., Smyth, W.F.: Finding patterns with variable length gaps or don’t cares. In: Chen, D.Z., Lee, D.T. (eds.) COCOON. Lecture Notes in Computer Science, vol. 4112, pp. 146–155. Springer, New York (2006)

  31. Sutinen, E., Tarhio, J.: On using q-gram locations in approximate string matching. In: Spirakis, P.G. (ed.) ESA. Lecture Notes in Computer Science, vol. 979, pp. 327–340. Springer, Berlin (1995)

  32. Ukkonen, E.: On-line construction of suffix trees. Algorithmica 14(3), 249–260 (1995)

Author information

Corresponding author

Correspondence to M. Sohel Rahman.

Additional information

Preliminary version appeared in [29].

C.S. Iliopoulos is supported by EPSRC and Royal Society grants.

M.S. Rahman is supported by the Commonwealth Scholarship Commission in the UK under the Commonwealth Scholarship and Fellowship Plan (CSFP).

M.S. Rahman is on leave from Department of CSE, BUET, Dhaka 1000, Bangladesh.

About this article

Cite this article

Iliopoulos, C.S., Rahman, M.S. Indexing Factors with Gaps. Algorithmica 55, 60–70 (2009). https://doi.org/10.1007/s00453-007-9141-3

  • DOI: https://doi.org/10.1007/s00453-007-9141-3
