Compressed Directed Acyclic Word Graph with Application in Local Alignment

Algorithmica

Abstract

Suffix trees, suffix arrays, and directed acyclic word graphs (DAWGs) are data structures for indexing a text. Although they enable efficient pattern matching, they require \(O(n \log n)\) bits, which makes them impractical for indexing long texts such as the human genome. Recent developments in compressed data structures make it possible to simulate suffix trees and suffix arrays using \(O(n)\) bits. However, there is still no \(O(n)\)-bit data structure for the DAWG with full functionality. This work introduces an \(n(H_{k}(\overline{S})+ 2 H_{0}^{*}(\mathcal{T}_{\overline{S}}))+o(n)\)-bit compressed data structure for simulating the DAWG, where \(H_{k}(\overline{S})\) and \(H_{0}^{*}(\mathcal{T}_{\overline{S}})\) are the empirical entropies of the reversed sequence and of the reversed suffix tree topology, respectively. We also propose an application of the DAWG that improves the time complexity of the local alignment problem. For this problem, the previously proposed solutions using the BWT (a version of the compressed suffix array) run in \(O(n^{2} m)\) worst-case time and \(O(n^{0.628} m)\) average-case time, where \(n\) and \(m\) are the lengths of the database and the query, respectively. Using the compressed DAWG proposed in this paper, the problem can be solved in \(O(nm)\) worst-case time with the same average-case time.
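
For readers unfamiliar with the structure being compressed, the following is a minimal sketch of the classical, uncompressed DAWG (also known as the suffix automaton), built with the standard online construction. It is not the compressed data structure this paper introduces: it stores explicit pointers and therefore uses \(O(n \log n)\) bits, which is exactly the cost the abstract describes. The names `State`, `build_dawg`, and `contains` are illustrative choices, not taken from the paper.

```python
class State:
    """One DAWG node: outgoing edges, suffix link, longest string length."""
    __slots__ = ("next", "link", "length")

    def __init__(self, length=0, link=-1):
        self.next = {}        # character -> index of target state
        self.link = link      # suffix link (-1 only for the initial state)
        self.length = length  # length of the longest string reaching this state


def build_dawg(text):
    """Build the DAWG (suffix automaton) of `text` one character at a time."""
    states = [State()]  # state 0 is the initial state
    last = 0            # state that represents the whole prefix read so far
    for ch in text:
        cur = len(states)
        states.append(State(length=states[last].length + 1))
        p = last
        # Add the new edge along the suffix-link chain until one already exists.
        while p != -1 and ch not in states[p].next:
            states[p].next[ch] = cur
            p = states[p].link
        if p == -1:
            states[cur].link = 0
        else:
            q = states[p].next[ch]
            if states[p].length + 1 == states[q].length:
                states[cur].link = q
            else:
                # Split q: the clone keeps q's edges but has a shorter length.
                clone = len(states)
                copy = State(length=states[p].length + 1, link=states[q].link)
                copy.next = dict(states[q].next)
                states.append(copy)
                while p != -1 and states[p].next.get(ch) == q:
                    states[p].next[ch] = clone
                    p = states[p].link
                states[q].link = clone
                states[cur].link = clone
        last = cur
    return states


def contains(states, pattern):
    """Return True if `pattern` is a substring of the indexed text."""
    s = 0
    for ch in pattern:
        s = states[s].next.get(ch)
        if s is None:
            return False
    return True


dawg = build_dawg("abcbc")
print(contains(dawg, "cbc"))  # True
print(contains(dawg, "cc"))   # False
```

A pattern of length \(m\) is matched by following at most \(m\) edges from the initial state, independent of \(n\). The sketch uses a Python dictionary per state for clarity rather than compactness.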

Author information

Corresponding author

Correspondence to Huy Hoang Do.

About this article

Cite this article

Do, H.H., Sung, W.K. Compressed Directed Acyclic Word Graph with Application in Local Alignment. Algorithmica 67, 125–141 (2013). https://doi.org/10.1007/s00453-013-9794-z

  • DOI: https://doi.org/10.1007/s00453-013-9794-z
