Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory

  • Conference paper
String Processing and Information Retrieval (SPIRE 2006)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4209)

Abstract

Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we examine the resource requirements of compressed suffix array algorithms against compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams, thus allowing full pattern matching capabilities rather than simple word-based search, which makes its functionality equivalent to that of the compressed suffix array structures. When using equivalent memory for the two structures, inverted files are faster at reporting the location of patterns when the number of occurrences of the patterns is high.
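To make the comparison concrete, the following is a minimal, uncompressed sketch of the two kinds of structure the abstract contrasts: an inverted file over q-grams and a plain suffix array, each used to report all positions of a pattern. It is not the authors' implementation. The function names, the choice of q, the set-based list intersection, and the naive suffix array construction are all assumptions made for illustration, and no compression is applied to either structure.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Inverted file: map each q-gram to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate_inverted(index, q, pattern):
    """Report start positions of `pattern` by intersecting q-gram lists.

    Illustrative assumption: only patterns of length >= q are handled.
    """
    if len(pattern) < q:
        raise ValueError("this sketch only handles patterns of length >= q")
    # Non-overlapping q-grams, plus one final q-gram that covers the tail.
    offsets = list(range(0, len(pattern) - q + 1, q))
    if offsets[-1] != len(pattern) - q:
        offsets.append(len(pattern) - q)
    candidates = None
    for off in offsets:
        # Shift each posting back to the position where the pattern would start.
        shifted = {p - off for p in index.get(pattern[off:off + q], ())}
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            return []
    return sorted(candidates)


def build_suffix_array(text):
    """Naive construction: sort suffix start positions lexicographically."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate_suffix_array(text, sa, pattern):
    """Report start positions of `pattern` with two binary searches."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


text = "abracadabra"
index = build_qgram_index(text, 2)
sa = build_suffix_array(text)
print(locate_inverted(index, 2, "abra"))        # [0, 7]
print(locate_suffix_array(text, sa, "abra"))    # [0, 7]
```

Both functions return the same sorted positions. In the inverted file the work is dominated by reading and intersecting position lists, while in the suffix array it is dominated by the binary search followed by reading one contiguous range.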

© 2006 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Puglisi, S.J., Smyth, W.F., Turpin, A. (2006). Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_11

  • DOI: https://doi.org/10.1007/11880561_11

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-540-45774-9

  • Online ISBN: 978-3-540-45775-6

  • eBook Packages: Computer Science (R0)
