Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory

Puglisi, Simon J.; Smyth, W. F.; Turpin, Andrew

doi:10.1007/11880561_11

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 4209)

Included in the following conference series:

International Symposium on String Processing and Information Retrieval

Abstract

Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we compare the resource requirements of compressed suffix array algorithms with those of compressed inverted file data structures for general pattern matching in genomic and English texts. In both cases, the inverted file indexes q-grams rather than words, which allows full pattern matching rather than simple word-based search and makes its functionality equivalent to that of the compressed suffix array structures. When the two structures use equivalent memory, inverted files are faster at reporting the locations of patterns when the number of occurrences of the patterns is high.
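
To make the comparison concrete, here is a minimal, uncompressed Python sketch of the two ways of locating a pattern: a q-gram inverted file and a suffix array. It is illustrative only and is not the authors' implementation. The function names, the choice of q = 2 in the usage note, and the naive suffix sort are assumptions made for brevity, and neither structure is compressed here.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Inverted file: map each q-gram to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate_qgram(index, q, pattern):
    """Report every start of `pattern` (length >= q) via posting-list intersection."""
    m = len(pattern)
    if m < q:
        raise ValueError("pattern must be at least q characters long")
    # Cover the pattern with q-grams at offsets 0, q, 2q, ... and add one
    # final, possibly overlapping, q-gram so the tail is covered as well.
    offsets = list(range(0, m - q + 1, q))
    if offsets[-1] != m - q:
        offsets.append(m - q)
    candidates = None
    for off in offsets:
        # Shift each posting back by its offset to get a candidate pattern start.
        starts = {p - off for p in index.get(pattern[off:off + q], ())}
        candidates = starts if candidates is None else candidates & starts
        if not candidates:
            return []
    return sorted(candidates)


def build_suffix_array(text):
    """Naive construction (assumption: fine for a demo, too slow for real texts)."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def locate_suffix_array(text, sa, pattern):
    """Report every start of `pattern` with two binary searches over `sa`."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is >= pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    left, hi = lo, len(sa)
    while lo < hi:  # first suffix whose m-character prefix is > pattern
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[left:lo])
```

For example, with `text = "abracadabra"` and `q = 2`, both `locate_qgram(build_qgram_index(text, 2), 2, "abra")` and `locate_suffix_array(text, build_suffix_array(text), "abra")` report the positions 0 and 7. In this sketch the inverted file answers by intersecting posting lists, while the suffix array finds a contiguous range and then reads the positions from it.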

Author information

Authors and Affiliations

Curtin University of Technology, Perth, Australia
Simon J. Puglisi & W. F. Smyth
McMaster University, Hamilton, Canada
W. F. Smyth
RMIT University, Melbourne, Australia
Andrew Turpin

Editor information

Editors and Affiliations

Department of Computer and Information Science, University of Strathclyde, Scotland
Fabio Crestani
Dipartimento di Informatica, University of Pisa, Largo B. Pontecorvo 3, 56127, Pisa, Italy
Paolo Ferragina
Department of Information Studies, University of Sheffield, Sheffield, UK
Mark Sanderson

About this paper

Cite this paper

Puglisi, S.J., Smyth, W.F., Turpin, A. (2006). Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_11

DOI: https://doi.org/10.1007/11880561_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science, Computer Science (R0)
