Abstract
Recent advances in the asymptotic resource costs of pattern matching with compressed suffix arrays are attractive, but a key rival structure, the compressed inverted file, has been dismissed or ignored in papers presenting the new structures. In this paper we compare the resource requirements of compressed suffix array algorithms with those of compressed inverted file data structures for general pattern matching in genomic and English texts. For both kinds of text, the inverted file indexes q-grams rather than words, so it supports full pattern matching instead of simple word-based search, making its functionality equivalent to that of the compressed suffix array structures. When the two structures are given equivalent memory, inverted files are faster at reporting the locations of patterns when the number of occurrences of the patterns is high.
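To make the idea of a q-gram inverted file concrete, the following is a minimal sketch, not the implementation evaluated in the paper. It assumes uncompressed postings lists held in a Python dictionary, and the function names `build_qgram_index` and `locate` are hypothetical. A pattern of length at least q is located by intersecting the postings of q-grams that together cover the whole pattern, each list shifted by the q-gram's offset within the pattern.

```python
from collections import defaultdict


def build_qgram_index(text, q):
    """Map every q-gram of text to the increasing list of its start positions."""
    index = defaultdict(list)
    for i in range(len(text) - q + 1):
        index[text[i:i + q]].append(i)
    return index


def locate(index, q, pattern):
    """Return the sorted start positions of pattern (requires len(pattern) >= q)."""
    m = len(pattern)
    if m < q:
        raise ValueError("pattern must be at least q characters long")
    # Take q-grams at offsets 0, q, 2q, ... and add one aligned with the end
    # of the pattern, so that every pattern character is covered.
    offsets = list(range(0, m - q + 1, q))
    if offsets[-1] != m - q:
        offsets.append(m - q)
    candidates = None
    for off in offsets:
        # Shift each posting back by the offset to get a candidate pattern start.
        shifted = {p - off for p in index.get(pattern[off:off + q], ())}
        candidates = shifted if candidates is None else candidates & shifted
        if not candidates:
            return []
    return sorted(candidates)


if __name__ == "__main__":
    text = "ACGTACGTAC"
    idx = build_qgram_index(text, 3)
    print(locate(idx, 3, "ACGTA"))
    print(locate(idx, 3, "TAC"))
```

Because the chosen q-grams cover every character of the pattern, a position that survives all intersections is a true occurrence and needs no further check against the text. A compressed inverted file would additionally encode each postings list compactly, for example by storing gaps between successive positions; that step is omitted here.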
© 2006 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Puglisi, S.J., Smyth, W.F., Turpin, A. (2006). Inverted Files Versus Suffix Arrays for Locating Patterns in Primary Memory. In: Crestani, F., Ferragina, P., Sanderson, M. (eds) String Processing and Information Retrieval. SPIRE 2006. Lecture Notes in Computer Science, vol 4209. Springer, Berlin, Heidelberg. https://doi.org/10.1007/11880561_11
DOI: https://doi.org/10.1007/11880561_11
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-45774-9
Online ISBN: 978-3-540-45775-6
eBook Packages: Computer Science (R0)