Skip to main content

Adding Compression and Blended Search to a Compact Two-Level Suffix Array

  • Conference paper
String Processing and Information Retrieval (SPIRE 2013)

Part of the book series: Lecture Notes in Computer Science ((LNTCS,volume 8214))

Included in the following conference series:

Abstract

The suffix array is an efficient in-memory data structure for pattern search; and two-level variants also exist that are suited to external searching and can handle strings larger than the available memory. Assuming the latter situation, we introduce a factor-based mechanism for compressing the text string that integrates seamlessly with the in-memory index search structure, rather than requiring a separate dictionary. We also introduce a mixture of indexed and sequential pattern search in a trade-off that allows further space savings. Experiments on a 4 GB computer with 62.5 GB of English text show that a two-level arrangement is possible in which around 2.5% of the text size is required as an index and for which the disk-resident components, including the text itself, occupy less than twice the space of the original text; and with which count queries can be carried out using two disk accesses and less than two milliseconds of CPU time.

The original version of this chapter was revised: The copyright line was incorrect. This has been corrected. The Erratum to this chapter is available at DOI: 10.1007/978-3-319-02432-5_33

This is a preview of subscription content, log in via an institution to check access.

Access this chapter

Chapter
USD 29.95
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
eBook
USD 39.99
Price excludes VAT (USA)
  • Available as PDF
  • Read on any device
  • Instant download
  • Own it forever
Softcover Book
USD 54.99
Price excludes VAT (USA)
  • Compact, lightweight edition
  • Dispatched in 3 to 5 business days
  • Free shipping worldwide - see info

Tax calculation will be finalised at checkout

Purchases are for personal use only

Institutional subscriptions

Preview

Unable to display preview. Download preview PDF.

Unable to display preview. Download preview PDF.

References

  1. Baeza-Yates, R.A., Barbosa, E.F., Ziviani, N.: Hierarchies of indices for text searching. Inf. Systems 21(6), 497–514 (1996)

    Article  Google Scholar 

  2. Colussi, L., De Col, A.: A time and space efficient data structure for string searching on large texts. Inf. Processing Letters 58(5), 217–222 (1996)

    Article  MathSciNet  MATH  Google Scholar 

  3. Ferguson, M.P.: FEMTO: Fast search of large sequence collections. In: Kärkkäinen, J., Stoye, J. (eds.) CPM 2012. LNCS, vol. 7354, pp. 208–219. Springer, Heidelberg (2012)

    Chapter  Google Scholar 

  4. Ferragina, P., Grossi, R.: The string B-tree: A new data structure for search in external memory and its applications. J. ACM 46(2), 236–280 (1999)

    Article  MathSciNet  MATH  Google Scholar 

  5. Ferragina, P., Manzini, G.: Indexing compressed text. J. ACM 52(4), 552–581 (2005)

    Article  MathSciNet  MATH  Google Scholar 

  6. Gog, S., Moffat, A., Culpepper, J.S., Turpin, A., Wirth, A.: Large-scale pattern search using reduced-space on-disk suffix arrays. IEEE Trans. Knowledge and Data Engineering (to appear)

    Google Scholar 

  7. Gog, S., Petri, M.: Optimized succinct data structures for massive data. Software Practice & Experience (to appear, 2013), http://dx.doi.org/10.1002/spe.2198

  8. González, R., Navarro, G.: A compressed text index on secondary memory. J. Combinatorial Mathematics and Combinatorial Comp. 71, 127–154 (2009)

    MathSciNet  MATH  Google Scholar 

  9. Kärkkäinen, J., Rao, S.S.: Full-text indexes in external memory. In: Meyer, U., Sanders, P., Sibeyn, J.F. (eds.) Algorithms for Memory Hierarchies. LNCS, vol. 2625, pp. 149–170. Springer, Heidelberg (2003)

    Chapter  Google Scholar 

  10. Mäkinen, V., Navarro, G.: Compressed compact suffix arrays. In: Sahinalp, S.C., Muthukrishnan, S.M., Dogrusoz, U. (eds.) CPM 2004. LNCS, vol. 3109, pp. 420–433. Springer, Heidelberg (2004)

    Chapter  Google Scholar 

  11. Manber, U., Myers, G.W.: Suffix arrays: a new method for on-line string searches. SIAM J. Comp. 22(5), 935–948 (1993)

    Article  MathSciNet  MATH  Google Scholar 

  12. Moffat, A., Puglisi, S.J., Sinha, R.: Reducing space requirements for disk resident suffix arrays. In: Zhou, X., Yokota, H., Deng, K., Liu, Q. (eds.) DASFAA 2009. LNCS, vol. 5463, pp. 730–744. Springer, Heidelberg (2009)

    Chapter  Google Scholar 

  13. Sinha, R., Puglisi, S.J., Moffat, A., Turpin, A.: Improving suffix array locality for fast pattern matching on disk. In: Wang, J.T.-L. (ed.) Proc. ACM SIGMOD Int. Conf. Management of Data, pp. 661–672 (2008)

    Google Scholar 

Download references

Author information

Authors and Affiliations

Authors

Editor information

Editors and Affiliations

Rights and permissions

Reprints and permissions

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Gog, S., Moffat, A. (2013). Adding Compression and Blended Search to a Compact Two-Level Suffix Array. In: Kurland, O., Lewenstein, M., Porat, E. (eds) String Processing and Information Retrieval. SPIRE 2013. Lecture Notes in Computer Science, vol 8214. Springer, Cham. https://doi.org/10.1007/978-3-319-02432-5_18

Download citation

  • DOI: https://doi.org/10.1007/978-3-319-02432-5_18

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-02431-8

  • Online ISBN: 978-3-319-02432-5

  • eBook Packages: Computer ScienceComputer Science (R0)

Publish with us

Policies and ethics