Indexing Compressed Text

Ferragina, Paolo; Venturini, Rossano

doi:10.1007/978-0-387-39940-9_1144


Synonyms

Compressed full-text indexing; Compressed suffix array; Compressed suffix tree; Compressed and searchable data format

Definition

Given a text T[1,n], the Compressed Text Indexing problem requires building an indexing data structure over T that takes space close to the empirical entropy of the input text and answers queries on the occurrences of an arbitrary pattern P[1,p] in T without any significant slowdown with respect to uncompressed indexes. There are three main queries: count(P), which returns the number of occurrences of P in T; locate(P), which returns the starting positions of all occurrences of P in T; and extract(i, j), which retrieves the substring T[i, j].
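
The semantics of the three queries can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: it uses a plain, uncompressed suffix array built by naive sorting, so it does not achieve the entropy-bounded space that the problem asks for, and it is not the construction of any particular compressed index. The class name NaiveTextIndex and its method names are hypothetical; positions are 1-based to match the T[1,n] notation above.

```python
class NaiveTextIndex:
    """Uncompressed suffix-array index (illustration of query semantics only)."""

    def __init__(self, text):
        self.text = text
        # Suffix array: 0-based start positions of all suffixes,
        # listed in lexicographic order of the suffixes (naive construction).
        self.sa = sorted(range(len(text)), key=lambda i: text[i:])

    def _bound(self, pattern, strict):
        # Binary search over suffixes truncated to len(pattern).
        # strict=False finds the first suffix whose prefix is >= pattern,
        # strict=True finds the first suffix whose prefix is > pattern.
        p = len(pattern)
        lo, hi = 0, len(self.sa)
        while lo < hi:
            mid = (lo + hi) // 2
            start = self.sa[mid]
            prefix = self.text[start:start + p]
            if prefix < pattern or (strict and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    def _range(self, pattern):
        # Half-open range of suffix-array rows whose suffixes start with pattern.
        return self._bound(pattern, False), self._bound(pattern, True)

    def count(self, pattern):
        """Number of occurrences of pattern in the text."""
        lo, hi = self._range(pattern)
        return hi - lo

    def locate(self, pattern):
        """1-based starting positions of all occurrences, in increasing order."""
        lo, hi = self._range(pattern)
        return sorted(self.sa[k] + 1 for k in range(lo, hi))

    def extract(self, i, j):
        """Substring T[i, j], with 1-based inclusive endpoints."""
        return self.text[i - 1:j]


index = NaiveTextIndex("mississippi")
print(index.count("ssi"))    # number of occurrences of "ssi"
print(index.locate("ssi"))   # their 1-based starting positions
print(index.extract(5, 8))   # the substring T[5, 8]
```

In this sketch count and locate share the same binary search over the suffix array, which mirrors the way the definition treats counting as the cheaper query and locating as an additional step on top of it.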

Historical Background

String processing and searching tasks are at the core of modern web search, information retrieval (IR), database and data mining applications. Most of the text manipulations required by these applications involve, sooner or later, searching those (long) texts for (short) patterns or accessing...



Author information

Authors and Affiliations

University of Pisa, Pisa, Italy
Paolo Ferragina & Rossano Venturini


Editor information

Editors and Affiliations

College of Computing, Georgia Institute of Technology, 266 Ferst Drive, 30332-0765, Atlanta, GA, USA
LING LIU (Professor)
Database Research Group David R. Cheriton School of Computer Science, University of Waterloo, 200 University Avenue West, N2L 3G1, Waterloo, ON, Canada
M. TAMER ÖZSU (Professor and Director, University Research Chair)


About this entry

Cite this entry

Ferragina, P., Venturini, R. (2009). Indexing Compressed Text. In: LIU, L., ÖZSU, M.T. (eds) Encyclopedia of Database Systems. Springer, Boston, MA. https://doi.org/10.1007/978-0-387-39940-9_1144


DOI: https://doi.org/10.1007/978-0-387-39940-9_1144
Publisher Name: Springer, Boston, MA
Print ISBN: 978-0-387-35544-3
Online ISBN: 978-0-387-39940-9
eBook Packages: Computer Science; Reference Module Computer Science and Engineering
