DOI: 10.1145/1376916.1376943
research-article

On searching compressed string collections cache-obliviously

Published: 09 June 2008

Abstract

Current data structures for searching large string collections either fail to achieve minimum space or cause too many cache misses. In this paper we discuss some edge linearizations of the classic trie data structure that are simultaneously cache-friendly and compressed. We provide new insights on front coding [24], introduce other novel linearizations, and study how close their space occupancy is to the information-theoretic minimum. The moral is that they are not just heuristics. Our second contribution is a novel dictionary encoding scheme that builds upon such linearizations and achieves nearly optimal space, offers competitive I/O-search time, and is also conscious of the query distribution. Finally, we combine those data structures with cache-oblivious tries [2, 5] and obtain a succinct variant whose space is close to the information-theoretic minimum.
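For readers unfamiliar with front coding, the following is a minimal sketch of the textbook scheme only: each string in a sorted list is stored as the length of the prefix it shares with its predecessor, plus the remaining suffix. The function names `front_encode` and `front_decode` and the sample word list are illustrative assumptions; the paper's own linearizations and its dictionary encoding scheme are not reproduced here.

```python
def common_prefix_length(a, b):
    """Return the length of the longest common prefix of a and b."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def front_encode(sorted_strings):
    """Encode a sorted list as (shared_prefix_length, suffix) pairs.

    Assumes the input is already sorted; hypothetical helper for illustration.
    """
    encoded = []
    previous = ""
    for s in sorted_strings:
        lcp = common_prefix_length(previous, s)
        encoded.append((lcp, s[lcp:]))
        previous = s
    return encoded


def front_decode(encoded):
    """Rebuild the original sorted list from (shared_prefix_length, suffix) pairs."""
    decoded = []
    previous = ""
    for lcp, suffix in encoded:
        s = previous[:lcp] + suffix
        decoded.append(s)
        previous = s
    return decoded


words = sorted(["car", "card", "care", "cat", "dog"])
pairs = front_encode(words)
# pairs == [(0, "car"), (3, "d"), (3, "e"), (2, "t"), (0, "dog")]
assert front_decode(pairs) == words
```

Decoding a string in this plain form requires scanning from the start of the list, which is why practical variants typically restart the encoding at regular intervals; that refinement is omitted from the sketch.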

References

[1]
R. Bayer and K. Unterauer. Prefix B-trees. ACM Transactions on Database Systems, 2(1):11--26, 1977.
[2]
M. Bender, M. Farach-Colton, and B. Kuszmaul. Cache-oblivious string B-trees. In Proc. ACM PODS, 233--242, 2006.
[3]
D. Benoit, E. Demaine, I. Munro, R. Raman, V. Raman, and S. Rao. Representing trees of higher degree. Algorithmica, 43:275--292, 2005.
[4]
J. L. Bentley and R. Sedgewick. Fast algorithms for sorting and searching strings. In Proc. ACM-SIAM SODA, 360--369, 1996.
[5]
G. S. Brodal and R. Fagerberg. Cache-oblivious string dictionaries. In Proc. ACM-SIAM SODA, 581--590, 2006.
[6]
V. Ciriani, P. Ferragina, F. Luccio, and S. Muthukrishnan. A data structure for a sequence of string accesses in external memory. ACM Transactions on Algorithms, 3(1), 2007.
[7]
P. Ferragina and R. Grossi. The string B-tree: A new data structure for string search in external memory and its applications. Journal of the ACM, 46(2):236--280, 1999.
[8]
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Structuring labeled trees for optimal succinctness, and beyond. In Proc. IEEE FOCS, 184--193, 2005.
[9]
P. Ferragina, F. Luccio, G. Manzini, and S. Muthukrishnan. Compressing and searching XML data via two zips. In Proc. WWW, 751--760, 2006.
[10]
P. Ferragina and R. Venturini. Compressed permuterm index. In Proc. ACM SIGIR, 535--542, 2007.
[11]
M. Frigo, C. Leiserson, H. Prokop, and S. Ramachandran. Cache-oblivious algorithms. In Proc. IEEE FOCS, 285--298, 1999.
[12]
A. Golynski, R. Grossi, A. Gupta, R. Raman, and S. S. Rao. On the size of succinct indices. In Proc. ESA, LNCS 4698, 371--382, 2007.
[13]
M. He, J. I. Munro, and S. S. Rao. Succinct ordinal trees based on tree covering. In Proc. ICALP, LNCS 4596, 509--520, 2007.
[14]
G. Jacobson. Space-efficient static trees and graphs. In Proc. IEEE FOCS, 549--554, 1989.
[15]
J. Jansson, K. Sadakane, and W. Sung. Ultra-succinct representation of ordered trees. In Proc. ACM-SIAM SODA, 575--584, 2007.
[16]
D. E. Knuth. Sorting and Searching, volume 3 of The Art of Computer Programming. Addison-Wesley, Reading, second edition, 1998.
[17]
P. Ko and S. Aluru. Optimal self-adjusting trees for dynamic string data in secondary storage. In Proc. SPIRE, LNCS 4726, 184--194, 2007.
[18]
G. Manku, A. Jain, and A.-D. Sarma. Detecting near-duplicates for web crawling. In Proc. WWW, 141--150, 2007.
[19]
K. Mehlhorn and A. K. Tsakalidis. Data structures. In Handbook of Theoretical Computer Science, Volume A: Algorithms and Complexity (A), 301--342, 1990.
[20]
J. I. Munro. Succinct data structures. Electr. Notes Theor. Comput. Sci., 91(3), 2004.
[21]
G. Navarro and V. Mäkinen. Compressed full text indexes. ACM Computing Surveys, 39(1), 2007.
[22]
R. Raman, V. Raman, and S. S. Rao. Succinct indexable dictionaries with applications to encoding k-ary trees and multisets. In Proc. ACM-SIAM SODA, 233--242, 2002.
[23]
F. Ruskey. Combinatorial Generation, 2007. In preparation.
[24]
I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, second edition, 1999.

Published In

PODS '08: Proceedings of the twenty-seventh ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems
June 2008
330 pages
ISBN:9781605581521
DOI:10.1145/1376916

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. b-tree
  2. cache efficiency
  3. data compression
  4. front coding
  5. string searching

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '08

Acceptance Rates

PODS '08 Paper Acceptance Rate 28 of 159 submissions, 18%;
Overall Acceptance Rate 642 of 2,707 submissions, 24%

Cited By

  • (2025) Two-level massive string dictionaries. Information Systems 128:C. DOI: 10.1016/j.is.2024.102490. Online publication date: 1-Feb-2025
  • (2025) Packed Acyclic Deterministic Finite Automata. SOFSEM 2025: Theory and Practice of Computer Science, 284-297. DOI: 10.1007/978-3-031-82697-9_21. Online publication date: 16-Feb-2025
  • (2024) CoCo-trie. Information Systems 120:C. DOI: 10.1016/j.is.2023.102316. Online publication date: 1-Feb-2024
  • (2024) LZ78 Substring Compression with CDAWGs. String Processing and Information Retrieval, 289-305. DOI: 10.1007/978-3-031-72200-4_22. Online publication date: 19-Sep-2024
  • (2023) On Nonlinear Learned String Indexing. IEEE Access 11, 74021-74034. DOI: 10.1109/ACCESS.2023.3295434. Online publication date: 2023
  • (2023) Engineering a Textbook Approach to Index Massive String Dictionaries. String Processing and Information Retrieval, 203-217. DOI: 10.1007/978-3-031-43980-3_16. Online publication date: 26-Sep-2023
  • (2022) Building a Fast and Efficient LSM-tree Store by Integrating Local Storage with Cloud Storage. ACM Transactions on Architecture and Code Optimization 19:3, 1-26. DOI: 10.1145/3527452. Online publication date: 25-May-2022
  • (2022) Succinct Data Structure for Path Graphs. 2022 Data Compression Conference (DCC), 262-271. DOI: 10.1109/DCC52660.2022.00034. Online publication date: Mar-2022
  • (2022) Compressed String Dictionaries via Data-Aware Subtrie Compaction. String Processing and Information Retrieval, 233-249. DOI: 10.1007/978-3-031-20643-6_17. Online publication date: 8-Nov-2022
  • (2021) Non-Overlapping LZ77 Factorization and LZ78 Substring Compression Queries with Suffix Trees. Algorithms 14:2, 44. DOI: 10.3390/a14020044. Online publication date: 29-Jan-2021