DOI:10.1145/2600428.2609615
research-article

Partitioned Elias-Fano indexes

Published: 03 July 2014

Abstract

The Elias-Fano representation of monotone sequences has recently been applied to the compression of inverted indexes, showing excellent query performance thanks to its efficient random access and search operations. While its space occupancy is competitive with some state-of-the-art methods such as gamma-delta-Golomb codes and PForDelta, it fails to exploit the local clustering that inverted lists usually exhibit, namely the presence of long subsequences of close identifiers. In this paper we describe a new representation based on partitioning the list into chunks and encoding both the chunks and their endpoints with Elias-Fano, hence forming a two-level data structure. This partitioning enables the encoding to better adapt to the local statistics of each chunk, thus exploiting clustering and improving compression. We present two partition strategies, with fixed-length and variable-length chunks respectively. For the latter case we introduce a linear-time optimization algorithm which identifies the minimum-space partition up to an arbitrarily small approximation factor.
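As a rough illustration of the plain Elias-Fano representation mentioned above, the following Python sketch splits each value into low bits stored verbatim and high bits stored in a unary-coded bit vector, then recovers the i-th value. The function names `ef_encode` and `ef_access` and the dictionary layout are assumptions made for this example, not the paper's implementation; a practical index would keep the bits packed in machine words and answer the select query with a succinct structure rather than the linear scan used here.

```python
def ef_encode(values, universe):
    """Encode a non-decreasing list of ints in [0, universe) (illustrative layout)."""
    n = len(values)
    if n == 0:
        return {"n": 0, "low_bits": 0, "lows": [], "high": []}
    # Low bits per element: floor(log2(universe / n)), never negative.
    ratio = universe // n
    low_bits = ratio.bit_length() - 1 if ratio >= 1 else 0
    mask = (1 << low_bits) - 1
    # Low parts are stored as-is (here, one Python int per element).
    lows = [v & mask for v in values]
    # High parts: element i with high part h sets bit number h + i.
    high = [0] * (n + (universe >> low_bits) + 1)
    for i, v in enumerate(values):
        high[(v >> low_bits) + i] = 1
    return {"n": n, "low_bits": low_bits, "lows": lows, "high": high}


def ef_access(ef, i):
    """Return the i-th encoded value (0-based) by locating the i-th set bit."""
    seen = -1
    for pos, bit in enumerate(ef["high"]):
        if bit:
            seen += 1
            if seen == i:
                # Position of the i-th set bit minus i gives the high part.
                return ((pos - i) << ef["low_bits"]) | ef["lows"][i]
    raise IndexError(i)


postings = [3, 4, 7, 13, 14, 15, 21, 43]
ef = ef_encode(postings, universe=44)
decoded = [ef_access(ef, i) for i in range(len(postings))]
```

Here `decoded` matches `postings`, which shows that random access needs only one select on the high bits plus one lookup in the low bits.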
We show that our partitioned Elias-Fano indexes offer significantly better compression than plain Elias-Fano, while preserving their query time efficiency. Furthermore, compared with other state-of-the-art compressed encodings, our indexes exhibit the best compression ratio/query time trade-off.
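To make the compression claim concrete, here is a small Python sketch that estimates the space of plain Elias-Fano against a two-level layout with fixed-length chunks on a clustered list. It rests on several simplifying assumptions that are not taken from the paper: the helper names `ef_bits` and `partitioned_bits` are hypothetical, every chunk is encoded with Elias-Fano relative to the previous chunk's endpoint, the upper level stores only the endpoints, and per-chunk pointers are ignored, so the partitioned figure is optimistic. It does not implement the variable-length optimization algorithm.

```python
def ef_bits(n, universe):
    """Bits used by a simple Elias-Fano layout for n values in [0, universe)."""
    if n == 0:
        return 0
    ratio = universe // n
    low_bits = ratio.bit_length() - 1 if ratio >= 1 else 0
    # n low parts, plus a unary-coded high-bits vector.
    return n * low_bits + n + (universe >> low_bits) + 1


def partitioned_bits(values, chunk_size):
    """Estimated bits for fixed-length chunks plus an endpoint level (sketch)."""
    if not values:
        return 0
    total = 0
    endpoints = []
    base = 0  # values in the next chunk are stored relative to this
    for start in range(0, len(values), chunk_size):
        chunk = values[start:start + chunk_size]
        last = chunk[-1]
        # Chunk values shifted by `base` fall in [0, last - base].
        total += ef_bits(len(chunk), last - base + 1)
        endpoints.append(last)
        base = last
    # Upper level: the chunk endpoints, themselves a monotone sequence.
    total += ef_bits(len(endpoints), endpoints[-1] + 1)
    return total


# Two dense runs of document identifiers separated by a large gap.
clustered = list(range(1000, 1128)) + list(range(1_000_000, 1_000_128))
plain = ef_bits(len(clustered), clustered[-1] + 1)
partitioned = partitioned_bits(clustered, chunk_size=128)
```

On this input `partitioned` comes out smaller than `plain`, because the first chunk is sized for its own narrow range instead of the range of the whole list.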


Published In

SIGIR '14: Proceedings of the 37th international ACM SIGIR conference on Research & development in information retrieval
July 2014
1330 pages
ISBN:9781450322577
DOI:10.1145/2600428
Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. compression
  2. dynamic programming
  3. inverted indexes

Acceptance Rates

SIGIR '14 paper acceptance rate: 82 of 387 submissions (21%)
Overall acceptance rate: 792 of 3,983 submissions (20%)
