DOI: 10.1145/2684822.2685297
research-article

Optimal Space-time Tradeoffs for Inverted Indexes

Published: 02 February 2015

Abstract

Inverted indexes are usually represented by dividing posting lists into constant-sized blocks and representing them with an encoder for sequences of integers. Different encoders yield different points on the space-time trade-off curve, with the fastest being several times larger than the most space-efficient. An important design decision for an index is thus the choice of the fastest encoding method such that the index fits in the available memory. However, the space budget could be used better by applying faster encoders to frequently accessed blocks and more space-efficient ones to those that are rarely accessed. To perform this choice optimally, we introduce a linear-time algorithm that, given a query distribution and a set of encoders, selects the best encoder for each index block to obtain the lowest expected query processing time while respecting a given space constraint. To demonstrate the effectiveness of this approach, we perform an extensive experimental analysis, which shows that our algorithm produces indexes that are significantly faster than single-encoder indexes under several query processing strategies, while respecting the same space constraints.
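The selection problem described in the abstract can be read as a multiple-choice knapsack: each block picks exactly one encoder, each choice has a space cost and an expected time cost, and total space is bounded. The sketch below is a simple greedy over per-block convex hulls using a heap. It is not the linear-time algorithm the paper introduces, and it is a heuristic rather than a guaranteed optimum. The names `lower_hull`, `choose_encoders`, the encoder labels, and all numbers are hypothetical; the expected cost of a block is assumed to be its access weight times the encoder's decode time.

```python
import heapq

def lower_hull(options):
    """Return the (name, space, cost) options on the lower convex hull, by increasing space."""
    front = []
    for name, space, cost in sorted(options, key=lambda o: (o[1], o[2])):
        # Drop dominated options: more space must buy strictly lower cost.
        if not front or cost < front[-1][2]:
            front.append((name, space, cost))
    hull = []
    for p in front:
        # Pop the middle point while the step into it saves no more per byte
        # than the step out of it (keeps savings per byte strictly decreasing).
        while len(hull) >= 2:
            (_, sa, ca), (_, sb, cb) = hull[-2], hull[-1]
            if (ca - cb) * (p[1] - sb) <= (cb - p[2]) * (sb - sa):
                hull.pop()
            else:
                break
        hull.append(p)
    return hull

def choose_encoders(blocks, budget):
    """blocks: list of (access_weight, {encoder_name: (space, decode_time)}), each dict non-empty."""
    hulls = [
        lower_hull([(name, space, weight * time)
                    for name, (space, time) in encoders.items()])
        for weight, encoders in blocks
    ]
    choice = [0] * len(hulls)            # start every block at its most compact option
    used = sum(h[0][1] for h in hulls)
    if used > budget:
        raise ValueError("budget too small even for the most compact encoders")

    heap = []                            # entries: (-savings per extra byte, block index)

    def push_next_upgrade(b):
        i = choice[b]
        if i + 1 < len(hulls[b]):
            (_, s0, c0), (_, s1, c1) = hulls[b][i], hulls[b][i + 1]
            heapq.heappush(heap, (-(c0 - c1) / (s1 - s0), b))

    for b in range(len(hulls)):
        push_next_upgrade(b)
    while heap:
        _, b = heapq.heappop(heap)
        extra = hulls[b][choice[b] + 1][1] - hulls[b][choice[b]][1]
        if used + extra > budget:
            continue                     # this block keeps its current encoder
        used += extra
        choice[b] += 1
        push_next_upgrade(b)
    return [hulls[b][choice[b]][0] for b in range(len(hulls))], used

# Hypothetical encoders: name -> (bytes per block, decode time per access).
ENCODERS = {"compact": (10, 4.0), "balanced": (14, 2.0), "fast": (20, 1.0)}
blocks = [(100, ENCODERS), (1, ENCODERS)]      # one hot block, one cold block
print(choose_encoders(blocks, budget=30))      # (['fast', 'compact'], 30)
```

With a budget of 30, the hot block gets the fast encoder and the cold block stays compact, which is the behaviour the abstract argues for: space is spent where accesses are frequent.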




    Published In

    WSDM '15: Proceedings of the Eighth ACM International Conference on Web Search and Data Mining
    February 2015
    482 pages
    ISBN:9781450333177
    DOI:10.1145/2684822


    Publisher

    Association for Computing Machinery

    New York, NY, United States


    Author Tags

    1. compression
    2. inverted indexes
    3. knapsack problems

    Qualifiers

    • Research-article

    Conference

    WSDM 2015

    Acceptance Rates

    WSDM '15 Paper Acceptance Rate: 39 of 238 submissions, 16%
    Overall Acceptance Rate: 498 of 2,863 submissions, 17%


    Cited By

    • (2024) CoCo-trie. Information Systems, 120:C. DOI: 10.1016/j.is.2023.102316. Online publication date: 1-Feb-2024
    • (2023) Khronos: A Real-Time Indexing Framework for Time Series Databases on Large-Scale Performance Monitoring Systems. Proceedings of the 32nd ACM International Conference on Information and Knowledge Management, pp. 1607-1616. DOI: 10.1145/3583780.3614944. Online publication date: 21-Oct-2023
    • (2022) An NVM SSD-based High Performance Query Processing Framework for Search Engines. IEEE Transactions on Knowledge and Data Engineering, pp. 1-1. DOI: 10.1109/TKDE.2022.3160557. Online publication date: 2022
    • (2022) Compressing and Querying Integer Dictionaries Under Linearities and Repetitions. IEEE Access, 10, pp. 118831-118848. DOI: 10.1109/ACCESS.2022.3221520. Online publication date: 2022
    • (2020) Techniques for Inverted Index Compression. ACM Computing Surveys, 53:6, pp. 1-36. DOI: 10.1145/3415148. Online publication date: 6-Dec-2020
    • (2020) Using an Inverted Index Synopsis for Query Latency and Performance Prediction. ACM Transactions on Information Systems, 38:3, pp. 1-33. DOI: 10.1145/3389795. Online publication date: 18-May-2020
    • (2020) An NVM SSD-Optimized Query Processing Framework. Proceedings of the 29th ACM International Conference on Information & Knowledge Management, pp. 935-944. DOI: 10.1145/3340531.3412010. Online publication date: 19-Oct-2020
    • (2020) Pipelined Query Processing Using Non-volatile Memory SSDs. Web and Big Data, pp. 457-472. DOI: 10.1007/978-3-030-60290-1_35. Online publication date: 14-Oct-2020
    • (2019) A Hybrid BitFunnel and Partitioned Elias-Fano Inverted Index. The World Wide Web Conference, pp. 1153-1163. DOI: 10.1145/3308558.3313553. Online publication date: 13-May-2019
    • (2019) Fast Dictionary-Based Compression for Inverted Indexes. Proceedings of the Twelfth ACM International Conference on Web Search and Data Mining, pp. 6-14. DOI: 10.1145/3289600.3290962. Online publication date: 30-Jan-2019
