
Optimizing partitioning strategies for faster inverted index compression

  • Research Article
Frontiers of Computer Science

Abstract

The inverted index is a key component that lets search engines manage billions of documents and respond quickly to users’ queries. Whereas substantial effort has been devoted to reducing space occupancy and improving decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index in alignment with its clustered distribution can effectively minimize the compressed size while accelerating its construction. In this study, we introduce compression speed as a criterion for evaluating compression techniques, and thoroughly analyze the performance of different partitioning strategies. We also propose optimizations that enhance state-of-the-art methods with faster compression speed and more flexibility in partitioning an index. Experiments show that our methods offer much better compression speed while retaining excellent space occupancy and decompression speed.
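To make the trade-off between partition quality and encoding cost concrete, the following is a minimal sketch and not the method proposed in the article. It assumes a simplified cost model in which each partition of d-gaps is bit-packed with one shared width plus a fixed header; `HEADER_BITS` and all function names are hypothetical. It compares a fixed-size partitioning, which needs a single linear pass, with an exhaustive dynamic program that finds the cheapest partition under this model in quadratic time.

```python
from typing import List

HEADER_BITS = 16  # hypothetical per-partition overhead (e.g. bit width and length)


def bits_needed(value: int) -> int:
    """Bits required to store a non-negative integer (at least 1)."""
    return max(1, value.bit_length())


def to_gaps(doc_ids: List[int]) -> List[int]:
    """Turn a non-empty, strictly increasing docID list into d-gaps."""
    return [doc_ids[0]] + [b - a for a, b in zip(doc_ids, doc_ids[1:])]


def block_cost(gaps: List[int], start: int, end: int) -> int:
    """Bits used to pack gaps[start:end] with one shared bit width."""
    return HEADER_BITS + bits_needed(max(gaps[start:end])) * (end - start)


def fixed_partition_cost(gaps: List[int], block_size: int) -> int:
    """Total bits when the list is cut into equal-sized blocks (linear time)."""
    return sum(
        block_cost(gaps, i, min(i + block_size, len(gaps)))
        for i in range(0, len(gaps), block_size)
    )


def optimal_partition_cost(gaps: List[int]) -> int:
    """Total bits of the cheapest partition under this cost model (O(n^2) DP)."""
    n = len(gaps)
    best = [0] + [float("inf")] * n  # best[i] = min bits to encode gaps[:i]
    for end in range(1, n + 1):
        widest = 0
        # Extend the last partition leftwards, tracking its largest gap.
        for start in range(end - 1, -1, -1):
            widest = max(widest, gaps[start])
            cost = best[start] + HEADER_BITS + bits_needed(widest) * (end - start)
            if cost < best[end]:
                best[end] = cost
    return int(best[n])


# Two dense clusters of docIDs separated by a large jump.
doc_ids = list(range(1000, 1064)) + list(range(500000, 500064))
gaps = to_gaps(doc_ids)
print("fixed blocks of 32:", fixed_partition_cost(gaps, 32), "bits")
print("optimal partition: ", optimal_partition_cost(gaps), "bits")
```

On clustered input like this, a partition that follows the clusters can isolate the few large gaps instead of letting them inflate the bit width of a whole block, but the exhaustive search does far more work per list than the fixed-size split, which is the kind of encoding-time cost the abstract refers to.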



Author information

Corresponding author

Correspondence to Xingshen Song.

Additional information

Xingshen Song, a doctoral candidate, received his MS degree in remote sensing from the Aviation University of the Air Force, China in 2013. His research interests include data structures for search engines, inverted index compression, and query processing optimization.

Yuxiang Yang received his PhD degree in computer science from the National University of Defense Technology (NUDT), China in 2008. Currently, he is a professor at the College of Computer, NUDT. His main research fields include information retrieval, information security, and cloud computing.

Yu Jiang, a master’s candidate, received her BE degree in computer science and technology from Xi’an Jiaotong University, China in 2010. Her research interests include query processing optimization and data structures for search engines.

Kun Jiang received his PhD degree in computer science from the National University of Defense Technology, China in 2015. He is now a postdoctoral fellow in the School of Electronic and Information Engineering, Xi’an Jiaotong University, China. His current research interests include information retrieval and machine learning.

About this article

Cite this article

Song, X., Yang, Y., Jiang, Y. et al. Optimizing partitioning strategies for faster inverted index compression. Front. Comput. Sci. 13, 343–356 (2019). https://doi.org/10.1007/s11704-016-6252-5

  • DOI: https://doi.org/10.1007/s11704-016-6252-5
