Abstract
The inverted index is a key component that lets search engines manage billions of documents and respond quickly to users’ queries. Whereas substantial effort has been devoted to reducing space occupancy and improving decoding speed, the encoding speed when constructing the index has been overlooked. Partitioning the index in alignment with its clustered distribution can effectively minimize the compressed size while accelerating the construction procedure. In this study, we introduce compression speed as a criterion for evaluating compression techniques and thoroughly analyze the performance of different partitioning strategies. We also propose optimizations that enhance state-of-the-art methods with faster compression speed and more flexibility in partitioning an index. Experiments show that our methods offer much better compression speed while retaining excellent space occupancy and decompression speed.
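To make the idea of partitioning along a clustered distribution concrete, the following is a minimal sketch, not the method proposed in the article. It assumes a simple cost model that the abstract does not specify: each partition stores its d-gaps with a single fixed bit width plus a hypothetical per-partition header of `HEADER_BITS` bits. All function names, the header size, and the `max_len` limit are illustrative assumptions.

```python
from typing import List, Tuple

HEADER_BITS = 16  # assumed per-partition overhead; illustrative only


def gaps(postings: List[int]) -> List[int]:
    """Convert a strictly increasing docID list into d-gaps."""
    return [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]


def partition_cost(block: List[int]) -> int:
    """Bits needed to store one block with a single fixed bit width."""
    return HEADER_BITS + max(block).bit_length() * len(block)


def fixed_partition_cost(g: List[int], size: int) -> int:
    """Total bits when the gap list is cut into equal-length blocks."""
    return sum(partition_cost(g[i:i + size]) for i in range(0, len(g), size))


def optimal_partition(g: List[int], max_len: int = 64) -> Tuple[int, List[Tuple[int, int]]]:
    """Exact dynamic program over partition end points, O(n * max_len)."""
    n = len(g)
    best = [0] + [float("inf")] * n   # best[i]: minimal bits for g[:i]
    prev = [0] * (n + 1)              # prev[i]: start of the last block in that solution
    for end in range(1, n + 1):
        widest = 0
        # Grow the candidate last block g[start:end] leftwards.
        for start in range(end - 1, max(-1, end - max_len - 1), -1):
            widest = max(widest, g[start])
            cost = best[start] + HEADER_BITS + widest.bit_length() * (end - start)
            if cost < best[end]:
                best[end], prev[end] = cost, start
    # Walk the back-pointers to recover the partition boundaries.
    bounds = []
    end = n
    while end > 0:
        bounds.append((prev[end], end))
        end = prev[end]
    return best[n], bounds[::-1]


# Two dense clusters separated by large jumps.
postings = list(range(100, 108)) + [5000, 90000] + list(range(90001, 90007))
g = gaps(postings)
fixed_bits = fixed_partition_cost(g, 8)
opt_bits, bounds = optimal_partition(g)
print(fixed_bits, opt_bits, bounds)
```

Under this cost model, variable-length partitions can isolate the few large gaps so that the dense runs are stored with narrow bit widths. The exact search is more expensive to run than fixed-length blocking, which illustrates the kind of encoding-time cost the abstract draws attention to.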
Author information
Xingshen Song, a doctoral candidate, received his MS degree in remote sensing from the Aviation University of the Air Force, China in 2013. His research interests include data structures for search engines, inverted index compression, and query processing optimization.
Yuxiang Yang received his PhD degree in computer science from the National University of Defense Technology (NUDT), China in 2008. Currently, he is a professor at the College of Computer, NUDT. His main research fields include information retrieval, information security, and cloud computing.
Yu Jiang, a master’s candidate, received her BE degree in computer science and technology from Xi’an Jiaotong University, China in 2010. Her research interests include query processing optimization and data structures for search engines.
Kun Jiang received his PhD degree in computer science from the National University of Defense Technology, China in 2015. He is now a postdoctoral fellow in the School of Electronic and Information Engineering, Xi’an Jiaotong University, China. His current research interests include information retrieval and machine learning.
About this article
Cite this article
Song, X., Yang, Y., Jiang, Y. et al. Optimizing partitioning strategies for faster inverted index compression. Front. Comput. Sci. 13, 343–356 (2019). https://doi.org/10.1007/s11704-016-6252-5
DOI: https://doi.org/10.1007/s11704-016-6252-5