DOI: 10.1145/1526709.1526764 (ACM Conferences, WWW Conference Proceedings)
research-article

Inverted index compression and query processing with optimized document ordering

Published: 20 April 2009

ABSTRACT

Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first renumbers the document IDs in the collection so that similar documents are grouped together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques that achieve significant improvements in both compression and decompression performance. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
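To illustrate why grouping similar documents under nearby IDs helps compression, here is a minimal Python sketch. It is not the method evaluated in the paper: it assumes a plain gap (d-gap) representation of a posting list followed by variable-byte coding, one common "standard compression technique". The function names and the two example posting lists (`scattered`, `clustered`) are hypothetical and chosen only for illustration.

```python
def to_gaps(doc_ids):
    """Turn a strictly increasing list of document IDs into gaps (d-gaps)."""
    gaps, prev = [], 0
    for d in doc_ids:
        gaps.append(d - prev)
        prev = d
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: a running sum restores the document IDs."""
    ids, total = [], 0
    for g in gaps:
        total += g
        ids.append(total)
    return ids


def varbyte_encode(numbers):
    """Variable-byte code: 7 payload bits per byte, most significant group
    first, with the high bit set on the last byte of each number."""
    out = bytearray()
    for n in numbers:
        chunk = []
        while True:
            chunk.append(n & 0x7F)
            n >>= 7
            if n == 0:
                break
        chunk.reverse()
        chunk[-1] |= 0x80  # terminator flag on the final byte
        out.extend(chunk)
    return bytes(out)


def varbyte_decode(data):
    """Decode a byte string produced by varbyte_encode."""
    numbers, n = [], 0
    for b in data:
        if b & 0x80:
            numbers.append((n << 7) | (b & 0x7F))
            n = 0
        else:
            n = (n << 7) | b
    return numbers


# Hypothetical posting lists for one term, eight documents each.
# "scattered" mimics a random ordering; "clustered" mimics an ordering
# in which similar documents received neighboring IDs.
scattered = [3, 9000, 41000, 130000, 260000, 700000, 1500000, 2400000]
clustered = [500001, 500002, 500003, 500005, 500006, 500008, 500009, 500010]

for name, postings in (("scattered", scattered), ("clustered", clustered)):
    encoded = varbyte_encode(to_gaps(postings))
    assert from_gaps(varbyte_decode(encoded)) == postings
    print(name, len(encoded), "bytes")
```

With clustered IDs most gaps are tiny and fit in a single byte, so the encoded list is shorter even though both lists hold the same number of postings. Variable-byte coding is byte-aligned, so it cannot use fewer than 8 bits per gap regardless of how small the gaps are.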
