
Inverted index compression and query processing with optimized document ordering

Authors:
Hao Yan, Polytechnic Institute of NYU, Brooklyn, NY, USA
Shuai Ding, Polytechnic Institute of NYU, Brooklyn, NY, USA
Torsten Suel, Yahoo! Research, Sunnyvale, CA, USA
WWW '09: Proceedings of the 18th international conference on World wide web, April 2009, Pages 401–410, https://doi.org/10.1145/1526709.1526764

Published: 20 April 2009

ABSTRACT

Web search engines use highly optimized compression schemes to decrease inverted index size and improve query throughput, and many index compression techniques have been studied in the literature. One approach taken by several recent studies first renumbers the document IDs in the collection so that similar documents are grouped together, and then applies standard compression techniques. It is known that this can significantly improve index compression compared to a random document ordering. We study index compression and query processing techniques for such reordered indexes. Previous work has focused on determining the best possible ordering of documents. In contrast, we assume that such an ordering is already given, and focus on how to optimize compression methods and query processing for this case. We perform an extensive study of compression techniques for document IDs and present new optimizations of existing techniques that can achieve significant improvements in both compression and decompression performance. We also propose and evaluate techniques for compressing frequency values for this case. Finally, we study the effect of this approach on query processing performance. Our experiments show very significant improvements in index size and query processing speed on the TREC GOV2 collection of 25.2 million web pages.
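To illustrate why renumbering helps, the following is a minimal sketch, not the authors' method. It assumes the common layout in which each inverted list stores sorted document IDs as gaps (differences between consecutive IDs) and encodes the gaps with a generic variable-byte code. The function names and the two example posting lists are hypothetical, chosen only to contrast a scattered ID assignment with a clustered one.

```python
def to_gaps(doc_ids):
    """Convert a sorted list of document IDs into the first ID plus successive gaps."""
    gaps, prev = [], 0
    for doc_id in doc_ids:
        gaps.append(doc_id - prev)
        prev = doc_id
    return gaps


def from_gaps(gaps):
    """Inverse of to_gaps: rebuild the document IDs by a running sum."""
    doc_ids, total = [], 0
    for gap in gaps:
        total += gap
        doc_ids.append(total)
    return doc_ids


def varbyte_encode(numbers):
    """Generic variable-byte code: 7 payload bits per byte, high bit set on the last byte."""
    out = bytearray()
    for n in numbers:
        while n >= 128:
            out.append(n & 0x7F)
            n >>= 7
        out.append(n | 0x80)
    return bytes(out)


def varbyte_decode(data):
    """Decode a byte string produced by varbyte_encode."""
    numbers, n, shift = [], 0, 0
    for b in data:
        if b & 0x80:  # final byte of this number
            numbers.append(n | ((b & 0x7F) << shift))
            n, shift = 0, 0
        else:
            n |= b << shift
            shift += 7
    return numbers


# Hypothetical posting lists for one term with six matching documents.
scattered = [3, 5000, 90000, 250000, 1200000, 2000000]          # random-like ordering
clustered = [1000000, 1000001, 1000002, 1000004, 1000005, 1000009]  # similar docs adjacent

scattered_bytes = varbyte_encode(to_gaps(scattered))
clustered_bytes = varbyte_encode(to_gaps(clustered))
print(len(scattered_bytes), len(clustered_bytes))
```

When the documents containing a term receive nearby IDs, most gaps are small and fit in a single byte, so the same six postings take fewer bytes. The sketch only shows the general effect of gap coding under reordering; it does not reproduce the compression schemes or optimizations evaluated in the paper.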


Index Terms

1. Information systems
  1. Information retrieval

Published in
WWW '09: Proceedings of the 18th international conference on World wide web
April 2009
1280 pages
ISBN: 9781605584874
DOI: 10.1145/1526709
General Chairs: Juan Quemada (DIT-UPM), Gonzalo León (DIT-UPM)
Program Chairs: Yoelle Maarek (Google Inc., Israel), Wolfgang Nejdl (L3S and Hannover University)
Copyright © 2009 IW3C2 org
Publisher: Association for Computing Machinery, New York, NY, United States

Author Tags
IR query processing
document ordering
index compression
inverted index
search engines
Qualifiers
- research-article
Conference

Acceptance Rates
Overall Acceptance Rate: 1,899 of 8,196 submissions, 23%
