ABSTRACT
Web search engines provide large-scale text document retrieval by processing huge Inverted File indexes. Inverted File indexes allow fast query resolution and good memory utilization because their d-gap representation, in which each posting list is stored as the differences between consecutive document identifiers, can be compressed effectively and efficiently with variable-length encoding methods. This paper proposes and evaluates several algorithms that seek an assignment of document identifiers minimizing the average d-gap value, thus enhancing the effectiveness of traditional compression methods. We ran several tests on the Google contest collection to validate the proposed techniques. The experiments demonstrated the scalability and effectiveness of our algorithms, which improved the compression ratios of several encoding schemes appreciably, by up to 20.81%.
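To make the link between identifier assignment and compressed size concrete, the following is a minimal sketch, not the algorithms proposed in the paper. It assumes variable-byte coding (7 payload bits per byte) as one representative variable-length encoding, and it uses a hand-picked identifier mapping on a toy index. All function names, the toy data, and the mapping are hypothetical and chosen only for illustration.

```python
from typing import Dict, List


def d_gaps(postings: List[int]) -> List[int]:
    """Turn a strictly increasing posting list into its d-gap form."""
    gaps = []
    previous = 0
    for doc_id in postings:
        gaps.append(doc_id - previous)
        previous = doc_id
    return gaps


def vbyte_length(value: int) -> int:
    """Bytes needed for a positive integer at 7 payload bits per byte."""
    length = 1
    while value >= 128:
        value >>= 7
        length += 1
    return length


def index_size(index: Dict[str, List[int]]) -> int:
    """Total bytes used by the variable-byte coded d-gaps of all lists."""
    return sum(
        vbyte_length(gap)
        for postings in index.values()
        for gap in d_gaps(postings)
    )


def remap(index: Dict[str, List[int]],
          mapping: Dict[int, int]) -> Dict[str, List[int]]:
    """Apply a new identifier assignment and re-sort every posting list."""
    return {
        term: sorted(mapping[doc_id] for doc_id in postings)
        for term, postings in index.items()
    }


# Toy index (assumed data): documents sharing a term have distant identifiers.
original = {
    "web": [10, 300, 600, 900],
    "index": [20, 310, 610],
}

# Hand-picked assignment (assumption): documents sharing a term get
# consecutive identifiers, which shrinks the gaps inside each list.
mapping = {10: 1, 300: 2, 600: 3, 900: 4, 20: 5, 310: 6, 610: 7}
reassigned = remap(original, mapping)

print(index_size(original), index_size(reassigned))
```

In this toy case most of the original gaps need two bytes each, while after reassignment every gap fits in one byte. Finding a good assignment automatically for a large collection is the problem the paper addresses.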
Index Terms
- Assigning identifiers to documents to enhance the clustering property of fulltext indexes