Article

Assigning identifiers to documents to enhance the clustering property of fulltext indexes

Authors:
Fabrizio Silvestri

Università di Pisa, Italy

Salvatore Orlando

Università di Venezia, Mestre, Italy

Raffaele Perego

ISTI - CNR, Pisa, Italy


SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval, July 2004, Pages 305–312. https://doi.org/10.1145/1008992.1009046

Published: 25 July 2004

ABSTRACT

Web search engines provide large-scale text document retrieval by processing huge inverted file indexes. Inverted file indexes allow fast query resolution and good memory utilization because their d-gap representation can be compressed effectively and efficiently with variable-length encoding methods. This paper proposes and evaluates several algorithms that seek an assignment of document identifiers that minimizes the average d-gap value, thereby enhancing the effectiveness of traditional compression methods. We ran several tests over the Google contest collection to validate the proposed techniques. The experiments demonstrated the scalability and effectiveness of our algorithms: using them, we improved the compression ratios of several encoding schemes appreciably, by up to 20.81%.
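
To make the link between identifier assignment and compression concrete, the following minimal sketch computes d-gaps for a toy inverted index and counts the bits a variable-length code would need before and after renumbering the documents. Everything in it is an assumption made for illustration: the two-term index, the hand-picked `mapping`, the helper names, and the choice of the Elias gamma code as the variable-length encoding. It does not reproduce the algorithms proposed in the paper, which search for a good assignment automatically.

```python
from typing import Dict, List

Index = Dict[str, List[int]]  # term -> sorted list of document identifiers


def d_gaps(postings: List[int]) -> List[int]:
    """Store the first identifier as-is, then the difference between neighbours."""
    return [postings[0]] + [cur - prev for prev, cur in zip(postings, postings[1:])]


def gamma_bits(n: int) -> int:
    """Length in bits of the Elias gamma code for an integer n >= 1."""
    return 2 * (n.bit_length() - 1) + 1


def total_bits(index: Index) -> int:
    """Bits needed to gamma-encode every d-gap in the index."""
    return sum(gamma_bits(g) for postings in index.values() for g in d_gaps(postings))


def average_gap(index: Index) -> float:
    """Mean d-gap over all posting lists."""
    gaps = [g for postings in index.values() for g in d_gaps(postings)]
    return sum(gaps) / len(gaps)


def reassign(index: Index, mapping: Dict[int, int]) -> Index:
    """Renumber documents with a permutation and re-sort each posting list."""
    return {term: sorted(mapping[d] for d in docs) for term, docs in index.items()}


# Hypothetical 8-document collection: documents sharing a term are interleaved.
original: Index = {"apple": [1, 3, 5, 7], "pear": [2, 4, 6, 8]}

# Hand-picked permutation (an assumption, not the paper's method) that gives
# documents sharing a term consecutive identifiers.
mapping = {1: 1, 3: 2, 5: 3, 7: 4, 2: 5, 4: 6, 6: 7, 8: 8}
clustered = reassign(original, mapping)

print(total_bits(original), total_bits(clustered))    # 22 12
print(average_gap(original), average_gap(clustered))  # 1.875 1.5
```

In this toy case, clustering documents that share terms turns most gaps into 1, which the gamma code stores in a single bit, so the same index shrinks from 22 to 12 bits.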

Index Terms

Assigning identifiers to documents to enhance the clustering property of fulltext indexes
1. Information systems
  1. Data management systems
    1. Data structures
      1. Data layout
        Data compression
  2. Information retrieval
    1. Evaluation of retrieval results


Published in
SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992
General Chair: Mark Sanderson, University of Sheffield (UK)
Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)
Copyright © 2004 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
clustering property
document identifier assignment
index compression
web search engines
Qualifiers
- Article
Conference

Acceptance Rates
Overall Acceptance Rate: 792 of 3,983 submissions, 20%