research-article

Compressed web indexes

Authors:
Flavio Chierichetti (Sapienza University of Rome, Rome, Italy)
Ravi Kumar (Yahoo! Research, Sunnyvale, CA, USA)
Prabhakar Raghavan (Yahoo! Research, Sunnyvale, CA, USA)

WWW '09: Proceedings of the 18th international conference on World wide web, April 2009, pages 451–460
https://doi.org/10.1145/1526709.1526770

Published: 20 April 2009

ABSTRACT

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of building compressed indexes that permit such fast retrieval has a long history. We consider the following problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build indexes that still allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
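
To make the setting concrete, here is a minimal sketch of the kind of experiment the abstract alludes to. It is not the authors' construction or analysis. It assumes that each document draws its terms independently from a Zipfian distribution, builds a Boolean inverted index (term to sorted list of document ids), and measures the index size when the gaps between consecutive document ids are written with Elias gamma codes, a standard textbook choice used here only for illustration. All function names and parameter values (`build_boolean_index`, `exponent`, the collection sizes) are hypothetical.

```python
import random
from collections import defaultdict


def zipf_weights(vocab_size, exponent=1.0):
    """Unnormalized Zipfian weights: the term of rank r gets weight r**(-exponent)."""
    return [r ** -exponent for r in range(1, vocab_size + 1)]


def build_boolean_index(num_docs, terms_per_doc, vocab_size, exponent=1.0, seed=0):
    """Sample documents term by term and record, per term, the docs containing it."""
    rng = random.Random(seed)
    weights = zipf_weights(vocab_size, exponent)
    term_ids = list(range(vocab_size))
    index = defaultdict(list)
    for doc_id in range(num_docs):
        drawn = rng.choices(term_ids, weights=weights, k=terms_per_doc)
        # Boolean index: only presence matters, so duplicates within a doc collapse.
        for term in set(drawn):
            index[term].append(doc_id)  # doc ids arrive in increasing order
    return index


def gamma_bits(n):
    """Length in bits of the Elias gamma code of a positive integer n."""
    return 2 * (n.bit_length() - 1) + 1


def gap_encoded_size_bits(index):
    """Total bits if every posting list stores gamma-coded gaps between doc ids."""
    total = 0
    for postings in index.values():
        prev = -1  # so the first gap is doc_id + 1, which is at least 1
        for doc_id in postings:
            total += gamma_bits(doc_id - prev)
            prev = doc_id
    return total


def fixed_width_size_bits(index, num_docs):
    """Baseline: every posting stored as a fixed-width doc id."""
    width = max(1, (num_docs - 1).bit_length())
    return width * sum(len(p) for p in index.values())


if __name__ == "__main__":
    docs, length, vocab = 2000, 200, 50000  # illustrative values only
    idx = build_boolean_index(docs, length, vocab)
    print("postings:", sum(len(p) for p in idx.values()))
    print("gap-coded bits:", gap_encoded_size_bits(idx))
    print("fixed-width bits:", fixed_width_size_bits(idx, docs))
```

Varying `exponent` and the collection size in such a simulation gives a rough empirical feel for how the term distribution drives index size, which is the quantity the paper bounds analytically.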


Index Terms

Information systems
  Information retrieval

Published in
WWW '09: Proceedings of the 18th international conference on World wide web
April 2009
1280 pages
ISBN: 9781605584874
DOI: 10.1145/1526709
General Chairs: Juan Quemada (DIT-UPM), Gonzalo León (DIT-UPM)
Program Chairs: Yoelle Maarek (Google Inc., Israel), Wolfgang Nejdl (L3S and Hannover University)
Copyright © 2009 IW3C2 org
Publisher: Association for Computing Machinery, New York, NY, United States
Author Tags
compression
double-pareto
index size
power law
Acceptance Rates
Overall acceptance rate: 1,899 of 8,196 submissions (23%)