ABSTRACT
Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.
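To make the question in the abstract concrete, the following Python sketch estimates the size of a Boolean term-document index when terms are drawn from a Zipfian distribution. It is only an illustration under stated assumptions and is not the paper's analysis or its bounds: each document is assumed to consist of a fixed number of independent draws from the term distribution, term occurrences are treated as independent across terms, and the estimate is the sum of binary entropies of the per-term occurrence probabilities. All function names and parameter values are hypothetical.

```python
import math


def zipf_probabilities(num_terms, exponent=1.0):
    """Return p_i proportional to 1 / i**exponent for ranks i = 1..num_terms."""
    weights = [1.0 / (rank ** exponent) for rank in range(1, num_terms + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def binary_entropy(q):
    """Entropy in bits of a Bernoulli(q) variable; 0 at the endpoints."""
    if q <= 0.0 or q >= 1.0:
        return 0.0
    return -q * math.log2(q) - (1.0 - q) * math.log2(1.0 - q)


def entropy_index_bits(num_docs, term_probs, draws_per_doc):
    """Rough entropy-based size estimate for a Boolean index.

    Assumption (not from the paper): a document is draws_per_doc independent
    draws from term_probs, so term i occurs in a document with probability
    q_i = 1 - (1 - p_i)**draws_per_doc, and terms are treated independently.
    """
    bits_per_doc = 0.0
    for p in term_probs:
        q = 1.0 - (1.0 - p) ** draws_per_doc
        bits_per_doc += binary_entropy(q)
    return num_docs * bits_per_doc


def uncompressed_bitmap_bits(num_docs, num_terms):
    """Size of the uncompressed term-document incidence matrix, in bits."""
    return num_docs * num_terms


if __name__ == "__main__":
    probs = zipf_probabilities(num_terms=10_000, exponent=1.0)
    estimate = entropy_index_bits(num_docs=1_000, term_probs=probs, draws_per_doc=200)
    raw = uncompressed_bitmap_bits(1_000, 10_000)
    print(f"entropy estimate: {estimate:.0f} bits, raw bitmap: {raw} bits")
```

Under these assumptions, most of the saving over the raw bitmap comes from the long tail of rare terms, whose occurrence probabilities are close to zero and therefore cost far less than one bit per document.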