skip to main content
10.1145/1526709.1526770acmconferencesArticle/Chapter ViewAbstractPublication PageswwwConference Proceedingsconference-collections
research-article

Compressed web indexes

Published:20 April 2009Publication History

ABSTRACT

Web search engines use indexes to efficiently retrieve pages containing specified query terms, as well as pages linking to specified pages. The problem of compressed indexes that permit such fast retrieval has a long history. We consider the problem: assuming that the terms in (or links to) a page are generated from a probability distribution, how well compactly can we build such indexes that allow fast retrieval? Of particular interest is the case when the probability distribution is Zipfian (or a similar power law), since these are the distributions that arise on the web. We obtain sharp bounds on the space requirement of Boolean indexes for text documents that follow Zipf's law. In the process we develop a general technique that applies to any probability distribution, not necessarily a power law; this is the first analysis of compression in indexes under arbitrary distributions. Our bounds lead to quantitative versions of rules of thumb that are folklore in indexing. Our experiments on several document collections show that the distribution of terms appears to follow a double-Pareto law rather than Zipf's law. Despite widely varying sets of documents, the index sizes observed in the experiments conform well to our theoretical predictions.

References

  1. T. M. Apostol. Introduction to Analytic Number Theory. Springer-Verlag, 1976.Google ScholarGoogle Scholar
  2. R. Baeza-Yates and G. Navarro. Block-addressing indices for approximate text retrieval. Journal of the American Society for Information Science, 51(1):69--82, 2000. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. R. Baeza-Yates and B. Ribeiro-Neto. Modern Information Retrieval. Addison Wesley, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. A.-L. Barabasi. Linked: How Everything is Connected to Everything Else and What It Means. Penguin Group, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. D. Bladford and G. Blelloch. Index compression through document reordering. In Proceedings of the Data Compression Conference, pages 342--351, 2002. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. P. Boldi and S. Vigna. The Webgraph framework i: Compression techniques. In Proceedings of the 13th International Conference on World Wide Web, pages 595--602, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. P. Boldi and S. Vigna. The Webgraph framework ii: Codes for the world-wide web. In Data Compression Conference, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. A. Gelbukh and G. Sidorov. Zipf and Heaps laws' coefficients depend on language. In Proceedings of the 2nd International Conference on Computational Linguistics and Intelligent Text Processing, pages 332--335, 2001. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. L. Q. Ha, E. I. Sicilia-Garcia, J. Ming, and F. J. Smith. Extension of Zipf's law to word and character n-grams for English and Chinese. Computational Linguistics and Chinese Language Processing, 8(1):77--102, 2003.Google ScholarGoogle Scholar
  10. H. S. Heaps. Information Retrieval: Computational and Theoretical Aspects. Academic Press, New York, 1978. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the Web for emerging cyber-communities. Computer Networks, 31(11--16):1481--1493, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. W. Li. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory, 38(6):1842--1845, 1992.Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. B. Mandelbrot. An information theory of the statistical structure of language. In W. Jackson, editor, Communication Theory, pages 486--502. Academic Press, 1953.Google ScholarGoogle Scholar
  14. C. D. Manning, P. Raghavan, and H. Schutze. Introduction to Information Retrieval. Cambridge University Press, 2008. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. C. D. Manning and H. Sch¨utze. Foundations of Statistical Natural Language Processing. MIT Press, Cambridge, MA, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. M. Mitzenmacher. Dynamic models for file sizes and double Pareto distributions. Internet Mathematics, 1(3):305--333, 2003.Google ScholarGoogle ScholarCross RefCross Ref
  17. M. Molloy and B. Reed. Graph Coloring and the Probabilistic Method. Springer-Verlag, 2002.Google ScholarGoogle ScholarCross RefCross Ref
  18. M. Newman, A.-L. Barabasi, and D. J. Watts. The Structure and Dynamics of Networks. Princeton University Press, 2006. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. W. J. Reed and M. Jorgensen. The double Pareto-lognormal distribution - A new parametric model for size distributions. Communications in Statistics: Theory and Methods, 33(8):1733--1753, 2004.Google ScholarGoogle ScholarCross RefCross Ref
  20. W.-Y. Shieh, T.-F. Chen, J. J.-J. Shann, and C.-P. Chung. Inverted file compression through document identifier reassignment. Information Processing and Management, 39(1):117--131, 2003. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. F. Silvestri, R. Perego, and S. Orlando. Assigning document identifiers to enhance compressibility of web search indexes. In Proceedings of the Symposium on Applied Computing, pages 600--605, 2004. Google ScholarGoogle ScholarDigital LibraryDigital Library
  22. H. A. Simon. On a class of skew distribution functions. Biometrika, 42:425--440, 1955.Google ScholarGoogle ScholarCross RefCross Ref
  23. D. C. van Leijenhorst and T. P. van der Weide. A formal derivation of Heap's law. Information Sciences, 170:263--272, 2005. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. D. Watts. Six Degrees: The Science of a Connected Age. W. W. Norton, 2003.Google ScholarGoogle Scholar
  25. H. E. Williams and J. Zobel. Searchable words on the web. International Journal on Digital Libraries, 5(2):99--105, 2005.Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. I. H. Witten and T. C. Bell. Source models for natural language text. International Journal Man-Machine Studies, 32(5):545--579, 1990. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. I. H. Witten, A. Moffat, and T. C. Bell. Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann, 1999. Google ScholarGoogle ScholarDigital LibraryDigital Library
  28. G. K. Zipf. Human Behavior and the Principle of Least Effort. Addison-Wesley, Cambridge MA, 1949Google ScholarGoogle Scholar

Index Terms

  1. Compressed web indexes

    Recommendations

    Comments

    Login options

    Check if you have access through your login credentials or your institution to get full access on this article.

    Sign in
    • Published in

      cover image ACM Conferences
      WWW '09: Proceedings of the 18th international conference on World wide web
      April 2009
      1280 pages
      ISBN:9781605584874
      DOI:10.1145/1526709

      Copyright © 2009 IW3C2 org

      Publisher

      Association for Computing Machinery

      New York, NY, United States

      Publication History

      • Published: 20 April 2009

      Permissions

      Request permissions about this article.

      Request Permissions

      Check for updates

      Qualifiers

      • research-article

      Acceptance Rates

      Overall Acceptance Rate1,899of8,196submissions,23%

      Upcoming Conference

      WWW '24
      The ACM Web Conference 2024
      May 13 - 17, 2024
      Singapore , Singapore

    PDF Format

    View or Download as a PDF file.

    PDF

    eReader

    View online with eReader.

    eReader