Research article. DOI: 10.1145/2505515.2505646

Permutation indexing: fast approximate retrieval from large corpora

Published: 27 October 2013

ABSTRACT

Inverted indexing is a ubiquitous technique used in retrieval systems, including web search. Despite its popularity, it has a drawback: query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, in which retrieval cost is strictly bounded and depends only logarithmically on the corpus size. Our approach is based on two novel techniques: (a) partitioning the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure that compactly encodes the results of all queries composed of terms in a cluster as continuous sequences of document ids. Query results are then retrieved by fetching a few small chunks of these sequences. There is a price, though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, the recall, is controlled by the level of redundancy: the more space allocated to the permutation index, the higher the recall. We analyze permutation indexing both theoretically, under simplified document and query models, and empirically, on realistic document and query collections. We show that although permutation indexing cannot replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
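
The idea in the abstract can be illustrated with a toy sketch. This is not the paper's construction, which the abstract does not specify. It assumes a single cluster of two terms, conjunctive queries, and a hypothetical scheme in which each permutation stores one contiguous range of document ids per query. All names (`build_permutation`, `best_run`, `build_index`, `retrieve`, `recall`) and the sample data are invented for illustration.

```python
def build_permutation(docs, term_order):
    """Order doc ids so documents sharing the leading terms of term_order are adjacent."""
    # False sorts before True, so documents containing the first term come first.
    return sorted(docs, key=lambda d: tuple(t not in docs[d] for t in term_order))


def best_run(perm, docs, query):
    """Return (start, end) of the longest contiguous run of matching docs in perm."""
    best, start = (0, 0), None
    for i, d in enumerate(perm + [None]):  # None is a sentinel that closes the last run
        match = d is not None and query <= docs[d]
        if match and start is None:
            start = i
        elif not match and start is not None:
            if i - start > best[1] - best[0]:
                best = (start, i)
            start = None
    return best


def build_index(docs, term_orders, queries):
    """One permutation per term order; each stores a single range per query (lossy)."""
    index = []
    for order in term_orders:
        perm = build_permutation(docs, order)
        index.append((perm, {q: best_run(perm, docs, q) for q in queries}))
    return index


def retrieve(index, query):
    """Fetch one small chunk per permutation and union the results."""
    result = set()
    for perm, ranges in index:
        start, end = ranges[query]
        result.update(perm[start:end])
    return result


def recall(index, docs, query):
    """Fraction of the true (exact) result set that the index returns."""
    truth = {d for d, terms in docs.items() if query <= terms}
    return len(retrieve(index, query) & truth) / len(truth)


# Hypothetical data: six documents over a two-term cluster {"a", "b"}.
docs = {1: {"a", "b"}, 2: {"a"}, 3: {"b"}, 4: {"a", "b"}, 5: {"b"}, 6: set()}
queries = [frozenset({"a"}), frozenset({"b"}), frozenset({"a", "b"})]

one_perm = build_index(docs, [("a", "b")], queries)
two_perms = build_index(docs, [("a", "b"), ("b", "a")], queries)

print(recall(one_perm, docs, frozenset({"b"})))   # 0.5
print(recall(two_perms, docs, frozenset({"b"})))  # 1.0
```

With one permutation, the documents matching the query `b` are split into two runs and only one is stored, so half of the true results are lost. Adding a second permutation with the reverse term order spends more space and recovers them, which mirrors the space-for-recall trade-off the abstract describes.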

Published in

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
October 2013, 2612 pages
ISBN: 9781450322638
DOI: 10.1145/2505515

Copyright © 2013 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher: Association for Computing Machinery, New York, NY, United States

Acceptance Rates

CIKM '13 paper acceptance rate: 143 of 848 submissions, 17%. Overall acceptance rate: 1,861 of 8,427 submissions, 22%.
