ABSTRACT
Inverted indexing is a ubiquitous technique used in retrieval systems, including web search. Despite its popularity, it has a drawback: query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, where retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning of the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure for compactly encoding the results of all queries composed of terms in a cluster as continuous sequences of document ids. Query results are then retrieved by fetching a few small chunks of these sequences. There is a price, though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned, the recall, is controlled by the level of redundancy: the more space is allocated to the permutation index, the higher the recall. We analyze permutation indexing both theoretically, under simplified document and query models, and empirically, on realistic document and query collections. We show that although permutation indexing cannot replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
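The abstract does not spell out the data structure, so the following is only a toy sketch of the general idea, not the paper's algorithm. It assumes that each query over a cluster's terms is mapped to a single contiguous slice of a fixed document ordering, that the slice is chosen as the longest run of true results (so no false positives are returned), and that redundancy is modeled by keeping several orderings and taking the union of their slices. The names `DOCS`, `ORDER_A`, `ORDER_B`, `longest_run`, `build_index`, `retrieve`, and `recall` are all hypothetical.

```python
from itertools import combinations

# Hypothetical toy corpus: document id -> set of terms from one cluster.
DOCS = {
    1: {"a"},
    2: {"a", "b"},
    3: {"b"},
    4: {"b", "c"},
    5: {"a", "c"},
    6: {"c"},
}

# Two hypothetical document orderings (permutations of the ids above).
ORDER_A = [1, 2, 3, 4, 6, 5]
ORDER_B = [5, 1, 2, 3, 4, 6]


def true_results(docs, query):
    """Exact conjunctive result: ids of documents containing every query term."""
    return {doc_id for doc_id, terms in docs.items() if query <= terms}


def longest_run(order, results):
    """Return (start, end) of the longest slice of `order` made only of true results."""
    best_start, best_len = 0, 0
    run_start = 0
    for i, doc_id in enumerate(order):
        if doc_id not in results:
            run_start = i + 1  # the current run is broken; restart after this position
        elif i + 1 - run_start > best_len:
            best_start, best_len = run_start, i + 1 - run_start
    return best_start, best_start + best_len


def build_index(docs, cluster_terms, orders):
    """For every non-empty term subset, store one slice per ordering (assumed layout)."""
    index = {}
    for size in range(1, len(cluster_terms) + 1):
        for combo in combinations(sorted(cluster_terms), size):
            query = frozenset(combo)
            results = true_results(docs, query)
            index[query] = [longest_run(order, results) for order in orders]
    return index


def retrieve(index, orders, query):
    """Fetch one small chunk per ordering and return the union of the chunks."""
    found = set()
    for order, (start, end) in zip(orders, index[frozenset(query)]):
        found.update(order[start:end])
    return found


def recall(docs, retrieved, query):
    """Fraction of the true results that were returned (1.0 if there are none)."""
    truth = true_results(docs, frozenset(query))
    return 1.0 if not truth else len(retrieved & truth) / len(truth)
```

In this toy setup, indexing with `ORDER_A` alone answers the query `{"a"}` with documents 1 and 2 and misses document 5, a recall of 2/3. Adding `ORDER_B` as a second ordering recovers all three documents, which illustrates, under the assumptions above, how extra space can be traded for higher recall.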
Index Terms
- Permutation indexing: fast approximate retrieval from large corpora