research-article

Permutation indexing: fast approximate retrieval from large corpora

Authors:
Maxim Gurevich (RelateIQ, Palo Alto, CA, USA)
Tamás Sarlós (Google Inc., Mountain View, CA, USA)

CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management, October 2013, Pages 1771–1776. https://doi.org/10.1145/2505515.2505646

Published: 27 October 2013

ABSTRACT

Inverted indexing is a ubiquitous technique used in retrieval systems, including web search. Despite its popularity, it has a drawback: query retrieval time is highly variable and grows with the corpus size. In this work we propose an alternative technique, permutation indexing, in which retrieval cost is strictly bounded and has only logarithmic dependence on the corpus size. Our approach is based on two novel techniques: (a) partitioning the term space into overlapping clusters of terms that frequently co-occur in queries, and (b) a data structure that compactly encodes the results of all queries composed of terms in a cluster as contiguous sequences of document ids. Query results are then retrieved by fetching a few small chunks of these sequences. There is a price, though: our encoding is lossy and thus returns approximate result sets. The fraction of the true results returned (the recall) is controlled by the level of redundancy: the more space is allocated to the permutation index, the higher the recall. We analyze permutation indexing both theoretically, under simplified document and query models, and empirically, on realistic document and query collections. We show that although permutation indexing cannot replace traditional retrieval methods, since high recall cannot be guaranteed on all queries, it covers up to 77% of tail queries and can be used to speed up retrieval for these queries.
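The abstract does not spell out the data structure, so the following is only a toy illustration of the general idea of answering a conjunctive query by reading one contiguous range of a precomputed document sequence. It is not the authors' encoding. The functions `build_toy_index` and `lookup`, the notion of a "term ordering", and the sample documents are all assumptions made for this sketch. Unlike the lossy structure described above, this toy is exact for the queries it covers and returns `None` for the rest, which a caller could route to a conventional inverted index.

```python
def build_toy_index(docs, orderings):
    """Toy sketch (hypothetical, not the paper's structure).

    docs: dict mapping doc id -> set of terms in that document.
    orderings: list of tuples of terms from one cluster.
    For each ordering, documents are sorted so that those containing the
    leading terms come first; the documents matching the first j terms
    then form a contiguous prefix of the stored sequence.
    """
    index = {}
    for order in orderings:
        def sort_key(doc_id, order=order):
            # False sorts before True, so documents containing a term lead.
            return tuple(t not in docs[doc_id] for t in order) + (doc_id,)

        seq = sorted(docs, key=sort_key)
        # prefix_len[j-1] = number of documents containing all of order[:j].
        prefix_len = [
            sum(1 for terms in docs.values() if set(order[:j]) <= terms)
            for j in range(1, len(order) + 1)
        ]
        index[order] = (seq, prefix_len)
    return index


def lookup(index, query):
    """Return matching doc ids by slicing one stored sequence, or None
    if no stored ordering starts with exactly the query's terms."""
    q = set(query)
    if not q:
        return None
    for order, (seq, prefix_len) in index.items():
        if set(order[:len(q)]) == q:
            return seq[:prefix_len[len(q) - 1]]
    return None  # not covered: fall back to another retrieval method


# Hypothetical sample data for illustration only.
docs = {
    1: {"cheap", "flights"},
    2: {"cheap", "hotels"},
    3: {"cheap", "flights", "paris"},
    4: {"paris", "hotels"},
    5: {"flights"},
}
index = build_toy_index(docs, [("cheap", "flights", "paris"),
                               ("paris", "hotels", "cheap")])
print(lookup(index, ["flights", "cheap"]))   # one contiguous slice
print(lookup(index, ["hotels", "flights"]))  # not covered by any ordering
```

In this toy, storing more orderings covers more queries at the cost of more space, which loosely mirrors the space-versus-recall trade-off the abstract describes, though the mechanism in the paper may differ.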

Index Terms

Permutation indexing: fast approximate retrieval from large corpora
1. Information systems
  1. Information retrieval
    1. Retrieval models and ranking
    2. Retrieval tasks and goals

Published in
CIKM '13: Proceedings of the 22nd ACM international conference on Information & Knowledge Management
October 2013
2612 pages
ISBN:9781450322638
DOI:10.1145/2505515
General Chairs:
Qi He (LinkedIn, USA)
Arun Iyengar (IBM T.J. Watson Research Center, USA)
Program Chairs:
Wolfgang Nejdl (L3S Research Center, Germany)
Jian Pei (Simon Fraser University, Canada)
Rajeev Rastogi (Amazon, India)
Copyright © 2013 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
embedding
inverted index
precomputation
Acceptance Rates
CIKM '13 Paper Acceptance Rate: 143 of 848 submissions, 17%
Overall Acceptance Rate: 1,861 of 8,427 submissions, 22%