
Scalable top-k retrieval with Sparta

Authors:
Gali Sheffi, Technion, Haifa, Israel
Dmitry Basin, Yahoo Research, Haifa, Israel
Edward Bortnikov, Yahoo Research, Haifa, Israel
David Carmel, Amazon, Haifa, Israel
Idit Keidar, Technion and Yahoo Research, Haifa, Israel

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, February 2020, Pages 62–73. https://doi.org/10.1145/3332466.3374522

Published: 19 February 2020

ABSTRACT

Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck.
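
The web search scoring rule described above (a document's score is the sum of its scores for all query terms) can be illustrated with a minimal sequential sketch. This is an exhaustive baseline for exposition only, not Sparta and not any algorithm from the paper; the function name `top_k`, the dictionary-based index layout, and the example scores are all hypothetical.

```python
import heapq
from collections import defaultdict


def top_k(index, query_terms, k):
    """Return the k highest-scoring (doc_id, score) pairs.

    `index` is a hypothetical in-memory layout: term -> {doc_id: score}.
    A document's total score is the sum of its per-term scores.
    """
    totals = defaultdict(float)
    for term in query_terms:
        # Terms missing from the index contribute nothing.
        for doc_id, score in index.get(term, {}).items():
            totals[doc_id] += score
    # Select the k largest totals without sorting every document.
    return heapq.nlargest(k, totals.items(), key=lambda item: item[1])


# Illustrative data only; these scores are made up.
example_index = {
    "parallel": {"d1": 3.0, "d2": 1.0, "d4": 2.0},
    "top-k": {"d1": 1.0, "d2": 4.0, "d3": 3.0},
    "retrieval": {"d2": 1.0, "d3": 2.0, "d4": 1.0},
}

result = top_k(example_index, ["parallel", "top-k", "retrieval"], 2)
print(result)  # [('d2', 6.0), ('d3', 5.0)]
```

This baseline touches every posting of every query term, so its cost grows with both the index size and the number of terms, which is the bottleneck the abstract refers to.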

Beyond the rise in data sizes, today's data processing scenarios also increase the number of features contributing to the overall score. In web search, for example, verbose queries are becoming mainstream, while state-of-the-art algorithms fail to process long queries in real time.

We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state of the art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to an order of magnitude above Sparta's.

Published in
PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN: 9781450368186
DOI: 10.1145/3332466
General Chair: Rajiv Gupta, UC Riverside
Program Chair: Xipeng Shen, NCSU
Copyright © 2020 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Publisher
Association for Computing Machinery
New York, NY, United States
Badges
- Artifacts Evaluated & Functional
- Results Reproduced / v1.1
Author Tags
information retrieval
multi-threading
parallel computing
performance
top-k search
web search
Qualifiers
- research-article
Acceptance Rates
PPoPP '20 Paper Acceptance Rate: 28 of 121 submissions, 23%
Overall Acceptance Rate: 230 of 1,014 submissions, 23%