ABSTRACT
Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck.
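To make the building block concrete, here is a minimal sketch of exhaustive top-k retrieval under the additive scoring described above. It is a sequential baseline for illustration only, not Sparta's algorithm. The function name `top_k`, the dictionary-based index layout, and the toy scores are all assumptions made for this example.

```python
import heapq
from collections import defaultdict


def top_k(index, query_terms, k):
    """Return the k highest-scoring (doc_id, score) pairs.

    `index` is a hypothetical layout: it maps each term to a dict of
    {doc_id: per-term score}. A document's overall score is the sum of
    its per-term scores over all query terms.
    """
    scores = defaultdict(float)
    for term in query_terms:
        # Terms missing from the index contribute nothing.
        for doc_id, term_score in index.get(term, {}).items():
            scores[doc_id] += term_score
    # Select the k largest aggregated scores without sorting everything.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Toy data, invented for illustration.
index = {
    "parallel": {"d1": 2.0, "d2": 0.5},
    "top-k": {"d1": 1.0, "d3": 2.0},
    "retrieval": {"d2": 1.0, "d3": 0.5},
}
result = top_k(index, ["parallel", "top-k", "retrieval"], 2)
print(result)
```

This baseline touches every posting of every query term, so its cost grows with both the index size and the number of query terms, which is the scaling problem the abstract goes on to describe.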
Beyond the growth in data sizes, today's data processing scenarios also increase the number of features that contribute to the overall score. In web search, for example, verbose queries are becoming mainstream, yet state-of-the-art algorithms fail to process long queries in real time.
We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state of the art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to an order of magnitude above Sparta's.