DOI: 10.1145/3332466.3374522

Scalable top-k retrieval with Sparta

Published: 19 February 2020

ABSTRACT

Many big data processing applications rely on a top-k retrieval building block, which selects (or approximates) the k highest-scoring data items based on an aggregation of features. In web search, for instance, a document's score is the sum of its scores for all query terms. Top-k retrieval is often used to sift through massive data and identify a smaller subset of it for further analysis. Because it filters out the bulk of the data, it often constitutes the main performance bottleneck.
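The additive scoring model described above can be illustrated with a minimal sequential sketch. It is an exhaustive baseline, not the Sparta algorithm, and it assumes a hypothetical in-memory index (`postings`) that maps each term to a dictionary of per-document scores; the function name `top_k` and the sample data are likewise illustrative only.

```python
import heapq
from collections import defaultdict


def top_k(postings, query_terms, k):
    """Return the k highest-scoring (doc_id, score) pairs for a query.

    postings: hypothetical index, {term: {doc_id: per-term score}}.
    A document's total score is the sum of its scores over all query terms.
    """
    scores = defaultdict(float)
    for term in query_terms:
        # Terms missing from the index contribute nothing.
        for doc_id, term_score in postings.get(term, {}).items():
            scores[doc_id] += term_score
    # Select the k largest aggregated scores without sorting every document.
    return heapq.nlargest(k, scores.items(), key=lambda item: item[1])


# Illustrative data only.
postings = {
    "parallel": {"d1": 1.0, "d2": 0.5},
    "search": {"d2": 2.0, "d3": 1.5},
}
result = top_k(postings, ["parallel", "search"], 2)
print(result)
```

This baseline touches every posting of every query term, so its cost grows with both the number of terms and the index size, which is the bottleneck the abstract refers to.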

Beyond the rise in data sizes, today's data processing scenarios also increase the number of features contributing to the overall score. In web search, for example, verbose queries are becoming mainstream, while state-of-the-art algorithms fail to process long queries in real time.

We present Sparta, a practical parallel algorithm that exploits multi-core hardware for fast (approximate) top-k retrieval. Thanks to lightweight coordination and judicious context sharing among threads, Sparta scales both in the number of features and in the searched index size. In our web search case study on 50M documents, Sparta processes 12-term queries more than twice as fast as the state of the art. On a tenfold bigger index, Sparta processes queries at the same speed, whereas the average latency of existing algorithms soars to an order of magnitude above Sparta's.


Published in

PPoPP '20: Proceedings of the 25th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming
February 2020
454 pages
ISBN: 9781450368186
DOI: 10.1145/3332466

Copyright © 2020 ACM


Publisher

Association for Computing Machinery, New York, NY, United States


Qualifiers

research-article

Acceptance Rates

PPoPP '20 Paper Acceptance Rate: 28 of 121 submissions, 23%
Overall Acceptance Rate: 230 of 1,014 submissions, 23%