research-article

Presto: A Decade of SQL Analytics at Meta

Published: 20 June 2023

Abstract

Presto is an open-source distributed SQL query engine that supports analytics workloads involving multiple exabyte-scale data sources. At Meta, Presto is used for low-latency interactive use cases as well as long-running ETL jobs. It was originally launched at Meta in 2013 and donated to the Linux Foundation in 2019. Over the last ten years, maintaining query latency and scalability amid the rapid growth of data volume at Meta, together with new SQL analytics requirements, has posed significant challenges for Presto. A top priority has been ensuring that query reliability does not regress with the shift towards smaller, more elastic container allocations, under which queries must run with substantially less memory headroom and can be preempted at any time. Additionally, new demands from machine learning, privacy, and graph analytics have driven Presto maintainers to think beyond traditional data analytics. In this paper, we discuss several successful evolutions in recent years that have improved Presto latency as well as scalability by several orders of magnitude in production at Meta. Notable examples include hierarchical caching, native vectorized execution engines, materialized views, and Presto on Spark. With these new capabilities, we have deprecated or are in the process of deprecating various legacy query engines so that Presto becomes the single engine serving interactive, ad-hoc, ETL, and graph processing workloads for the entire data warehouse.


Supplemental Material

PACMMOD-V1mod189.mp4 (mp4, 19.1 MB)

Published in

Proceedings of the ACM on Management of Data (PACMMOD), Volume 1, Issue 2
June 2023
2310 pages
EISSN: 2836-6573
DOI: 10.1145/3605748

Copyright © 2023 ACM

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States
