ABSTRACT
Next-generation flash storage will be equipped with a substantial amount of computing power. In this paper, we investigate opportunities to use this computational capability to optimize Online Analytical Processing (OLAP) applications. We focus our analysis on the performance of a subset of TPC-DS queries running on Hadoop clusters with two database engines, SPARK-SQL and Presto. We model the expected speed-up achieved by offloading a few operations that are executed first within most SQL plans. Offloading these operations requires minimal cooperation from the database engine and no changes to the existing plan. We show that the speed-up achieved varies significantly among queries and between engines, and that the queries benefiting the most are I/O-heavy, highly selective queries of the "needle in the haystack" variety. Our main contribution is an estimate of the speed-up anticipated from pushing the execution of a few key SQL building blocks (scan, filter, and project operations) to computational storage when using read-optimized, columnar Parquet format files.
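To make the idea of a modeled speed-up concrete, here is a minimal sketch that applies an Amdahl's-law-style estimate to offloading. It is not the model used in the paper: the function name `offload_speedup`, its two parameters, and the example fractions are all assumptions chosen for illustration. It only shows why a query that spends most of its time in scan, filter, and project operations would be expected to gain more than a compute-bound one.

```python
def offload_speedup(offload_fraction: float, storage_speedup: float) -> float:
    """Amdahl-style estimate of whole-query speed-up (illustrative assumption).

    offload_fraction: share of baseline query time spent in the offloaded
        scan/filter/project work, between 0 and 1.
    storage_speedup: how many times faster that share runs when executed
        on the computational storage device, greater than 0.
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be in [0, 1]")
    if storage_speedup <= 0.0:
        raise ValueError("storage_speedup must be positive")
    # Time left after offloading, relative to a baseline of 1.0.
    remaining = (1.0 - offload_fraction) + offload_fraction / storage_speedup
    return 1.0 / remaining


if __name__ == "__main__":
    # Hypothetical queries: the fractions are made up, not measured values.
    for name, fraction in [("io_heavy_query", 0.8), ("cpu_heavy_query", 0.2)]:
        print(f"{name}: {offload_speedup(fraction, 4.0):.2f}x")
```

Under these assumed numbers the I/O-heavy query gains far more than the CPU-heavy one, which matches the qualitative claim in the abstract that the benefit varies strongly from query to query.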
Index Terms
- Modeling Analytics for Computational Storage