research-article

Modeling Analytics for Computational Storage

Authors:
Veronica Lagrange Moutinho dos Reis

Samsung Semiconductor, Inc., San Jose, CA, USA

Harry (Huan) Li

Samsung Semiconductor, Inc., San Jose, CA, USA

Anahita Shayesteh

Samsung Semiconductor, Inc., San Jose, CA, USA


ICPE '20: Proceedings of the ACM/SPEC International Conference on Performance Engineering, April 2020, Pages 88–99
https://doi.org/10.1145/3358960.3375794

Published: 20 April 2020

ABSTRACT

Next-generation flash storage will be armed with a substantial amount of computing power. In this paper, we investigate opportunities to use this computational capability to optimize Online Analytical Processing (OLAP) applications. We analyze the performance of a subset of TPC-DS queries on Hadoop clusters with two database engines, SPARK-SQL and Presto. We model the expected speed-up achieved by offloading a few operations that are executed first within most SQL plans. Offloading these operations requires minimal cooperation from the database engine and no changes to the existing plan. We show that the speed-up achieved varies significantly among queries and between engines, and that the queries benefiting the most are I/O-heavy, highly selective queries of the "needle in the haystack" variety. Our main contribution is an estimate of the speed-up anticipated from pushing the execution of a few key SQL building blocks (scan, filter, and project operations) to computational storage when using read-optimized, columnar Parquet format files.
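
The abstract does not give the model itself, so the following is only a rough illustration of why the expected gain depends so strongly on the query. It is a generic Amdahl's-law estimate, not the model developed in the paper. The function name, its parameters, and the input values are all hypothetical, chosen to contrast a query dominated by scan, filter, and project work with one that spends most of its time elsewhere.

```python
def offload_speedup(offload_fraction: float, offload_factor: float) -> float:
    """Amdahl's-law estimate of whole-query speed-up (illustrative assumption).

    offload_fraction: share of baseline query time spent in the operations
        that are pushed to storage (scan, filter, project), between 0 and 1.
    offload_factor: how many times faster that share runs once offloaded.
    """
    if not 0.0 <= offload_fraction <= 1.0:
        raise ValueError("offload_fraction must be between 0 and 1")
    if offload_factor <= 0.0:
        raise ValueError("offload_factor must be positive")
    # Time that stays on the host plus the shortened offloaded time.
    remaining_time = (1.0 - offload_fraction) + offload_fraction / offload_factor
    return 1.0 / remaining_time


# Hypothetical inputs, not measurements from the paper.
io_heavy = offload_speedup(offload_fraction=0.9, offload_factor=10.0)
cpu_heavy = offload_speedup(offload_fraction=0.2, offload_factor=10.0)
print(f"I/O-heavy query: {io_heavy:.2f}x, CPU-heavy query: {cpu_heavy:.2f}x")
```

Under these assumed inputs the I/O-heavy query gains far more than the CPU-heavy one, even though the offloaded operations are accelerated by the same factor in both cases. This is the qualitative pattern the abstract reports.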

Published in
ICPE '20: Proceedings of the ACM/SPEC International Conference on Performance Engineering
April 2020
319 pages
ISBN:9781450369916
DOI:10.1145/3358960
General Chairs:
J. Nelson Amaral, University of Alberta, Canada
Anne Koziolek, Karlsruhe Institute of Technology (KIT), Germany
Program Chairs:
Catia Trubiani, Gran Sasso Science Institute, GSSI, Italy
Alexandru Iosup, VU Amsterdam, Netherlands
Copyright © 2020 ACM
Publisher
Association for Computing Machinery
New York, NY, United States
Author Tags
OLAP
SQL
TPC-DS
acceleration
columnar database
offloading
parquet
presto
smart storage
spark
Acceptance Rates
ICPE '20 Paper Acceptance Rate: 15 of 62 submissions, 24%
Overall Acceptance Rate: 252 of 851 submissions, 30%