poster

Big data skipping in the cloud

Authors:
Oshrit Feder, Guy Khazma, Gal Lushi, Yosef Moatti, Paula Ta-Shma

IBM Research, Haifa, Israel

SYSTOR '19: Proceedings of the 12th ACM International Conference on Systems and Storage, May 2019, Page 193. https://doi.org/10.1145/3319647.3325854

Published: 22 May 2019

ABSTRACT

According to today's best practices, cloud compute and storage services should be deployed and managed independently. However, this creates a problem for big data analytics in the cloud: potentially huge datasets must be shipped from the storage service to the compute service before the data can be analysed. Minimizing the amount of data sent across the network is therefore critical to achieving good performance and low cost. Data skipping is a technique which achieves this for SQL-style analytics on structured data.

Data skipping stores summary metadata for each object (or file) in a dataset. For each column in the object, the summary might include minimum and maximum values, a list or Bloom filter of the values that appear, or other metadata which succinctly represents the data in that column. This metadata can then be indexed to support efficient retrieval, although this step may not be essential, since the metadata can be orders of magnitude smaller than the data itself. During query evaluation, the metadata is used to skip over objects which have no relevant data. False positives for object relevance are acceptable, since the query execution engine will ultimately filter the data at the row level. However, false negatives must be avoided to ensure correctness of query results.
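The following is a minimal Python sketch of the idea described above, not the authors' implementation. It builds min/max and value-list metadata for one column of each object and uses it to decide which objects a query may skip. The object names, the `temp` column, and all function names are hypothetical and chosen only for illustration; a Bloom filter could take the place of the plain value set.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Iterable, Set


@dataclass
class ColumnSummary:
    """Summary metadata for one column of one object (hypothetical layout)."""
    minimum: Any
    maximum: Any
    values: Set[Any] = field(default_factory=set)  # exact value list


def summarize(rows: Iterable[Dict[str, Any]], column: str) -> ColumnSummary:
    """Build min/max and value-list metadata from an object's rows."""
    data = [row[column] for row in rows]
    return ColumnSummary(min(data), max(data), set(data))


def may_match_range(summary: ColumnSummary, low: Any, high: Any) -> bool:
    """True if the object might hold a value in [low, high].

    A True answer can be a false positive; a False answer is always safe.
    """
    return summary.minimum <= high and summary.maximum >= low


def may_match_equals(summary: ColumnSummary, value: Any) -> bool:
    """True if the value appears in the object's value list."""
    return value in summary.values


# Hypothetical dataset of three objects, each holding a few rows.
objects = {
    "logs/part-0": [{"temp": 3}, {"temp": 9}],
    "logs/part-1": [{"temp": 20}, {"temp": 42}],
    "logs/part-2": [{"temp": 1}, {"temp": 50}],
}
metadata = {name: summarize(rows, "temp") for name, rows in objects.items()}

# WHERE temp BETWEEN 10 AND 15: part-2 is kept although no row matches
# (a false positive); part-0 and part-1 are skipped.
range_candidates = sorted(
    name for name, s in metadata.items() if may_match_range(s, 10, 15)
)

# WHERE temp = 42: the value list pinpoints the single relevant object.
equals_candidates = sorted(
    name for name, s in metadata.items() if may_match_equals(s, 42)
)
```

In this sketch the range query still reads `logs/part-2` even though none of its rows qualify. That is harmless, because the query engine filters rows afterwards; what matters is that no object containing a matching row is ever skipped.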

Unlike fully inverted database indexes, data skipping indexes are much smaller than the data itself. This property is critical in the cloud, since otherwise a full index scan could increase the amount of data sent across the network instead of reducing it. In the context of database systems, data skipping is used as an additional technique which complements classical indexes. It is referred to as synopsis in DB2 [6] and zone maps in Oracle [9], where in both cases it is limited to min/max metadata. Data skipping and the associated topic of data layout have been addressed in recent research papers [7, 8], and data skipping is also used in cloud analytics platforms [3, 4]. Data skipping can also be built into specific data formats [1].

We implemented data skipping support for Apache Spark SQL [2] without changing core Spark, in the form of an add-on Scala library which can be added to the classpath and used in Spark applications. Our work applies to storage systems which implement the Hadoop FileSystem API, which includes various object storage systems as well as HDFS. Metadata is stored in Elasticsearch (ES) [5], and additional metadata stores can be supported in the future using a pluggable API. Our approach prunes the list of candidate objects for any given Spark SQL query according to the associated data skipping metadata, stored and indexed in ES. Our technique applies to all formats natively supported by Spark (e.g. JSON, CSV, Avro, Parquet, ORC) and can benefit from the latest optimizations built into those formats in Spark. Approaches which embed data skipping metadata inside the data format itself [1] require reading at least part of each object; our approach avoids touching irrelevant objects altogether.
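As a rough illustration of pruning a candidate object list against an external metadata store, here is a short Python sketch. It is not the library's API: the `MetadataStore` interface, the in-memory store standing in for Elasticsearch, the `prune` function, and the object names are all assumptions made for this example, and no Spark or Elasticsearch calls are used.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple

Bounds = Tuple[float, float]


class MetadataStore(ABC):
    """Hypothetical pluggable interface for data skipping metadata."""

    @abstractmethod
    def min_max(self, object_name: str, column: str) -> Optional[Bounds]:
        """Return (min, max) for the column, or None if nothing is stored."""


class InMemoryStore(MetadataStore):
    """Stand-in for an external store such as Elasticsearch."""

    def __init__(self, entries: Dict[Tuple[str, str], Bounds]) -> None:
        self._entries = entries

    def min_max(self, object_name: str, column: str) -> Optional[Bounds]:
        return self._entries.get((object_name, column))


def prune(candidates: List[str], store: MetadataStore,
          column: str, low: float, high: float) -> List[str]:
    """Keep only objects that may hold rows with low <= column <= high."""
    kept = []
    for name in candidates:
        bounds = store.min_max(name, column)
        # An object without metadata must be kept: skipping it could
        # produce a false negative and therefore a wrong query result.
        if bounds is None or (bounds[0] <= high and bounds[1] >= low):
            kept.append(name)
    return kept


# Metadata lives outside the objects, so the data format does not matter.
store = InMemoryStore({
    ("data/a.parquet", "year"): (2001, 2005),
    ("data/b.csv", "year"): (2010, 2015),
})
listing = ["data/a.parquet", "data/b.csv", "data/c.json"]  # c has no metadata

# WHERE year BETWEEN 2012 AND 2020
to_read = prune(listing, store, "year", 2012, 2020)
```

Here `data/a.parquet` is dropped from the listing without being opened, while `data/c.json` is kept because no metadata exists for it. Because the pruning step only sees object names and stored summaries, the same logic applies regardless of whether an object is Parquet, CSV, or JSON.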

References

[1] 2019. Apache Parquet. https://parquet.apache.org/
[2] 2019. Apache Spark. https://spark.apache.org/
[3] 2019. Data Skipping for IBM Cloud SQL Query. https://www.ibm.com/blogs/bluemix/2019/03/data-skipping-for-ibm-cloud-sql-query/
[4] 2019. Databricks Delta Guide. https://docs.databricks.com/delta/optimizations.html#delta-data-skipping
[5] 2019. Elasticsearch. https://www.elastic.co/products/elasticsearch
[6] Vijayshankar Raman et al. 2013. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment 6, 11 (2013), 1080-1091.
[7] Anil Shanbhag, Alekh Jindal, Samuel Madden, Jorge Quiane, and Aaron J Elmore. 2017. A robust partitioning scheme for ad-hoc query workloads. In Proceedings of the 2017 Symposium on Cloud Computing. ACM.
[8] Liwen Sun, Michael J Franklin, Sanjay Krishnan, and Reynold S Xin. 2014. Fine-grained partitioning for aggressive data skipping. In Proceedings of the 2014 SIGMOD. ACM.
[9] Mohamed Ziauddin, Andrew Witkowski, You Jung Kim, Dmitry Potapov, Janaki Lahorani, and Murali Krishna. 2017. Dimensions based data clustering and zone maps. Proceedings of the VLDB Endowment 10, 12 (2017), 1622-1633.

Index Terms

Information systems
  Data management systems
    Database management system engines
      Database query processing
        Query optimization
        Query planning

Published in
SYSTOR '19: Proceedings of the 12th ACM International Conference on Systems and Storage
May 2019, 211 pages
ISBN: 9781450367493
DOI: 10.1145/3319647
General Chair: Moshik Hershcovitch (IBM Research)
Program Chairs: Ashvin Goel (University of Toronto), Adam Morrison (Tel Aviv University)
Copyright © 2019 Owner/Author
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.
Publisher: Association for Computing Machinery, New York, NY, United States