DOI: 10.1145/3319647.3325854
poster

Big data skipping in the cloud

Published: 22 May 2019

ABSTRACT

According to today's best practices, cloud compute and storage services should be deployed and managed independently. However, this creates a problem for big data analytics in the cloud: potentially huge datasets need to be shipped from the storage service to the compute service to analyse the data. Minimizing the amount of data sent across the network is therefore critical to achieving good performance and low cost. Data skipping is a technique which achieves this for SQL-style analytics on structured data.

Data skipping stores summary metadata for each object (or file) in a dataset. For each column in the object, the summary might include minimum and maximum values, a list or Bloom filter of the appearing values, or other metadata which succinctly represents the data in that column. This metadata can then be indexed to support efficient retrieval, although since it can be orders of magnitude smaller than the data itself, this step may not be essential. The metadata is used during query evaluation to skip over objects which have no relevant data. False positives for object relevance are acceptable, since the query execution engine will ultimately filter the data at the row level. However, false negatives must be avoided to ensure correctness of query results.
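As an illustration of the summary metadata described above, the following is a minimal sketch in plain Python, not the authors' implementation. It builds a min/max summary plus a small Bloom filter for one column of one object, then answers "might this object contain a matching row?" for two predicate shapes. All names (`BloomFilter`, `ColumnSummary`, `summarize`, `may_match_equals`, `may_match_greater_than`) and the filter parameters are assumptions made for this example.

```python
import hashlib
from dataclasses import dataclass


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=256, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit set

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value!r}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


@dataclass
class ColumnSummary:
    """Hypothetical per-object, per-column summary metadata."""
    min_value: object
    max_value: object
    bloom: BloomFilter


def summarize(values):
    """Build the summary for one column of one object (values must be non-empty)."""
    bloom = BloomFilter()
    for v in values:
        bloom.add(v)
    return ColumnSummary(min(values), max(values), bloom)


def may_match_equals(summary, literal):
    """True if the object might hold a row with column == literal."""
    in_range = summary.min_value <= literal <= summary.max_value
    return in_range and summary.bloom.might_contain(literal)


def may_match_greater_than(summary, literal):
    """True if the object might hold a row with column > literal."""
    return summary.max_value > literal
```

With `summarize([10, 20, 30])`, an equality test against 99 is rejected by the min/max bounds alone, so the object can be skipped. A test against 15 falls inside the bounds and is left to the Bloom filter, which may occasionally answer "maybe" for an absent value. That is a false positive, which costs an unnecessary read but cannot change the query result.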

Unlike fully inverted database indexes, data skipping indexes are much smaller than the data itself. This property is critical in the cloud, since otherwise a full index scan could increase the amount of data sent across the network instead of reducing it. In the context of database systems, data skipping is used as an additional technique which complements classical indexes. It is referred to as synopsis in DB2 [6] and zone maps in Oracle [9], where in both cases it is limited to min/max metadata. Data skipping, and the associated topic of data layout, have been addressed in recent research papers [7, 8], and data skipping is also used in cloud analytics platforms [3, 4]. Data skipping can also be built into specific data formats [1].

We implemented data skipping support for Apache Spark SQL [2] without changing core Spark, in the form of an add-on Scala library which can be added to the classpath and used in Spark applications. Our work applies to storage systems which implement the Hadoop FileSystem API, which includes various object storage systems as well as HDFS. Metadata is stored in Elasticsearch (ES) [5], and additional metadata stores can be supported in future using a pluggable API. Our approach prunes the list of candidate objects for any given Spark SQL query according to the associated data skipping metadata, stored and indexed in ES. Our technique applies to all native formats supported by Spark, e.g. JSON, CSV, Avro, Parquet and ORC, and can benefit from the latest optimizations built into those formats in Spark. Unlike approaches which embed data skipping metadata inside the data format itself [1], which require reading at least part of the object, our approach avoids touching irrelevant objects altogether.
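The pruning step can be pictured with a short sketch. The work itself is a Scala library backed by Elasticsearch; the plain Python below is only an illustration under stated assumptions and does not reproduce that library's API. `InMemoryMetadataStore` is a hypothetical stand-in for the external metadata store, `range_overlaps` and `prune` are invented helper names, and the object paths and the `temp` column are made up.

```python
from typing import Dict, Optional, Tuple

# Hypothetical per-object metadata: column name -> (min, max).
ObjectMetadata = Dict[str, Tuple[float, float]]


class InMemoryMetadataStore:
    """Stand-in for an external metadata store, kept in a dict for the example."""

    def __init__(self):
        self._by_path: Dict[str, ObjectMetadata] = {}

    def put(self, path: str, metadata: ObjectMetadata) -> None:
        self._by_path[path] = metadata

    def get(self, path: str) -> Optional[ObjectMetadata]:
        return self._by_path.get(path)


def range_overlaps(column, low, high):
    """Build a check for `low <= column <= high` against min/max metadata."""
    def check(metadata):
        if column not in metadata:
            return True  # no summary for this column: the object cannot be skipped
        col_min, col_max = metadata[column]
        return col_min <= high and col_max >= low
    return check


def prune(candidates, store, checks):
    """Keep the objects that might contain rows satisfying every check (AND)."""
    kept = []
    for path in candidates:
        metadata = store.get(path)
        # Objects without metadata are kept, so pruning never drops results.
        if metadata is None or all(check(metadata) for check in checks):
            kept.append(path)
    return kept


# Made-up dataset listing; part-2 has no metadata recorded.
store = InMemoryMetadataStore()
store.put("data/part-0.parquet", {"temp": (-5.0, 12.0)})
store.put("data/part-1.parquet", {"temp": (15.0, 31.0)})
candidates = [
    "data/part-0.parquet",
    "data/part-1.parquet",
    "data/part-2.parquet",
]

# Corresponds to a filter such as: WHERE temp BETWEEN 20 AND 25
selected = prune(candidates, store, [range_overlaps("temp", 20.0, 25.0)])
print(selected)  # ['data/part-1.parquet', 'data/part-2.parquet']
```

The pruning decision is made from metadata alone, so `data/part-0.parquet` is never opened. Keeping objects whose metadata is missing is the conservative choice: it can cost extra reads, but it cannot produce a false negative.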

References

  1. 2019. Apache Parquet. https://parquet.apache.org/
  2. 2019. Apache Spark. https://spark.apache.org/
  3. 2019. Data Skipping for IBM Cloud SQL Query. https://www.ibm.com/blogs/bluemix/2019/03/data-skipping-for-ibm-cloud-sql-query/
  4. 2019. Databricks Delta Guide. https://docs.databricks.com/delta/optimizations.html#delta-data-skipping
  5. 2019. Elasticsearch. https://www.elastic.co/products/elasticsearch
  6. Vijayshankar Raman et al. 2013. DB2 with BLU acceleration: So much more than just a column store. Proceedings of the VLDB Endowment 6, 11 (2013), 1080-1091.
  7. Anil Shanbhag, Alekh Jindal, Samuel Madden, Jorge Quiane, and Aaron J Elmore. 2017. A robust partitioning scheme for ad-hoc query workloads. In Proceedings of the 2017 Symposium on Cloud Computing. ACM.
  8. Liwen Sun, Michael J Franklin, Sanjay Krishnan, and Reynold S Xin. 2014. Fine-grained partitioning for aggressive data skipping. In Proceedings of the 2014 SIGMOD. ACM.
  9. Mohamed Ziauddin, Andrew Witkowski, You Jung Kim, Dmitry Potapov, Janaki Lahorani, and Murali Krishna. 2017. Dimensions based data clustering and zone maps. Proceedings of the VLDB Endowment 10, 12 (2017), 1622-1633.

Published in

SYSTOR '19: Proceedings of the 12th ACM International Conference on Systems and Storage
May 2019
211 pages
ISBN: 9781450367493
DOI: 10.1145/3319647

Copyright © 2019 Owner/Author

Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the Owner/Author.

Publisher

Association for Computing Machinery
New York, NY, United States

Acceptance Rates

Overall Acceptance Rate: 94 of 285 submissions, 33%
