DOI: 10.1145/2463676.2465288
research-article

Shark: SQL and rich analytics at scale

Published: 22 June 2013

Abstract

Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and that recovers efficiently from failures mid-query. This allows Shark to run SQL queries up to 100X faster than Apache Hive, and machine learning programs more than 100X faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine and the fine-grained fault tolerance properties that such an engine provides. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to execute SQL effectively. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.
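The two engine extensions named in the abstract can be illustrated with a small sketch. This is not Shark's implementation. It is a toy Python example written under stated assumptions: `to_columns`, `choose_join_strategy`, and the `broadcast_limit` threshold are hypothetical names introduced here, and the join-strategy rule is only one plausible form that mid-query replanning could take once actual input sizes have been observed.

```python
from typing import Any, Dict, List

Row = Dict[str, Any]


def to_columns(rows: List[Row]) -> Dict[str, List[Any]]:
    """Pivot row dicts into one list per column (hypothetical helper).

    Assumes every row has the same keys; a missing key would
    misalign the resulting columns.
    """
    columns: Dict[str, List[Any]] = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def choose_join_strategy(left_bytes: int, right_bytes: int,
                         broadcast_limit: int) -> str:
    """Pick a join strategy from sizes observed at run time.

    Illustrative rule only: if the smaller input fits under the
    (hypothetical) broadcast_limit, ship it to every worker;
    otherwise repartition both inputs.
    """
    smaller = min(left_bytes, right_bytes)
    return "broadcast" if smaller <= broadcast_limit else "shuffle"


rows = [
    {"user": "a", "clicks": 3},
    {"user": "b", "clicks": 5},
    {"user": "c", "clicks": 2},
]

# Column-oriented layout: an aggregate over "clicks" reads only that
# column and never touches the "user" values.
columns = to_columns(rows)
total_clicks = sum(columns["clicks"])
```

In the columnar layout, a query such as a sum over one field scans a single contiguous list instead of every full row. The strategy function shows why deferring a planning decision can help: the sizes it receives are measured during execution, not estimated beforehand.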




    Published In

    SIGMOD '13: Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data
    June 2013
    1322 pages
    ISBN:9781450320375
    DOI:10.1145/2463676
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. data warehouse
    2. databases
    3. hadoop
    4. machine learning
    5. shark
    6. spark

    Qualifiers

    • Research-article

    Conference

    SIGMOD/PODS'13

    Acceptance Rates

    SIGMOD '13 Paper Acceptance Rate 76 of 372 submissions, 20%;
    Overall Acceptance Rate 785 of 4,003 submissions, 20%

