Efficient processing of data warehousing queries in a split execution environment

Authors:

Kamil Bajda-Pawlikowski,

Daniel J. Abadi,

Avi Silberschatz,

Erik Paulson

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

Pages 1165 - 1176

https://doi.org/10.1145/1989323.1989447

Published: 12 June 2011

Abstract

Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework.

In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.
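To make the split execution idea in the abstract concrete, the following is a minimal sketch and not HadoopDB's actual implementation. It assumes a hypothetical `orders` table partitioned across three nodes, uses in-memory SQLite databases as a stand-in for the per-node database layer, and uses plain Python functions as a stand-in for the MapReduce framework. The names `PARTITIONS`, `PUSHED_DOWN_SQL`, `make_node`, `map_phase`, `reduce_phase`, and `run_query` are all illustrative. The aggregation fragment is pushed down into each node's database, and only the small partial results are merged in the generic reduce step.

```python
import sqlite3
from collections import defaultdict

# Hypothetical data: each inner list is the slice of an `orders` table held by one node.
PARTITIONS = [
    [("EU", 10.0), ("US", 5.0), ("EU", 2.5)],
    [("US", 7.0), ("ASIA", 1.0)],
    [("EU", 4.0), ("ASIA", 3.0), ("US", 1.0)],
]

# Query fragment pushed down into each node's local database (assumed schema).
PUSHED_DOWN_SQL = """
    SELECT region, SUM(amount) AS partial_sum, COUNT(*) AS partial_cnt
    FROM orders
    GROUP BY region
"""


def make_node(rows):
    """Create an in-memory SQLite database standing in for one node's database layer."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (region TEXT, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
    return conn


def map_phase(conn):
    """Run the pushed-down fragment locally and emit (region, sum, count) partials."""
    return conn.execute(PUSHED_DOWN_SQL).fetchall()


def reduce_phase(partials):
    """Merge per-node partial aggregates into the final sum, count, and average."""
    totals = defaultdict(lambda: [0.0, 0])
    for region, partial_sum, partial_cnt in partials:
        totals[region][0] += partial_sum
        totals[region][1] += partial_cnt
    return {region: (s, c, s / c) for region, (s, c) in totals.items()}


def run_query(partitions):
    """Execute the split plan: local SQL on every node, then a global merge."""
    partials = []
    for rows in partitions:
        conn = make_node(rows)
        try:
            partials.extend(map_phase(conn))
        finally:
            conn.close()
    return reduce_phase(partials)


if __name__ == "__main__":
    for region, (total, count, avg) in sorted(run_query(PARTITIONS).items()):
        print(region, total, count, avg)
```

In this sketch the average is computed from pushed-down sums and counts rather than from per-node averages, because averages of partitions cannot be merged correctly without their counts.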

Index Terms

Efficient processing of data warehousing queries in a split execution environment
1. Information systems
  1. Data management systems
    1. Database management system engines
      1. Database query processing
2. Theory of computation
  1. Theory and algorithms for application domains
    1. Database theory
      1. Database query processing and optimization (theory)

Published In

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data

June 2011

1364 pages

ISBN:9781450306614

DOI:10.1145/1989323

General Chair: Timos Sellis (IMIS/RC Athena)

Program Chair: Renée J. Miller (University of Toronto)

Publications Chairs: Anastasios Kementsietsidis (IBM T.J. Watson Research Center), Yannis Velegrakis (University of Trento)

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Sponsors

SIGMOD: ACM Special Interest Group on Management of Data

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article

Conference

SIGMOD/PODS '11

Sponsor:

SIGMOD

SIGMOD/PODS '11: International Conference on Management of Data

June 12 - 16, 2011

Athens, Greece

Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Bibliometrics

Total Citations: 53

Total Downloads: 1,345

Downloads (Last 12 months): 20

Downloads (Last 6 weeks): 2

Reflects downloads up to 03 Mar 2025

Cited By

Rabaaoui S, Aloui K, Naceur M, Barkaoui K (2024) Genetic Algorithm-Based Approach for Optimizing Query Performance in Big Data Environments. 2024 IEEE International Conference on Advanced Systems and Emergent Technologies (IC_ASET) (1-6). Online publication date: 27-Apr-2024
https://doi.org/10.1109/IC_ASET61847.2024.10596192
Bian H, Sha T, Ailamaki A (2023) Using Cloud Functions as Accelerator for Elastic Data Analytics. Proceedings of the ACM on Management of Data 1:2 (1-27). Online publication date: 20-Jun-2023
https://doi.org/10.1145/3589306
Butterstein D, Martin D, Stolze K, Beier F, Zhong J, Wang L (2020) Replication at the speed of change. Proceedings of the VLDB Endowment 13:12 (3245-3257). Online publication date: 1-Aug-2020
https://dl.acm.org/doi/10.14778/3415478.3415548
Sun Y, Blelloch G, Lim W, Pavlo A (2019) On supporting efficient snapshot isolation for hybrid workloads with multi-versioned indexes. Proceedings of the VLDB Endowment 13:2 (211-225). Online publication date: 1-Oct-2019
https://dl.acm.org/doi/10.14778/3364324.3364334
Abouzied A, Abadi D, Bajda-Pawlikowski K, Silberschatz A (2019) Integration of large-scale data processing systems and traditional parallel database technology. Proceedings of the VLDB Endowment 12:12 (2290-2299). Online publication date: 1-Aug-2019
https://dl.acm.org/doi/10.14778/3352063.3352145
Hajeer M, Dasgupta D (2019) Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique. IEEE Transactions on Big Data 5:2 (134-147). Online publication date: 1-Jun-2019
https://doi.org/10.1109/TBDATA.2017.2782785
Sethi R, Traverso M, Sundstrom D, Phillips D, Xie W, Sun Y, Yegitbasi N, Jin H, Hwang E, Shingte N, Berner C (2019) Presto: SQL on Everything. 2019 IEEE 35th International Conference on Data Engineering (ICDE) (1802-1813). Online publication date: Apr-2019
https://doi.org/10.1109/ICDE.2019.00196
Jiafeng Q, Shitai S, Longlong L, Chengqi L, Demeng B, Wenjie Z (2019) High Performance Secondary Index Design for Complex Queries in Smart Grid System. Journal of Physics: Conference Series 1346 (012004). Online publication date: 22-Nov-2019
https://doi.org/10.1088/1742-6596/1346/1/012004
Blanas S (2019) Parallel Join Algorithms in MapReduce. Encyclopedia of Big Data Technologies (1248-1253). Online publication date: 20-Feb-2019
https://doi.org/10.1007/978-3-319-77525-8_206
Blanas S (2018) Parallel Join Algorithms in MapReduce. Encyclopedia of Big Data Technologies (1-6). Online publication date: 5-Mar-2018
https://doi.org/10.1007/978-3-319-63962-8_206-1
