DOI: 10.1145/1989323.1989447
research-article

Efficient processing of data warehousing queries in a split execution environment

Published: 12 June 2011

Abstract

Hadapt is a start-up company currently commercializing the Yale University research project called HadoopDB. The company focuses on building a platform for Big Data analytics in the cloud by introducing a storage layer optimized for structured data and by providing a framework for executing SQL queries efficiently. This work considers processing data warehousing queries over very large datasets. Our goal is to maximize performance while, at the same time, not giving up fault tolerance and scalability. We analyze the complexity of this problem in the split execution environment of HadoopDB. Here, incoming queries are examined; parts of the query are pushed down and executed inside the higher performing database layer; and the rest of the query is processed in a more generic MapReduce framework.
In this paper, we discuss in detail performance-oriented query execution strategies for data warehouse queries in split execution environments, with particular focus on join and aggregation operations. The efficiency of our techniques is demonstrated by running experiments using the TPC-H benchmark with 3TB of data. In these experiments we compare our results with a standard commercial parallel database and an open-source MapReduce implementation featuring a SQL interface (Hive). We show that HadoopDB successfully competes with other systems.
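The split execution idea described in the abstract can be illustrated with a small sketch. It is not the HadoopDB implementation: it assumes in-memory SQLite databases stand in for the node-local database layer, a plain Python function stands in for the MapReduce reduce step, and the `sales` table, its columns, and all function names are hypothetical. The filter and partial aggregation are pushed down into each node's database, and only the small partial results are merged afterwards.

```python
import sqlite3
from collections import defaultdict


def make_node(rows):
    """Create an in-memory SQLite database standing in for one node-local DBMS (assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")  # hypothetical schema
    conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)
    return conn


def pushed_down_part(conn):
    """Run the filter and the partial aggregation inside the node's database."""
    sql = (
        "SELECT region, SUM(amount), COUNT(*) FROM sales "
        "WHERE amount > 0 GROUP BY region"
    )
    return conn.execute(sql).fetchall()


def merge_part(partials):
    """Combine per-node partial aggregates, as a reduce step would."""
    totals = defaultdict(lambda: [0.0, 0])
    for region, partial_sum, partial_count in partials:
        totals[region][0] += partial_sum
        totals[region][1] += partial_count
    # The average is derived from the merged sum and count, not from per-node averages.
    return {r: (s, c, s / c) for r, (s, c) in totals.items()}


def run_split_query(partitions):
    """Execute the pushed-down part on every partition, then merge the results."""
    partials = []
    for rows in partitions:
        conn = make_node(rows)
        try:
            partials.extend(pushed_down_part(conn))
        finally:
            conn.close()
    return merge_part(partials)


# Two hypothetical data partitions, one per node.
partitions = [
    [("east", 10.0), ("west", 4.0), ("east", -1.0)],
    [("east", 6.0), ("west", 8.0)],
]
result = run_split_query(partitions)
print(result)
```

Each node ships one row per group instead of its raw tuples, which is the reason for pushing the aggregation down. An average cannot be merged from per-node averages, so the sketch carries the sum and count separately and divides only at the end.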

Published In

SIGMOD '11: Proceedings of the 2011 ACM SIGMOD International Conference on Management of data
June 2011
1364 pages
ISBN:9781450306614
DOI:10.1145/1989323
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. Hadoop
  2. mapreduce
  3. query execution

Qualifiers

  • Research-article

Conference

SIGMOD/PODS '11
Acceptance Rates

Overall Acceptance Rate 785 of 4,003 submissions, 20%

Article Metrics

  • Downloads (Last 12 months): 20
  • Downloads (Last 6 weeks): 2
Reflects downloads up to 03 Mar 2025

Cited By

  • (2024) Genetic Algorithm-Based Approach for Optimizing Query Performance in Big Data Environments. 2024 IEEE International Conference on Advanced Systems and Emergent Technologies (IC_ASET), (1-6). DOI: 10.1109/IC_ASET61847.2024.10596192. Online publication date: 27-Apr-2024
  • (2023) Using Cloud Functions as Accelerator for Elastic Data Analytics. Proceedings of the ACM on Management of Data, 1:2, (1-27). DOI: 10.1145/3589306. Online publication date: 20-Jun-2023
  • (2020) Replication at the speed of change. Proceedings of the VLDB Endowment, 13:12, (3245-3257). DOI: 10.14778/3415478.3415548. Online publication date: 1-Aug-2020
  • (2019) On supporting efficient snapshot isolation for hybrid workloads with multi-versioned indexes. Proceedings of the VLDB Endowment, 13:2, (211-225). DOI: 10.14778/3364324.3364334. Online publication date: 1-Oct-2019
  • (2019) Integration of large-scale data processing systems and traditional parallel database technology. Proceedings of the VLDB Endowment, 12:12, (2290-2299). DOI: 10.14778/3352063.3352145. Online publication date: 1-Aug-2019
  • (2019) Handling Big Data Using a Data-Aware HDFS and Evolutionary Clustering Technique. IEEE Transactions on Big Data, 5:2, (134-147). DOI: 10.1109/TBDATA.2017.2782785. Online publication date: 1-Jun-2019
  • (2019) Presto: SQL on Everything. 2019 IEEE 35th International Conference on Data Engineering (ICDE), (1802-1813). DOI: 10.1109/ICDE.2019.00196. Online publication date: Apr-2019
  • (2019) High Performance Secondary Index Design for Complex Queries in Smart Grid System. Journal of Physics: Conference Series, 1346, (012004). DOI: 10.1088/1742-6596/1346/1/012004. Online publication date: 22-Nov-2019
  • (2019) Parallel Join Algorithms in MapReduce. Encyclopedia of Big Data Technologies, (1248-1253). DOI: 10.1007/978-3-319-77525-8_206. Online publication date: 20-Feb-2019
  • (2018) Parallel Join Algorithms in MapReduce. Encyclopedia of Big Data Technologies, (1-6). DOI: 10.1007/978-3-319-63962-8_206-1. Online publication date: 5-Mar-2018
