research-article

Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse

Authors:
Yuanyuan Tian

IBM Research -- Almaden, San Jose, CA, USA

IBM Research -- Almaden, San Jose, CA, USA

0000-0002-6835-8434
View Profile

,
Fatma Özcan

IBM Research -- Almaden, San Jose, CA, USA

IBM Research -- Almaden, San Jose, CA, USA
View Profile

,
Tao Zou

Google, Mountain View, CA, USA

Google, Mountain View, CA, USA
View Profile

,
Romulo Goncalves

The Netherlands eScience Center, Netherlands

The Netherlands eScience Center, Netherlands
View Profile

,
Hamid Pirahesh

IBM Research -- Almaden, San Jose, CA, USA

IBM Research -- Almaden, San Jose, CA, USA
View Profile

Authors Info & Claims

ACM Transactions on Database Systems Volume 41 Issue 4Article No.: 21pp 1–38https://doi.org/10.1145/2972950

Published:02 November 2016Publication History

ACM Transactions on Database Systems

Abstract

The Hadoop Distributed File System (HDFS) has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. There are many applications that require correlating data stored in HDFS with EDW data, such as the analysis that associates click logs stored in HDFS with the sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side does not have efficient SQL support.

In this article, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize the data movement and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases. We further develop a sophisticated cost model for the various join algorithms and show that it can facilitate query optimization in the hybrid warehouse to correctly choose the right algorithm under different predicate and join selectivities.

References

Sibel Adali, K. Seluk Candan, Yannis Papakonstantinou, and V. S. Subrahmanian. 1996. Query caching and optimization in distributed mediator systems. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD'96). 137--146. Google ScholarDigital Library
Foto Afrati and Jeffrey Ullman. 2011. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering (TKDE) 23, 9 (2011), 1282--1298. Google ScholarDigital Library
Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939--964. Google ScholarDigital Library
Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, Till Westmann, Inci Cetindil, and Madhusudan Cheelangi. 2014. AsterixDB: A scalable, open source BDMS. PVLDB 7, 14 (2014), 1905--1916. Google ScholarDigital Library
Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD'15). 1383--1394. Google ScholarDigital Library
Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and Erik Paulson. 2011. Efficient processing of data warehousing queries in a split execution environment. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1165--1176. Google ScholarDigital Library
Chaitanya Baru, Gilles Fecteau, Ambuj Goyal, Hui-I Hsiao, Anant Jhingran, Sriram Padmanabhan, and Walter Wilson.1995. DB2 parallel edition. IBM Syst. J. 34, 2 (April 1995), 292--322. Google ScholarDigital Library
Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, and Yuanyuan Tian. 2010. A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 975--986. Google ScholarDigital Library
Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarDigital Library
Ronnie Chaiken, Bob Jenkins, Paul Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB 1, 2 (2008), 1265--1276. Google ScholarDigital Library
Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
David J. DeWitt and Robert H. Gerber. 1985. Multiprocessor hash-based join algorithms. In Proceedings of the 1985 International Conference on Very Large Data Bases (VLDB'85). 151--164. Google ScholarDigital Library
David J. DeWitt, Alan Halverson, Rimma V. Nehme, Srinath Shankar, Josep Aguilar-Saborit, Artin Avanes, Miro Flasza, and Jim Gramling. 2013. Split query processing in PolyBase. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD'13). 1255--1266. Google ScholarDigital Library
David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. 1984. Implementation techniques for main memory database systems. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD'84). 1--8. Google ScholarDigital Library
David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri. 1992. Practical skew handling in parallel joins. In Proceedings of the 1992 International Conference on Very Large Data Bases (VLDB'92). 27--40. Google ScholarDigital Library
Ronald Fagin, Ravi Kumar, and D. Sivakumar. 2003. Comparing top K lists. In Proceedings of the 2003 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 28--36. Google ScholarDigital Library
Doug Frazier. 2013. Dynamic Access: The SQL-H feature for the latest Teradata database leverages data in Hadoop. Retrieved from http://www.teradatamagazine.com/v13n02/Tech2Tech/Dynamic-Access.Google Scholar
Sumit Ganguly, Waqar Hasan, and Ravi Krishnamurthy. 1992. Query optimization for parallel execution. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data (SIGMOD'92). 9--18. Google ScholarDigital Library
Scott C. Gray, Fatma Ozcan, Hebert Pereyra, Bert van der Linden, and Adriana Zubiri. 2015. SQL-on-Hadoop without compromise: How Big SQL 3.0 from IBM represents an important leap forward for speed, portability and robust functionality in SQL-on-Hadoop solutions. Retrieved from http://public.dhe.ibm.com/common/ssi/ecm/sw/en/sww14019usen/SWW14019USEN.PDF.Google Scholar
Vanja Josifovski, Peter Schwarz, Laura Haas, and Eileen Lin. 2002. Garlic: A new flavor of federated query processing for DB2. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD'02). 524--532. Google ScholarDigital Library
Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. 2015. Impala: A modern, open-source SQL engine for hadoop. In Proceedings of the 2015 Conference on Innovative Data Systems Research (CIDR'15).Google Scholar
Donald Kossmann. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32, 4 (2000), 422--469. Google ScholarDigital Library
Taewhi Lee, Kisung Kim, and Hyoung-Joo Kim. 2012. Join processing using bloom filter in MapReduce. In Proceedings of the 2012 ACM Research in Applied Computation Symposium (RACS'12). 100--105. Google ScholarDigital Library
Zhe Li and Kenneth A. Ross. 1995. PERF join: An alternative to two-way semijoin and bloomjoin. In Proceedings of the 1995 International Conference on Information and Knowledge Management (CIKM'95). 137--144. Google ScholarDigital Library
Lothar F. Mackert and Guy M. Lohman. 1986. R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the 1986 International Conference on Very Large Data Bases (VLDB'86). 149--159. Google ScholarDigital Library
Michael V. Mannino, Paicheng Chu, and Thomas Sager. 1988. Statistical profile estimation in database systems. ACM Comput. Surv. 20, 3 (1988), 191--221. Google ScholarDigital Library
Dan McClary. 2014. Oracle Big Data SQL: One Fast Query, All Your Data. Retrieved from https://blogs.oracle.com/datawarehousing/entry/oracle_big_data_sql_one.Google Scholar
Loizos Michael, Wolfgang Nejdl, Odysseas Papapetrou, and Wolf Siberski. 2007. Improving distributed join efficiency with extended bloom filter operations. In Proceedings of the 2007 International Conference on Advanced Networking and Applications (AINA'07). 187--194. Google ScholarDigital Library
James K. Mullin. 1990. Optimal semijoins for distributed database systems. TSE 16, 5 (1990), 558--560. Google ScholarDigital Library
Oracle. 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database. Retrieved from http://www.oracle.com/technetwork/bdc/hadoop-loader/connectors-hdfs-wp-1674035.pdf.Google Scholar
Fatma Özcan, David Hoa, Kevin S. Beyer, Andrey Balmin, Chuan Jie Liu, and Yu Li. 2011. Emerging trends in the enterprise data analytics: Connecting Hadoop and DB2 warehouse. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1161--1164. Google ScholarDigital Library
M. Tamer Özsu and Patrick Valduriez. 2011. Principles of Distributed Database Systems (3rd ed.). Springer. Google ScholarDigital Library
Yannis Papakonstantinou, Ashish Gupta, Hector Garcia-Molina, and Jeffrey D. Ullman. 1995. A query translation scheme for rapid implementation of wrappers. In Proceedings of the 1995 International Conference on Deductive and Object-Oriented Databases (DOOD'95). 161--186. Google ScholarDigital Library
Pivotal. 2015. Pivotal HD: HAWQ - A True SQL Engine For Hadoop. Retrieved from http://www.gopivotal.com/sites/default/files/Hawq_WP_042313_FINAL.pdf.Google Scholar
Orestis Polychroniou, Rajkumar Sen, and Kenneth A. Ross. 2014. Track join: Distributed joins with minimal network traffic. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD'14). 1483--1494. Google ScholarDigital Library
Viswanath Poosala and Yannis E. Ioannidis. 1996. Estimation of query-result distribution and its application in parallel-join load balancing. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB'96). 448--459. Google ScholarDigital Library
Mary Tork Roth, Fatma Özcan, and Laura M. Haas. 1999. Cost models DO matter: Providing cost information for diverse data sources in a federated system. In Proceedings of the 1999 International Conference on Very Large Data Bases (VLDB'99). 599--610. Google ScholarDigital Library
Ming-Chien Shan, Rafi Ahmed, Jim Davis, Weimin Du, and William Kent. 1995. Pegasus: A heterogeneous information management system. In Modern Database Systems, Won Kim (Ed.). ACM Press/Addison-Wesley Publishing, 664--682. Google ScholarDigital Library
Amit P. Sheth and James A. Larson. 1990. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22, 3 (1990), 183--236. Google ScholarDigital Library
Lakshmikant Shrinivas, Sreenath Bodagala, Ramakrishna Varadarajan, Ariel Cary, Vivek Bharathan, and Chuck Bear. 2013. Materialization strategies in the vertica analytic database: Lessons learned. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE'13). 1196--1207. Google ScholarDigital Library
Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, and Rhonda Baldwin. 2014. Orca: A modular query optimizer architecture for big data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD'14). 337--348. Google ScholarDigital Library
Arun Swami and K. Bernhard Schiefer. 1994. On the estimation of join result sizes. In Proceedings of the 1994 International Conference on Extending Database Technology (EDBT'94). 287--300. Google ScholarDigital Library
Teradata. 2013. Teradata Connector for Hadoop. Retrieved from http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available.Google Scholar
Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: A warehousing solution over a map-reduce framework. PVLDB 2, 2 (2009), 1626--1629. Google ScholarDigital Library
Yuanyuan Tian, Tao Zou, Fatma Ozcan, Romulo Goncalves, and Hamid Pirahesh. 2015. Joins for hybrid warehouses: Exploiting massive parallelism in hadoop and enterprise data warehouses. In Proceedings of the 2015 International Conference on Extending Database Technology (EDBT'15). 373--384.Google Scholar
Anthony Tomasic, Louiqa Raschid, and Patrick Valduriez. 1998. Scaling access to heterogeneous data sources with DISCO. TKDE 10, 5 (1998), 808--823. Google ScholarDigital Library
Martin Traverso. 2013. Presto: Interacting with petabytes of data at Facebook. Retrieved from https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920.Google Scholar
Yu Xu, Pekka Kostamaa, Xin Zhou, and Liang Chen. 2008. Handling data skew in parallel joins in shared-nothing systems. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD'08). 1043--1052. Google ScholarDigital Library
Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 2012 USENIX Conference on Networked Systems Design and Implementation (NSDI'12). 15--28. Google ScholarDigital Library
Changchun Zhang, Lei Wu, and Jing Li. 2012. Optimizing distributed joins with bloom filters using MapReduce. In Proceedings of the 2012 International Conference on Computer Applications for Graphics, Grid Computing, and Industrial Environment. 88--95.Google ScholarCross Ref
Wei Zheng. 2015. Hybrid Hybrid Grace Hash Join, v1.0. Retrieved from https://cwiki.apache.org/confluence/display/Hive/Hybrid+Hybrid+Grace+Hash+Join,+v1.0#HybridHybridGraceHashJoin,v1.0-BloomFilter.Google Scholar

Index Terms

Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse
1. Information systems
  1. Data management systems
    1. Database management system engines
    2. Information integration
      1. Data warehouses
      2. Federated databases

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications Symposium

Big Data is currently conceptualized as data whose volume, variety or velocity impose significant difficulties in traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-...
Read More
CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance
Abstract
This paper presents CirroData, a high-performance SQL-on-Hadoop system designed for Big Data analytics workloads. As a home-grown enterprise-level online analytical processing (OLAP) system with more than seven-year research and development (R&D) ...
Read More
Flying KIWI: Design of Approximate Query Processing Engine for Interactive Data Analytics at Scale
BigDAS '15: Proceedings of the 2015 International Conference on Big Data Applications and Services

This paper introduces the design of hybrid SQL-on-Hadoop system, which supports dual-mode (interactive and deep) analytics. We present an architecture of approximate query processing engine using horizontal and vertical sampling of the original database ...
Read More

Comments

Login options

Check if you have access through your login credentials or your institution to get full access on this article.

Full Access

Get this Article

Published in
ACM Transactions on Database Systems Volume 41, Issue 4
Invited Paper from EDBT 2015, Invited Paper from PODS 2015 and Regular Papers
December 2016
309 pages
ISSN:0362-5915
EISSN:1557-4644
DOI:10.1145/3014437
Editor:
Christian S. Jensen
Aalborg University, Denmark
Issue’s Table of Contents
Copyright © 2016 ACM
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]
Sponsors
In-Cooperation
Publisher
Association for Computing Machinery
New York, NY, United States
Publication History
- Published: 2 November 2016
- Accepted: 1 July 2016
- Revised: 1 March 2016
- Received: 1 August 2015
Published in tods Volume 41, Issue 4

Permissions
Request permissions about this article.
Request Permissions

Check for updates
Author Tags
Bloom filter
Distributed join
SQL-on-Hadoop
cost model
federation
hybrid warehouse
join on Hadoop
query push-down
Qualifiers
- research-article
- Research
- Refereed
Conference
Funding Sources
Other Metrics
View Article Metrics

Article Metrics
- 10
  Total Citations
  View Citations
- 728
  Total Downloads
- Downloads (Last 12 months)24
- Downloads (Last 6 weeks)8
Other Metrics
View Author Metrics
Cited By
View all

PDF Format

View or Download as a PDF file.

PDF

eReader

View online with eReader.

eReader

Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

Flying KIWI: Design of Approximate Query Processing Engine for Interactive Data Analytics at Scale

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Other Metrics

Article Metrics

Other Metrics

Cited By

PDF Format

eReader

Digital Edition

Caption

Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse

ACM Transactions on Database Systems

Abstract

References

Cited By

Index Terms

Recommendations

Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware

CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance

Flying KIWI: Design of Approximate Query Processing Engine for Interactive Data Analytics at Scale

Comments

Login options

Full Access

Published in

Sponsors

In-Cooperation

Publisher

Publication History

Permissions

Check for updates

Author Tags

Qualifiers

Conference

Funding Sources

Article Metrics

Other Metrics

PDF Format

eReader

Digital Edition

Share this Publication link

Share on Social Media