skip to main content
research-article

Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse

Published:02 November 2016Publication History
Skip Abstract Section

Abstract

The Hadoop Distributed File System (HDFS) has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. There are many applications that require correlating data stored in HDFS with EDW data, such as the analysis that associates click logs stored in HDFS with the sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side does not have efficient SQL support.

In this article, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize the data movement and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases. We further develop a sophisticated cost model for the various join algorithms and show that it can facilitate query optimization in the hybrid warehouse to correctly choose the right algorithm under different predicate and join selectivities.

References

  1. Sibel Adali, K. Seluk Candan, Yannis Papakonstantinou, and V. S. Subrahmanian. 1996. Query caching and optimization in distributed mediator systems. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD'96). 137--146. Google ScholarGoogle ScholarDigital LibraryDigital Library
  2. Foto Afrati and Jeffrey Ullman. 2011. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering (TKDE) 23, 9 (2011), 1282--1298. Google ScholarGoogle ScholarDigital LibraryDigital Library
  3. Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939--964. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, Till Westmann, Inci Cetindil, and Madhusudan Cheelangi. 2014. AsterixDB: A scalable, open source BDMS. PVLDB 7, 14 (2014), 1905--1916. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD'15). 1383--1394. Google ScholarGoogle ScholarDigital LibraryDigital Library
  6. Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and Erik Paulson. 2011. Efficient processing of data warehousing queries in a split execution environment. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1165--1176. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Chaitanya Baru, Gilles Fecteau, Ambuj Goyal, Hui-I Hsiao, Anant Jhingran, Sriram Padmanabhan, and Walter Wilson.1995. DB2 parallel edition. IBM Syst. J. 34, 2 (April 1995), 292--322. Google ScholarGoogle ScholarDigital LibraryDigital Library
  8. Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, and Yuanyuan Tian. 2010. A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 975--986. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarGoogle ScholarDigital LibraryDigital Library
  10. Ronnie Chaiken, Bob Jenkins, Paul Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB 1, 2 (2008), 1265--1276. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. David J. DeWitt and Robert H. Gerber. 1985. Multiprocessor hash-based join algorithms. In Proceedings of the 1985 International Conference on Very Large Data Bases (VLDB'85). 151--164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. David J. DeWitt, Alan Halverson, Rimma V. Nehme, Srinath Shankar, Josep Aguilar-Saborit, Artin Avanes, Miro Flasza, and Jim Gramling. 2013. Split query processing in PolyBase. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD'13). 1255--1266. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. 1984. Implementation techniques for main memory database systems. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD'84). 1--8. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri. 1992. Practical skew handling in parallel joins. In Proceedings of the 1992 International Conference on Very Large Data Bases (VLDB'92). 27--40. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. Ronald Fagin, Ravi Kumar, and D. Sivakumar. 2003. Comparing top K lists. In Proceedings of the 2003 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 28--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Doug Frazier. 2013. Dynamic Access: The SQL-H feature for the latest Teradata database leverages data in Hadoop. Retrieved from http://www.teradatamagazine.com/v13n02/Tech2Tech/Dynamic-Access.Google ScholarGoogle Scholar
  18. Sumit Ganguly, Waqar Hasan, and Ravi Krishnamurthy. 1992. Query optimization for parallel execution. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data (SIGMOD'92). 9--18. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Scott C. Gray, Fatma Ozcan, Hebert Pereyra, Bert van der Linden, and Adriana Zubiri. 2015. SQL-on-Hadoop without compromise: How Big SQL 3.0 from IBM represents an important leap forward for speed, portability and robust functionality in SQL-on-Hadoop solutions. Retrieved from http://public.dhe.ibm.com/common/ssi/ecm/sw/en/sww14019usen/SWW14019USEN.PDF.Google ScholarGoogle Scholar
  20. Vanja Josifovski, Peter Schwarz, Laura Haas, and Eileen Lin. 2002. Garlic: A new flavor of federated query processing for DB2. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD'02). 524--532. Google ScholarGoogle ScholarDigital LibraryDigital Library
  21. Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. 2015. Impala: A modern, open-source SQL engine for hadoop. In Proceedings of the 2015 Conference on Innovative Data Systems Research (CIDR'15).Google ScholarGoogle Scholar
  22. Donald Kossmann. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32, 4 (2000), 422--469. Google ScholarGoogle ScholarDigital LibraryDigital Library
  23. Taewhi Lee, Kisung Kim, and Hyoung-Joo Kim. 2012. Join processing using bloom filter in MapReduce. In Proceedings of the 2012 ACM Research in Applied Computation Symposium (RACS'12). 100--105. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. Zhe Li and Kenneth A. Ross. 1995. PERF join: An alternative to two-way semijoin and bloomjoin. In Proceedings of the 1995 International Conference on Information and Knowledge Management (CIKM'95). 137--144. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. Lothar F. Mackert and Guy M. Lohman. 1986. R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the 1986 International Conference on Very Large Data Bases (VLDB'86). 149--159. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. Michael V. Mannino, Paicheng Chu, and Thomas Sager. 1988. Statistical profile estimation in database systems. ACM Comput. Surv. 20, 3 (1988), 191--221. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Dan McClary. 2014. Oracle Big Data SQL: One Fast Query, All Your Data. Retrieved from https://blogs.oracle.com/datawarehousing/entry/oracle_big_data_sql_one.Google ScholarGoogle Scholar
  28. Loizos Michael, Wolfgang Nejdl, Odysseas Papapetrou, and Wolf Siberski. 2007. Improving distributed join efficiency with extended bloom filter operations. In Proceedings of the 2007 International Conference on Advanced Networking and Applications (AINA'07). 187--194. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. James K. Mullin. 1990. Optimal semijoins for distributed database systems. TSE 16, 5 (1990), 558--560. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Oracle. 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database. Retrieved from http://www.oracle.com/technetwork/bdc/hadoop-loader/connectors-hdfs-wp-1674035.pdf.Google ScholarGoogle Scholar
  31. Fatma Özcan, David Hoa, Kevin S. Beyer, Andrey Balmin, Chuan Jie Liu, and Yu Li. 2011. Emerging trends in the enterprise data analytics: Connecting Hadoop and DB2 warehouse. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1161--1164. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. M. Tamer Özsu and Patrick Valduriez. 2011. Principles of Distributed Database Systems (3rd ed.). Springer. Google ScholarGoogle ScholarDigital LibraryDigital Library
  33. Yannis Papakonstantinou, Ashish Gupta, Hector Garcia-Molina, and Jeffrey D. Ullman. 1995. A query translation scheme for rapid implementation of wrappers. In Proceedings of the 1995 International Conference on Deductive and Object-Oriented Databases (DOOD'95). 161--186. Google ScholarGoogle ScholarDigital LibraryDigital Library
  34. Pivotal. 2015. Pivotal HD: HAWQ - A True SQL Engine For Hadoop. Retrieved from http://www.gopivotal.com/sites/default/files/Hawq_WP_042313_FINAL.pdf.Google ScholarGoogle Scholar
  35. Orestis Polychroniou, Rajkumar Sen, and Kenneth A. Ross. 2014. Track join: Distributed joins with minimal network traffic. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD'14). 1483--1494. Google ScholarGoogle ScholarDigital LibraryDigital Library
  36. Viswanath Poosala and Yannis E. Ioannidis. 1996. Estimation of query-result distribution and its application in parallel-join load balancing. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB'96). 448--459. Google ScholarGoogle ScholarDigital LibraryDigital Library
  37. Mary Tork Roth, Fatma Özcan, and Laura M. Haas. 1999. Cost models DO matter: Providing cost information for diverse data sources in a federated system. In Proceedings of the 1999 International Conference on Very Large Data Bases (VLDB'99). 599--610. Google ScholarGoogle ScholarDigital LibraryDigital Library
  38. Ming-Chien Shan, Rafi Ahmed, Jim Davis, Weimin Du, and William Kent. 1995. Pegasus: A heterogeneous information management system. In Modern Database Systems, Won Kim (Ed.). ACM Press/Addison-Wesley Publishing, 664--682. Google ScholarGoogle ScholarDigital LibraryDigital Library
  39. Amit P. Sheth and James A. Larson. 1990. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22, 3 (1990), 183--236. Google ScholarGoogle ScholarDigital LibraryDigital Library
  40. Lakshmikant Shrinivas, Sreenath Bodagala, Ramakrishna Varadarajan, Ariel Cary, Vivek Bharathan, and Chuck Bear. 2013. Materialization strategies in the vertica analytic database: Lessons learned. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE'13). 1196--1207. Google ScholarGoogle ScholarDigital LibraryDigital Library
  41. Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, and Rhonda Baldwin. 2014. Orca: A modular query optimizer architecture for big data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD'14). 337--348. Google ScholarGoogle ScholarDigital LibraryDigital Library
  42. Arun Swami and K. Bernhard Schiefer. 1994. On the estimation of join result sizes. In Proceedings of the 1994 International Conference on Extending Database Technology (EDBT'94). 287--300. Google ScholarGoogle ScholarDigital LibraryDigital Library
  43. Teradata. 2013. Teradata Connector for Hadoop. Retrieved from http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available.Google ScholarGoogle Scholar
  44. Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: A warehousing solution over a map-reduce framework. PVLDB 2, 2 (2009), 1626--1629. Google ScholarGoogle ScholarDigital LibraryDigital Library
  45. Yuanyuan Tian, Tao Zou, Fatma Ozcan, Romulo Goncalves, and Hamid Pirahesh. 2015. Joins for hybrid warehouses: Exploiting massive parallelism in hadoop and enterprise data warehouses. In Proceedings of the 2015 International Conference on Extending Database Technology (EDBT'15). 373--384.Google ScholarGoogle Scholar
  46. Anthony Tomasic, Louiqa Raschid, and Patrick Valduriez. 1998. Scaling access to heterogeneous data sources with DISCO. TKDE 10, 5 (1998), 808--823. Google ScholarGoogle ScholarDigital LibraryDigital Library
  47. Martin Traverso. 2013. Presto: Interacting with petabytes of data at Facebook. Retrieved from https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920.Google ScholarGoogle Scholar
  48. Yu Xu, Pekka Kostamaa, Xin Zhou, and Liang Chen. 2008. Handling data skew in parallel joins in shared-nothing systems. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD'08). 1043--1052. Google ScholarGoogle ScholarDigital LibraryDigital Library
  49. Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 2012 USENIX Conference on Networked Systems Design and Implementation (NSDI'12). 15--28. Google ScholarGoogle ScholarDigital LibraryDigital Library
  50. Changchun Zhang, Lei Wu, and Jing Li. 2012. Optimizing distributed joins with bloom filters using MapReduce. In Proceedings of the 2012 International Conference on Computer Applications for Graphics, Grid Computing, and Industrial Environment. 88--95.Google ScholarGoogle ScholarCross RefCross Ref
  51. Wei Zheng. 2015. Hybrid Hybrid Grace Hash Join, v1.0. Retrieved from https://cwiki.apache.org/confluence/display/Hive/Hybrid+Hybrid+Grace+Hash+Join,+v1.0#HybridHybridGraceHashJoin,v1.0-BloomFilter.Google ScholarGoogle Scholar

Index Terms

  1. Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse

                  Recommendations

                  Comments

                  Login options

                  Check if you have access through your login credentials or your institution to get full access on this article.

                  Sign in

                  Full Access

                  • Published in

                    cover image ACM Transactions on Database Systems
                    ACM Transactions on Database Systems  Volume 41, Issue 4
                    Invited Paper from EDBT 2015, Invited Paper from PODS 2015 and Regular Papers
                    December 2016
                    309 pages
                    ISSN:0362-5915
                    EISSN:1557-4644
                    DOI:10.1145/3014437
                    Issue’s Table of Contents

                    Copyright © 2016 ACM

                    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

                    Publisher

                    Association for Computing Machinery

                    New York, NY, United States

                    Publication History

                    • Published: 2 November 2016
                    • Accepted: 1 July 2016
                    • Revised: 1 March 2016
                    • Received: 1 August 2015
                    Published in tods Volume 41, Issue 4

                    Permissions

                    Request permissions about this article.

                    Request Permissions

                    Check for updates

                    Qualifiers

                    • research-article
                    • Research
                    • Refereed

                  PDF Format

                  View or Download as a PDF file.

                  PDF

                  eReader

                  View online with eReader.

                  eReader