Abstract
The Hadoop Distributed File System (HDFS) has become an important data repository in the enterprise as the center for all business analytics, from SQL queries and machine learning to reporting. At the same time, enterprise data warehouses (EDWs) continue to support critical business analytics. This has created the need for a new generation of a special federation between Hadoop-like big data platforms and EDWs, which we call the hybrid warehouse. There are many applications that require correlating data stored in HDFS with EDW data, such as the analysis that associates click logs stored in HDFS with the sales data stored in the database. All existing solutions reach out to HDFS and read the data into the EDW to perform the joins, assuming that the Hadoop side does not have efficient SQL support.
In this article, we show that it is actually better to do most data processing on the HDFS side, provided that we can leverage a sophisticated execution engine for joins on the Hadoop side. We identify the best hybrid warehouse architecture by studying various algorithms to join database and HDFS tables. We utilize Bloom filters to minimize the data movement and exploit the massive parallelism in both systems to the fullest extent possible. We describe a new zigzag join algorithm and show that it is a robust join algorithm for hybrid warehouses that performs well in almost all cases. We further develop a sophisticated cost model for the various join algorithms and show that it can facilitate query optimization in the hybrid warehouse to correctly choose the right algorithm under different predicate and join selectivities.
- Sibel Adali, K. Seluk Candan, Yannis Papakonstantinou, and V. S. Subrahmanian. 1996. Query caching and optimization in distributed mediator systems. In Proceedings of the 1996 ACM SIGMOD International Conference on Management of Data (SIGMOD'96). 137--146. Google ScholarDigital Library
- Foto Afrati and Jeffrey Ullman. 2011. Optimizing multiway joins in a map-reduce environment. IEEE Transactions on Knowledge and Data Engineering (TKDE) 23, 9 (2011), 1282--1298. Google ScholarDigital Library
- Alexander Alexandrov, Rico Bergmann, Stephan Ewen, Johann-Christoph Freytag, Fabian Hueske, Arvid Heise, Odej Kao, Marcus Leich, Ulf Leser, Volker Markl, Felix Naumann, Mathias Peters, Astrid Rheinländer, Matthias J. Sax, Sebastian Schelter, Mareike Höger, Kostas Tzoumas, and Daniel Warneke. 2014. The stratosphere platform for big data analytics. VLDB J. 23, 6 (2014), 939--964. Google ScholarDigital Library
- Sattam Alsubaiee, Yasser Altowim, Hotham Altwaijry, Alexander Behm, Vinayak Borkar, Yingyi Bu, Michael Carey, Khurram Faraaz, Eugenia Gabrielova, Raman Grover, Zachary Heilbron, Young-Seok Kim, Chen Li, Guangqiang Li, Ji Mahn Ok, Nicola Onose, Pouria Pirzadeh, Vassilis Tsotras, Rares Vernica, Jian Wen, Till Westmann, Inci Cetindil, and Madhusudan Cheelangi. 2014. AsterixDB: A scalable, open source BDMS. PVLDB 7, 14 (2014), 1905--1916. Google ScholarDigital Library
- Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational data processing in spark. In Proceedings of the 2015 ACM SIGMOD International Conference on Management of Data (SIGMOD'15). 1383--1394. Google ScholarDigital Library
- Kamil Bajda-Pawlikowski, Daniel J. Abadi, Avi Silberschatz, and Erik Paulson. 2011. Efficient processing of data warehousing queries in a split execution environment. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1165--1176. Google ScholarDigital Library
- Chaitanya Baru, Gilles Fecteau, Ambuj Goyal, Hui-I Hsiao, Anant Jhingran, Sriram Padmanabhan, and Walter Wilson.1995. DB2 parallel edition. IBM Syst. J. 34, 2 (April 1995), 292--322. Google ScholarDigital Library
- Spyros Blanas, Jignesh M. Patel, Vuk Ercegovac, Jun Rao, Eugene J. Shekita, and Yuanyuan Tian. 2010. A comparison of join algorithms for log processing in MapReduce. In Proceedings of the 2010 ACM SIGMOD International Conference on Management of Data (SIGMOD'10). 975--986. Google ScholarDigital Library
- Burton H. Bloom. 1970. Space/time trade-offs in hash coding with allowable errors. Commun. ACM 13, 7 (1970), 422--426. Google ScholarDigital Library
- Ronnie Chaiken, Bob Jenkins, Paul Larson, Bill Ramsey, Darren Shakib, Simon Weaver, and Jingren Zhou. 2008. SCOPE: Easy and efficient parallel processing of massive data sets. PVLDB 1, 2 (2008), 1265--1276. Google ScholarDigital Library
- Jeffrey Dean and Sanjay Ghemawat. 2008. MapReduce: Simplified data processing on large clusters. Commun. ACM 51, 1 (2008), 107--113. Google ScholarDigital Library
- David J. DeWitt and Robert H. Gerber. 1985. Multiprocessor hash-based join algorithms. In Proceedings of the 1985 International Conference on Very Large Data Bases (VLDB'85). 151--164. Google ScholarDigital Library
- David J. DeWitt, Alan Halverson, Rimma V. Nehme, Srinath Shankar, Josep Aguilar-Saborit, Artin Avanes, Miro Flasza, and Jim Gramling. 2013. Split query processing in PolyBase. In Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data (SIGMOD'13). 1255--1266. Google ScholarDigital Library
- David J. DeWitt, Randy H. Katz, Frank Olken, Leonard D. Shapiro, Michael R. Stonebraker, and David A. Wood. 1984. Implementation techniques for main memory database systems. In Proceedings of the 1984 ACM SIGMOD International Conference on Management of Data (SIGMOD'84). 1--8. Google ScholarDigital Library
- David J. DeWitt, Jeffrey F. Naughton, Donovan A. Schneider, and S. Seshadri. 1992. Practical skew handling in parallel joins. In Proceedings of the 1992 International Conference on Very Large Data Bases (VLDB'92). 27--40. Google ScholarDigital Library
- Ronald Fagin, Ravi Kumar, and D. Sivakumar. 2003. Comparing top K lists. In Proceedings of the 2003 Annual ACM-SIAM Symposium on Discrete Algorithms (SODA'03). 28--36. Google ScholarDigital Library
- Doug Frazier. 2013. Dynamic Access: The SQL-H feature for the latest Teradata database leverages data in Hadoop. Retrieved from http://www.teradatamagazine.com/v13n02/Tech2Tech/Dynamic-Access.Google Scholar
- Sumit Ganguly, Waqar Hasan, and Ravi Krishnamurthy. 1992. Query optimization for parallel execution. In Proceedings of the 1992 ACM SIGMOD International Conference on Management of Data (SIGMOD'92). 9--18. Google ScholarDigital Library
- Scott C. Gray, Fatma Ozcan, Hebert Pereyra, Bert van der Linden, and Adriana Zubiri. 2015. SQL-on-Hadoop without compromise: How Big SQL 3.0 from IBM represents an important leap forward for speed, portability and robust functionality in SQL-on-Hadoop solutions. Retrieved from http://public.dhe.ibm.com/common/ssi/ecm/sw/en/sww14019usen/SWW14019USEN.PDF.Google Scholar
- Vanja Josifovski, Peter Schwarz, Laura Haas, and Eileen Lin. 2002. Garlic: A new flavor of federated query processing for DB2. In Proceedings of the 2002 ACM SIGMOD International Conference on Management of Data (SIGMOD'02). 524--532. Google ScholarDigital Library
- Marcel Kornacker, Alexander Behm, Victor Bittorf, Taras Bobrovytsky, Casey Ching, Alan Choi, Justin Erickson, Martin Grund, Daniel Hecht, Matthew Jacobs, Ishaan Joshi, Lenni Kuff, Dileep Kumar, Alex Leblang, Nong Li, Ippokratis Pandis, Henry Robinson, David Rorke, Silvius Rus, John Russell, Dimitris Tsirogiannis, Skye Wanderman-Milne, and Michael Yoder. 2015. Impala: A modern, open-source SQL engine for hadoop. In Proceedings of the 2015 Conference on Innovative Data Systems Research (CIDR'15).Google Scholar
- Donald Kossmann. 2000. The state of the art in distributed query processing. ACM Comput. Surv. 32, 4 (2000), 422--469. Google ScholarDigital Library
- Taewhi Lee, Kisung Kim, and Hyoung-Joo Kim. 2012. Join processing using bloom filter in MapReduce. In Proceedings of the 2012 ACM Research in Applied Computation Symposium (RACS'12). 100--105. Google ScholarDigital Library
- Zhe Li and Kenneth A. Ross. 1995. PERF join: An alternative to two-way semijoin and bloomjoin. In Proceedings of the 1995 International Conference on Information and Knowledge Management (CIKM'95). 137--144. Google ScholarDigital Library
- Lothar F. Mackert and Guy M. Lohman. 1986. R* optimizer validation and performance evaluation for distributed queries. In Proceedings of the 1986 International Conference on Very Large Data Bases (VLDB'86). 149--159. Google ScholarDigital Library
- Michael V. Mannino, Paicheng Chu, and Thomas Sager. 1988. Statistical profile estimation in database systems. ACM Comput. Surv. 20, 3 (1988), 191--221. Google ScholarDigital Library
- Dan McClary. 2014. Oracle Big Data SQL: One Fast Query, All Your Data. Retrieved from https://blogs.oracle.com/datawarehousing/entry/oracle_big_data_sql_one.Google Scholar
- Loizos Michael, Wolfgang Nejdl, Odysseas Papapetrou, and Wolf Siberski. 2007. Improving distributed join efficiency with extended bloom filter operations. In Proceedings of the 2007 International Conference on Advanced Networking and Applications (AINA'07). 187--194. Google ScholarDigital Library
- James K. Mullin. 1990. Optimal semijoins for distributed database systems. TSE 16, 5 (1990), 558--560. Google ScholarDigital Library
- Oracle. 2012. High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database. Retrieved from http://www.oracle.com/technetwork/bdc/hadoop-loader/connectors-hdfs-wp-1674035.pdf.Google Scholar
- Fatma Özcan, David Hoa, Kevin S. Beyer, Andrey Balmin, Chuan Jie Liu, and Yu Li. 2011. Emerging trends in the enterprise data analytics: Connecting Hadoop and DB2 warehouse. In Proceedings of the 2011 ACM SIGMOD International Conference on Management of Data (SIGMOD'11). 1161--1164. Google ScholarDigital Library
- M. Tamer Özsu and Patrick Valduriez. 2011. Principles of Distributed Database Systems (3rd ed.). Springer. Google ScholarDigital Library
- Yannis Papakonstantinou, Ashish Gupta, Hector Garcia-Molina, and Jeffrey D. Ullman. 1995. A query translation scheme for rapid implementation of wrappers. In Proceedings of the 1995 International Conference on Deductive and Object-Oriented Databases (DOOD'95). 161--186. Google ScholarDigital Library
- Pivotal. 2015. Pivotal HD: HAWQ - A True SQL Engine For Hadoop. Retrieved from http://www.gopivotal.com/sites/default/files/Hawq_WP_042313_FINAL.pdf.Google Scholar
- Orestis Polychroniou, Rajkumar Sen, and Kenneth A. Ross. 2014. Track join: Distributed joins with minimal network traffic. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD'14). 1483--1494. Google ScholarDigital Library
- Viswanath Poosala and Yannis E. Ioannidis. 1996. Estimation of query-result distribution and its application in parallel-join load balancing. In Proceedings of the 1996 International Conference on Very Large Data Bases (VLDB'96). 448--459. Google ScholarDigital Library
- Mary Tork Roth, Fatma Özcan, and Laura M. Haas. 1999. Cost models DO matter: Providing cost information for diverse data sources in a federated system. In Proceedings of the 1999 International Conference on Very Large Data Bases (VLDB'99). 599--610. Google ScholarDigital Library
- Ming-Chien Shan, Rafi Ahmed, Jim Davis, Weimin Du, and William Kent. 1995. Pegasus: A heterogeneous information management system. In Modern Database Systems, Won Kim (Ed.). ACM Press/Addison-Wesley Publishing, 664--682. Google ScholarDigital Library
- Amit P. Sheth and James A. Larson. 1990. Federated database systems for managing distributed, heterogeneous, and autonomous databases. ACM Comput. Surv. 22, 3 (1990), 183--236. Google ScholarDigital Library
- Lakshmikant Shrinivas, Sreenath Bodagala, Ramakrishna Varadarajan, Ariel Cary, Vivek Bharathan, and Chuck Bear. 2013. Materialization strategies in the vertica analytic database: Lessons learned. In Proceedings of the 2013 IEEE International Conference on Data Engineering (ICDE'13). 1196--1207. Google ScholarDigital Library
- Mohamed A. Soliman, Lyublena Antova, Venkatesh Raghavan, Amr El-Helw, Zhongxian Gu, Entong Shen, George C. Caragea, Carlos Garcia-Alvarado, Foyzur Rahman, Michalis Petropoulos, Florian Waas, Sivaramakrishnan Narayanan, Konstantinos Krikellas, and Rhonda Baldwin. 2014. Orca: A modular query optimizer architecture for big data. In Proceedings of the 2014 ACM SIGMOD International Conference on Management of Data (SIGMOD'14). 337--348. Google ScholarDigital Library
- Arun Swami and K. Bernhard Schiefer. 1994. On the estimation of join result sizes. In Proceedings of the 1994 International Conference on Extending Database Technology (EDBT'94). 287--300. Google ScholarDigital Library
- Teradata. 2013. Teradata Connector for Hadoop. Retrieved from http://developer.teradata.com/connectivity/articles/teradata-connector-for-hadoop-now-available.Google Scholar
- Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff, and Raghotham Murthy. 2009. Hive: A warehousing solution over a map-reduce framework. PVLDB 2, 2 (2009), 1626--1629. Google ScholarDigital Library
- Yuanyuan Tian, Tao Zou, Fatma Ozcan, Romulo Goncalves, and Hamid Pirahesh. 2015. Joins for hybrid warehouses: Exploiting massive parallelism in hadoop and enterprise data warehouses. In Proceedings of the 2015 International Conference on Extending Database Technology (EDBT'15). 373--384.Google Scholar
- Anthony Tomasic, Louiqa Raschid, and Patrick Valduriez. 1998. Scaling access to heterogeneous data sources with DISCO. TKDE 10, 5 (1998), 808--823. Google ScholarDigital Library
- Martin Traverso. 2013. Presto: Interacting with petabytes of data at Facebook. Retrieved from https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920.Google Scholar
- Yu Xu, Pekka Kostamaa, Xin Zhou, and Liang Chen. 2008. Handling data skew in parallel joins in shared-nothing systems. In Proceedings of the 2008 ACM SIGMOD International Conference on Management of Data (SIGMOD'08). 1043--1052. Google ScholarDigital Library
- Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin, Scott Shenker, and Ion Stoica. 2012. Resilient distributed datasets: A fault-tolerant abstraction for in-memory cluster computing. In Proceedings of the 2012 USENIX Conference on Networked Systems Design and Implementation (NSDI'12). 15--28. Google ScholarDigital Library
- Changchun Zhang, Lei Wu, and Jing Li. 2012. Optimizing distributed joins with bloom filters using MapReduce. In Proceedings of the 2012 International Conference on Computer Applications for Graphics, Grid Computing, and Industrial Environment. 88--95.Google ScholarCross Ref
- Wei Zheng. 2015. Hybrid Hybrid Grace Hash Join, v1.0. Retrieved from https://cwiki.apache.org/confluence/display/Hive/Hybrid+Hybrid+Grace+Hash+Join,+v1.0#HybridHybridGraceHashJoin,v1.0-BloomFilter.Google Scholar
Index Terms
- Building a Hybrid Warehouse: Efficient Joins between Data Stored in HDFS and Enterprise Warehouse
Recommendations
Evaluating SQL-on-Hadoop for Big Data Warehousing on Not-So-Good Hardware
IDEAS '17: Proceedings of the 21st International Database Engineering & Applications SymposiumBig Data is currently conceptualized as data whose volume, variety or velocity impose significant difficulties in traditional techniques and technologies. Big Data Warehousing is emerging as a new concept for Big Data analytics. In this context, SQL-on-...
CirroData: Yet Another SQL-on-Hadoop Data Analytics Engine with High Performance
AbstractThis paper presents CirroData, a high-performance SQL-on-Hadoop system designed for Big Data analytics workloads. As a home-grown enterprise-level online analytical processing (OLAP) system with more than seven-year research and development (R&D) ...
Flying KIWI: Design of Approximate Query Processing Engine for Interactive Data Analytics at Scale
BigDAS '15: Proceedings of the 2015 International Conference on Big Data Applications and ServicesThis paper introduces the design of hybrid SQL-on-Hadoop system, which supports dual-mode (interactive and deep) analytics. We present an architecture of approximate query processing engine using horizontal and vertical sampling of the original database ...
Comments