Abstract
Hadoop is now the de facto standard for storing and processing big data, not only unstructured data but also some structured data. As a result, providing SQL analysis functionality over the big data residing in HDFS is becoming more and more important. Hive is a pioneering system that supports SQL-like analysis of the data in HDFS. However, the performance of Hive is not satisfactory for many applications. This has led to the rapid emergence of dozens of SQL-on-Hadoop systems that try to support interactive SQL query processing over the data stored in HDFS. This paper first gives a brief technical review of recent efforts on SQL-on-Hadoop systems. We then test and compare the performance of five representative SQL-on-Hadoop systems, using queries selected or derived from the TPC-DS benchmark. Based on the results, we show that such systems could benefit further from applying many of the parallel query processing techniques that have been widely studied in traditional MPP analytical databases.
Notes
Due to the page limit, more details about the experimental settings, results, and result analysis are available at http://deke.ruc.edu.cn/sqlonhadoop.
Acknowledgements
This work is partially supported by the National 863 High-tech Program (Grant No. 2012AA011001), the National Science Foundation of China under Grants No. 61472426 and No. 61170013, the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (No. 14XNLQ06), the Chinese National “111” Project “Attracting International Talents in Data Engineering and Knowledge Engineering Research”, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.
Copyright information
© 2014 Springer International Publishing Switzerland
About this paper
Cite this paper
Chen, Y. et al. (2014). A Study of SQL-on-Hadoop Systems. In: Zhan, J., Han, R., Weng, C. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2014. Lecture Notes in Computer Science, vol 8807. Springer, Cham. https://doi.org/10.1007/978-3-319-13021-7_12
DOI: https://doi.org/10.1007/978-3-319-13021-7_12
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-13020-0
Online ISBN: 978-3-319-13021-7
eBook Packages: Computer Science, Computer Science (R0)