A Study of SQL-on-Hadoop Systems

  • Conference paper
Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2014)

Abstract

Hadoop is now the de facto standard for storing and processing big data, not only unstructured data but also some structured data. As a result, providing SQL analysis functionality over the big data residing in HDFS is becoming increasingly important. Hive is a pioneering system that supports SQL-like analysis of the data in HDFS. However, the performance of Hive is not satisfactory for many applications. This has led to the rapid emergence of dozens of SQL-on-Hadoop systems that try to support interactive SQL query processing over the data stored in HDFS. This paper first gives a brief technical review of recent efforts on SQL-on-Hadoop systems. We then test and compare the performance of five representative SQL-on-Hadoop systems, using queries selected or derived from the TPC-DS benchmark. Based on the results, we show that such systems can benefit more from applying many of the parallel query processing techniques that have been widely studied in traditional MPP analytical databases.

Notes

  1. http://deke.ruc.edu.cn/yunyuyue.php

  2. Due to the page limit, more details about the experimental settings, results, and result analysis are available at http://deke.ruc.edu.cn/sqlonhadoop.

Acknowledgements

This work is partially supported by the National 863 High-tech Program (Grant No. 2012AA011001), the National Science Foundation of China under Grants No. 61472426 and No. 61170013, the Fundamental Research Funds for the Central Universities, the Research Funds of Renmin University of China (No. 14XNLQ06), the Chinese National “111” Project “Attracting International Talents in Data Engineering and Knowledge Engineering Research”, and the Scientific Research Foundation for the Returned Overseas Chinese Scholars, State Education Ministry.

Author information

Corresponding author

Correspondence to Yueguo Chen.

Copyright information

© 2014 Springer International Publishing Switzerland

About this paper

Cite this paper

Chen, Y. et al. (2014). A Study of SQL-on-Hadoop Systems. In: Zhan, J., Han, R., Weng, C. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2014. Lecture Notes in Computer Science, vol 8807. Springer, Cham. https://doi.org/10.1007/978-3-319-13021-7_12

  • DOI: https://doi.org/10.1007/978-3-319-13021-7_12

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-13020-0

  • Online ISBN: 978-3-319-13021-7

  • eBook Packages: Computer Science, Computer Science (R0)
