Evolution from Shark to Spark SQL: Preliminary Analysis and Qualitative Evaluation

Tian, Xinhui; Lu, Gang; Zhou, Xiexuan; Li, Jingwei

doi:10.1007/978-3-319-29006-5_6


Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9495)

Included in the following conference series:

BPOE

Abstract

Spark is a general distributed framework built on an abstraction called resilient distributed datasets (RDDs). Database analysis is one of the main kinds of workloads supported on Spark. The SQL component on Spark has evolved from Shark to Spark SQL, while the core components of Spark have also evolved considerably compared with the original version. We analyzed the aspects in which Spark has made efforts to support many workloads efficiently, and whether these changes allow its SQL support to achieve better performance.

Acknowledgements

This work was supported by the National High Technology Research and Development Program of China (Grant No. 2015AA015308), the Major Program of National Natural Science Foundation of China (Grant No. 61432006), and the Key Technology Research and Development Programs of Guangdong Province, China (Grant No. 2015B010108006).

Author information

Authors and Affiliations

State Key Laboratory of Computer Architecture, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
Xinhui Tian, Gang Lu & Xiexuan Zhou
University of Chinese Academy of Sciences, Beijing, China
Xinhui Tian, Gang Lu & Xiexuan Zhou
Beijing Academy of Frontier Science and Technology, Beijing, China
Jingwei Li

Corresponding author

Correspondence to Xinhui Tian.

Editor information

Editors and Affiliations

Institute of Computing, Chinese Academy of Sciences, Beijing, China
Jianfeng Zhan
ICT, Chinese Academy of Sciences, Beijing, China
Rui Han
FB12 - DBIS (5. Stock), Goethe Universität Frankfurt, Frankfurt, Hessen, Germany
Roberto V. Zicari

Cite this paper

Tian, X., Lu, G., Zhou, X., Li, J. (2016). Evolution from Shark to Spark SQL: Preliminary Analysis and Qualitative Evaluation. In: Zhan, J., Han, R., Zicari, R. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2015. Lecture Notes in Computer Science, vol 9495. Springer, Cham. https://doi.org/10.1007/978-3-319-29006-5_6

DOI: https://doi.org/10.1007/978-3-319-29006-5_6
Published: 09 January 2016
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-29005-8
Online ISBN: 978-3-319-29006-5
eBook Packages: Computer Science, Computer Science (R0)
