Evolution from Shark to Spark SQL: Preliminary Analysis and Qualitative Evaluation

  • Conference paper
Big Data Benchmarks, Performance Optimization, and Emerging Hardware (BPOE 2015)

Part of the book series: Lecture Notes in Computer Science (LNISA, volume 9495)

Abstract

Spark is a general-purpose distributed computing framework built on an abstraction called resilient distributed datasets (RDDs). Database analysis is one of the main kinds of workloads supported on Spark. The SQL component of Spark has evolved from Shark to Spark SQL, and Spark's core components have also changed substantially since the original version. We analyze the aspects in which Spark has made efforts to support many workloads efficiently, and whether these changes help its SQL support achieve better performance.

Acknowledgements

This work was supported by the National High Technology Research and Development Program of China (Grant No. 2015AA015308), the Major Program of National Natural Science Foundation of China (Grant No. 61432006), and the Key Technology Research and Development Programs of Guangdong Province, China (Grant No. 2015B010108006).

Author information

Corresponding author

Correspondence to Xinhui Tian.

Copyright information

© 2016 Springer International Publishing Switzerland

About this paper

Cite this paper

Tian, X., Lu, G., Zhou, X., Li, J. (2016). Evolution from Shark to Spark SQL: Preliminary Analysis and Qualitative Evaluation. In: Zhan, J., Han, R., Zicari, R. (eds) Big Data Benchmarks, Performance Optimization, and Emerging Hardware. BPOE 2015. Lecture Notes in Computer Science, vol 9495. Springer, Cham. https://doi.org/10.1007/978-3-319-29006-5_6

  • DOI: https://doi.org/10.1007/978-3-319-29006-5_6

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-29005-8

  • Online ISBN: 978-3-319-29006-5

  • eBook Packages: Computer Science, Computer Science (R0)
