Evaluating Presto and SparkSQL with TPC-DS

  • Conference paper

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13248)

Abstract

The unified management and analysis of relational and non-relational data is an emerging trend in both database technology and big data applications. New relational computing engines such as SparkSQL and Presto support parallel processing and analysis of distributed relational and non-relational data, which can effectively improve performance in data analysis and data quality maintenance scenarios. This work compares the performance of Presto and SparkSQL under the same scenario, using TPC-DS as the benchmark. TPC-DS is a benchmark developed by the Transaction Processing Performance Council (TPC). It covers complex applications such as data statistics, report generation, online query, and data mining, and it includes data skew, so it can effectively reflect system performance in real scenarios. In our tests, Presto outperformed SparkSQL in many query scenarios; in the most significant case, Presto performed three times better than SparkSQL.

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004401), National Natural Science Foundation of China (No. 61972402, 61972275, 61732014 and 62072459).

Author information

Corresponding author

Correspondence to Yinhao Hong.

Copyright information

© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper

Cite this paper

Hong, Y., Du, S., Leng, J. (2022). Evaluating Presto and SparkSQL with TPC-DS. In: Rage, U.K., Goyal, V., Reddy, P.K. (eds) Database Systems for Advanced Applications. DASFAA 2022 International Workshops. DASFAA 2022. Lecture Notes in Computer Science, vol 13248. Springer, Cham. https://doi.org/10.1007/978-3-031-11217-1_23

  • DOI: https://doi.org/10.1007/978-3-031-11217-1_23

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-11216-4

  • Online ISBN: 978-3-031-11217-1

  • eBook Packages: Computer Science, Computer Science (R0)
