Evaluating Presto and SparkSQL with TPC-DS

Hong, Yinhao; Du, Sheng; Leng, Jianquan

doi:10.1007/978-3-031-11217-1_23

Conference paper
First Online: 16 July 2022

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13248)

Abstract

Unified management and analysis of relational and non-relational data is an emerging trend in both database technology and big data applications. Newer relational computing engines, such as SparkSQL and Presto, process and analyze distributed relational and non-relational data in parallel, which can improve performance in data analysis and data quality maintenance scenarios. This work compares the performance of Presto and SparkSQL on the same workload, using TPC-DS as the benchmark. TPC-DS is a benchmark developed by the Transaction Processing Performance Council (TPC). It covers complex workloads such as data statistics, report generation, online query, and data mining, and it includes data skew, so it can reflect system performance in realistic scenarios. In the test results, Presto outperformed SparkSQL in many query scenarios; in the most significant case, Presto performed three times better than SparkSQL.
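As a rough illustration of how a per-query comparison such as "three times better" can be expressed, the sketch below computes speedup ratios from elapsed times. The query names and timings are hypothetical placeholders, not measurements from this paper, and the helper `speedups` is an illustrative name rather than part of Presto, SparkSQL, or the TPC-DS toolkit.

```python
from statistics import geometric_mean


def speedups(presto_s, spark_s):
    """Return {query: spark_time / presto_time} for queries present in both runs.

    A ratio above 1.0 means Presto finished that query faster.
    """
    common = sorted(presto_s.keys() & spark_s.keys())
    return {q: spark_s[q] / presto_s[q] for q in common}


# Hypothetical elapsed times in seconds (placeholders, not results from the paper).
presto_s = {"q1": 4.0, "q2": 10.0, "q3": 6.0}
spark_s = {"q1": 12.0, "q2": 8.0, "q3": 6.0}

ratios = speedups(presto_s, spark_s)

# Query with the largest advantage for Presto.
best_query = max(ratios, key=ratios.get)

# Queries where Presto was strictly faster.
presto_wins = [q for q, r in ratios.items() if r > 1.0]

# The geometric mean is a common way to summarize ratios across queries.
overall = geometric_mean(list(ratios.values()))

print(ratios, best_query, presto_wins, round(overall, 3))
```

Under this reading, "three times better" corresponds to a ratio of 3.0 on a single query; it says nothing about the aggregate, which is why the sketch reports the per-query ratios and a summary separately.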

This work is supported by the National Key Research and Development Program of China (No. 2018YFB1004401) and the National Natural Science Foundation of China (Nos. 61972402, 61972275, 61732014, and 62072459).

Author information

Authors and Affiliations

Renmin University of China, Beijing, China
Yinhao Hong
Beijing Kingbase Information Technology Co., Ltd., Beijing, China
Yinhao Hong, Sheng Du & Jianquan Leng

Corresponding author

Correspondence to Yinhao Hong.

Editor information

Editors and Affiliations

University of Aizu, Aizu, Japan
Uday Kiran Rage
Indraprastha Institute of Information Technology, Delhi, India
Vikram Goyal
Data Sciences and Analytics Center, International Institute of Information Technology, Hyderabad, Telangana, India
P. Krishna Reddy

About this paper

Cite this paper

Hong, Y., Du, S., Leng, J. (2022). Evaluating Presto and SparkSQL with TPC-DS. In: Rage, U.K., Goyal, V., Reddy, P.K. (eds) Database Systems for Advanced Applications. DASFAA 2022 International Workshops. DASFAA 2022. Lecture Notes in Computer Science, vol 13248. Springer, Cham. https://doi.org/10.1007/978-3-031-11217-1_23

DOI: https://doi.org/10.1007/978-3-031-11217-1_23
Published: 16 July 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-11216-4
Online ISBN: 978-3-031-11217-1
eBook Packages: Computer Science, Computer Science (R0)