Abstract
To support various application scenarios, big data processing frameworks (BDPFs) such as Spark usually expose a large number of performance-critical configuration parameters. Since manual configuration is both labor-intensive and time-consuming, automatically tuning configuration parameters for BDPFs to achieve better performance has become an urgent need. To simultaneously address the corresponding challenges, such as the high-dimensional configuration space, we propose ATConf, a new black-box approach that automatically tunes the internal and external configuration parameters of BDPFs. Experimental results on our local distributed Spark cluster show that the best execution time achieved by ATConf is as much as 46.52% lower than that of the default configuration. Moreover, compared with four baselines under the same observation budget, ATConf further reduces the execution time relative to the default configuration by at least 4.10%.
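To make the black-box setting concrete, the sketch below shows a minimal tuning loop that searches a small configuration space under a fixed observation budget. It is an illustrative assumption, not ATConf's actual algorithm: the parameter subset is only a handful of real Spark knobs, the search strategy is plain random sampling, and `run_benchmark` is a hypothetical simulator standing in for launching a real Spark workload and measuring its runtime.

```python
import random

# Candidate values for a few real Spark parameters (an illustrative
# subset; ATConf tunes a much larger, high-dimensional space).
SPACE = {
    "spark.executor.cores": [1, 2, 4, 8],
    "spark.executor.memory": ["1g", "2g", "4g", "8g"],
    "spark.sql.shuffle.partitions": [50, 100, 200, 400],
}

def run_benchmark(config):
    """Hypothetical stand-in for one 'observation': running a workload
    with `config` and measuring its execution time in seconds.
    A real tuner would instead submit a Spark job with these settings."""
    cores = config["spark.executor.cores"]
    mem_gb = int(config["spark.executor.memory"].rstrip("g"))
    parts = config["spark.sql.shuffle.partitions"]
    # Simulated cost model: more cores/memory help; extreme
    # shuffle-partition counts hurt.
    return 100.0 / cores + 50.0 / mem_gb + abs(parts - 200) * 0.05

def black_box_tune(budget=20, seed=0):
    """Random-search black-box tuning under a fixed observation budget."""
    rng = random.Random(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(budget):
        cfg = {k: rng.choice(v) for k, v in SPACE.items()}
        t = run_benchmark(cfg)  # one observation of the black box
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time
```

The "observation budget" here corresponds to the constraint on observation times in the experiments: each candidate evaluation requires actually running the workload, so tuners are compared by how good a configuration they find within the same number of runs.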
Data availability
The data used to support the findings of this study are available from the corresponding author upon request.
Acknowledgements
The authors would like to thank the anonymous reviewers for their valuable comments and suggestions. This work was supported by the National Natural Science Foundation of China under Grants 61902440, 61872002 and 61802448.
Funding
This work was supported by the National Natural Science Foundation of China under Grants 61902440, 61872002 and 61802448.
Author information
Authors and Affiliations
Contributions
HD contributed significantly to the data analysis and manuscript writing; KW performed most of the experiments and drew the figures; YZ and PC contributed substantially to the conception of the study.
Corresponding author
Ethics declarations
Competing Interests
The authors have no relevant financial or non-financial interests to disclose.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Dou, H., Wang, K., Zhang, Y. et al. ATConf: auto-tuning high dimensional configuration parameters for big data processing frameworks. Cluster Comput 26, 2737–2755 (2023). https://doi.org/10.1007/s10586-022-03767-0