
ATConf: auto-tuning high dimensional configuration parameters for big data processing frameworks

Published in: Cluster Computing

Abstract

To support various application scenarios, big data processing frameworks (BDPFs) such as Spark usually expose a large number of performance-critical configuration parameters. Since configuring them manually is both labor-intensive and time-consuming, automatically tuning configuration parameters for BDPFs to achieve better performance has become an urgent need. To address the corresponding challenges, such as the high-dimensional configuration space, we propose ATConf, a new black-box approach that automatically tunes both the internal and external configuration parameters of BDPFs. Experimental results on our local distributed Spark cluster show that the best execution time achieved by ATConf is as much as 46.52% lower than that of the default configuration. Moreover, compared with four baselines, ATConf further reduces the relative execution time over the default by at least 4.10% under the same constraint on the number of observations.
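To illustrate the black-box setting the abstract describes, the sketch below implements a minimal random-search tuner over a hypothetical Spark parameter space. This is not ATConf's algorithm (which is not reproduced on this page); the parameter names, value ranges, and the synthetic workload are assumptions for illustration only. Each "observation" corresponds to one real run of the workload under a candidate configuration.

```python
import random

# Hypothetical search space: a few Spark parameters and value ranges.
# ATConf's actual parameter set and ranges are not given here.
SEARCH_SPACE = {
    "spark.executor.memory_gb": (1, 16),        # integer range
    "spark.executor.cores": (1, 8),             # integer range
    "spark.sql.shuffle.partitions": (8, 400),   # integer range
    "spark.memory.fraction": (0.3, 0.9),        # float range
}

def sample_config(space, rng):
    """Draw one random configuration from the search space."""
    cfg = {}
    for name, (lo, hi) in space.items():
        if isinstance(lo, int) and isinstance(hi, int):
            cfg[name] = rng.randint(lo, hi)
        else:
            cfg[name] = rng.uniform(lo, hi)
    return cfg

def tune(run_workload, space, budget, seed=0):
    """Black-box tuning loop: sample configurations, measure execution
    time, and keep the best seen within a fixed observation budget."""
    rng = random.Random(seed)
    best_cfg, best_time = None, float("inf")
    for _ in range(budget):
        cfg = sample_config(space, rng)
        t = run_workload(cfg)  # one observation = one real workload run
        if t < best_time:
            best_cfg, best_time = cfg, t
    return best_cfg, best_time

# Stand-in for submitting a real Spark job: a synthetic cost model that
# rewards more memory/cores and a memory fraction near 0.6.
def fake_workload(cfg):
    return (100.0 / cfg["spark.executor.memory_gb"]
            + 50.0 / cfg["spark.executor.cores"]
            + abs(cfg["spark.memory.fraction"] - 0.6) * 20)

best_cfg, best_time = tune(fake_workload, SEARCH_SPACE, budget=50)
```

In practice, `run_workload` would deploy the configuration to the cluster and time the job, and the naive random sampler would be replaced by a model-guided search (e.g., Bayesian optimization) to cope with the high-dimensional space under a tight observation budget.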


Data availability

The data used to support the findings of this study are available from the corresponding author upon request.


Acknowledgements

The authors would like to thank the anonymous reviewers for their valuable comments and suggestions.

Funding

This work was supported by the National Natural Science Foundation of China under Grants 61902440, 61872002 and 61802448.

Author information


Contributions

HD contributed significantly to data analysis and manuscript writing; KW performed most of the experiments and drew the figures; YZ and PC contributed substantially to the conception of the study.

Corresponding author

Correspondence to Yiwen Zhang.

Ethics declarations

Competing Interests

The authors have no relevant financial or non-financial interests to disclose.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Dou, H., Wang, K., Zhang, Y. et al. ATConf: auto-tuning high dimensional configuration parameters for big data processing frameworks. Cluster Comput 26, 2737–2755 (2023). https://doi.org/10.1007/s10586-022-03767-0

