
Apollo: Rapidly Picking the Optimal Cloud Configurations for Big Data Analytics Using a Data-Driven Approach

  • Regular Paper
Journal of Computer Science and Technology

Abstract

Big data analytics applications are increasingly deployed on cloud computing infrastructures, yet picking the optimal cloud configurations cost-effectively remains a major challenge. In this paper, we address this problem with high accuracy and low overhead. We propose Apollo, a data-driven approach that rapidly picks the optimal cloud configurations by reusing data from similar workloads. We first classify 12 typical workloads in BigDataBench by characterizing pairwise correlations in our offline benchmarks. When a new workload arrives, we run it with several small datasets to rank its key characteristics and identify its similar workloads. Based on this rank, we then limit the search space of cloud configurations through a classification mechanism. Finally, we leverage a hierarchical regression model to measure which cluster is more suitable and use a local search strategy to pick the optimal cloud configurations in a few extra tests. Our evaluation on 12 typical workloads in HiBench shows that, compared with state-of-the-art approaches, Apollo improves search accuracy by up to 30% while reducing the overhead of picking the optimal cloud configurations by as much as 50%.
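Two of the steps summarized above, matching a new workload to a similar known one and refining a configuration by local search, can be illustrated with a minimal Python sketch. It is not Apollo's implementation. The abstract does not name a similarity measure, so Kendall rank correlation is assumed here, and greedy hill climbing stands in for the paper's local search strategy. The metric profiles, the (nodes, cores) configuration grid, and the cost function are all hypothetical.

```python
from itertools import combinations


def kendall_tau(xs, ys):
    """Kendall rank correlation (tau-a) between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    concordant = discordant = 0
    for i, j in combinations(range(len(xs)), 2):
        sign = (xs[i] - xs[j]) * (ys[i] - ys[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    n_pairs = len(xs) * (len(xs) - 1) // 2
    return (concordant - discordant) / n_pairs


def most_similar(new_profile, known_profiles):
    """Name of the known workload whose metric ranking best matches the new one."""
    return max(known_profiles,
               key=lambda name: kendall_tau(new_profile, known_profiles[name]))


def neighbors(config, max_nodes=10, max_cores=8):
    """Configurations one step away on a hypothetical (nodes, cores) grid."""
    nodes, cores = config
    for dn, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
        n, c = nodes + dn, cores + dc
        if 1 <= n <= max_nodes and 1 <= c <= max_cores:
            yield (n, c)


def local_search(start, cost, max_tests=50):
    """Hill climbing: move to the best neighbor until none improves
    or the budget of test runs (cost evaluations) is spent."""
    best, best_cost, tests = start, cost(start), 1
    while tests < max_tests:
        scored = []
        for cand in neighbors(best):
            if tests >= max_tests:
                break
            scored.append((cost(cand), cand))
            tests += 1
        if not scored:
            break
        cand_cost, cand = min(scored, key=lambda pair: pair[0])
        if cand_cost >= best_cost:
            break  # no neighbor improves: local optimum reached
        best, best_cost = cand, cand_cost
    return best, best_cost, tests


# Hypothetical metric profiles (e.g. CPU, memory, disk, network intensity).
known = {"sort": [0.9, 0.2, 0.7, 0.1], "kmeans": [0.3, 0.8, 0.2, 0.6]}
new_workload = [0.8, 0.3, 0.6, 0.2]


def toy_cost(config):
    """Hypothetical stand-in for a measured cost; cheapest at (6, 4)."""
    nodes, cores = config
    return (nodes - 6) ** 2 + (cores - 4) ** 2


match = most_similar(new_workload, known)
best_config, best_cost, tests_used = local_search((3, 2), toy_cost)
```

In this sketch each call to `cost` represents one test run on the cloud, so `max_tests` caps the search overhead. A real cost function would launch the workload and measure its running time or price.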



Author information

Corresponding author

Correspondence to Heng Wu.


Cite this article

Wu, YW., Xu, YJ., Wu, H. et al. Apollo: Rapidly Picking the Optimal Cloud Configurations for Big Data Analytics Using a Data-Driven Approach. J. Comput. Sci. Technol. 36, 1184–1199 (2021). https://doi.org/10.1007/s11390-021-0232-4


  • DOI: https://doi.org/10.1007/s11390-021-0232-4
