DOI: 10.1145/3652892.3700753
research-article

Dexter: A Performance-Cost Efficient Resource Allocation Manager for Serverless Data Analytics

Published: 02 December 2024

Abstract

Leveraging serverless platforms for the efficient execution of distributed data analytics frameworks, such as Apache Spark [3], has gained substantial interest since early 2022. The elasticity, freedom from infrastructure management, and on-demand scalability of serverless computing have motivated efforts to deploy distributed data analytics applications on serverless platforms. However, auto-scaling resources effectively for such complex workloads, so as to fully benefit from the resource elasticity of serverless, remains challenging. Misconfiguration can cause severe performance and cost problems through resource under- and over-provisioning.
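To make the under- and over-provisioning trade-off concrete, here is a deliberately idealized, hypothetical model of a single Spark stage. The function name `stage_runtime_and_cost`, uniform task durations, and billing every executor for the whole stage at a flat per-executor-second price are assumptions made for illustration only; this is not Dexter's cost model.

```python
import math


def stage_runtime_and_cost(num_tasks: int, task_seconds: float,
                           executors: int, cores_per_executor: int = 1,
                           price_per_executor_second: float = 1.0):
    """Hypothetical, idealized model of one stage: identical tasks run in
    waves across all executor cores, and every executor is billed for the
    full stage duration, busy or idle."""
    slots = executors * cores_per_executor          # concurrent task slots
    waves = math.ceil(num_tasks / slots)            # rounds of parallel tasks
    runtime = waves * task_seconds                  # stage wall-clock time (s)
    cost = executors * runtime * price_per_executor_second
    return runtime, cost


if __name__ == "__main__":
    # 100 tasks of 10 s each, one core per executor.
    for n in (10, 30, 100, 200):
        print(n, stage_runtime_and_cost(100, 10.0, n))
    # 10 (100.0, 1000.0)   under-provisioned: 10x slower than with 100 executors
    # 30 (40.0, 1200.0)    last wave keeps only 10 of 30 executors busy
    # 100 (10.0, 1000.0)   one full wave, no idle slots
    # 200 (10.0, 2000.0)   over-provisioned: no speedup, double the cost
```

Even in this toy setting, too few executors stretch the runtime, while too many add idle, billed capacity without making the stage any faster.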
In this paper, we present Dexter, a robust resource allocation manager that dynamically allocates resources at a fine-grained level to guarantee performance-cost efficiency (i.e., to optimize the total runtime cost). Dexter is novel in combining predictive and reactive strategies that fully leverage the elasticity of serverless computing to improve the performance-cost efficiency of workflow executions. Unlike black-box ML models, Dexter quickly reaches a sufficiently good solution, prioritizing simplicity, generality, and ease of understanding. Our experimental evaluation shows that, compared with the default serverless Spark resource allocation, which dynamically requests exponentially more executors to accommodate pending tasks, our solution reduces cost by up to 4.65× and improves performance-cost efficiency by up to 3.50×. Dexter also yields substantial resource savings, requiring up to 5.75× fewer resources.
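Spark's documentation describes the baseline request policy: executors are requested once tasks have been pending for `spark.dynamicAllocation.schedulerBacklogTimeout`, the request repeats every `spark.dynamicAllocation.sustainedSchedulerBacklogTimeout` while the backlog persists, and the number of executors added per round grows exponentially (1, 2, 4, 8, ...). The following is a minimal, hypothetical sketch of that ramp-up (`spark_style_rampup` is an illustrative name). It assumes a static backlog and ignores timeouts, executor start-up latency, and task completions, so it simplifies Spark's actual allocation manager and is not Dexter's algorithm.

```python
def spark_style_rampup(pending_tasks: int, tasks_per_executor: int,
                       max_executors: int, initial_executors: int = 0):
    """Simplified model of the exponential ramp-up in Spark's dynamic
    allocation: while tasks stay pending, each request round adds twice as
    many executors as the previous one (1, 2, 4, 8, ...). The target is
    capped at enough executors to run every pending task at once, and at
    max_executors. Returns the executor count after each round.

    Assumes a static backlog: no task finishes during the ramp-up, and
    backlog timeouts and executor start-up delays are ignored."""
    # Ceiling division: executors needed to run all pending tasks in parallel.
    needed = (pending_tasks + tasks_per_executor - 1) // tasks_per_executor
    target = min(max_executors, needed)
    total, to_add, history = initial_executors, 1, []
    while total < target:
        total = min(target, total + to_add)
        history.append(total)
        to_add *= 2  # request size doubles every round
    return history


if __name__ == "__main__":
    # 100 pending tasks, 4 task slots per executor -> target of 25 executors.
    print(spark_style_rampup(100, 4, max_executors=1000))  # [1, 3, 7, 15, 25]
    print(spark_style_rampup(100, 4, max_executors=10))    # [1, 3, 7, 10]
```

Because the target tracks only the current backlog, this style of policy is purely reactive; the abstract contrasts it with Dexter, which combines reactive and predictive strategies.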

References

[1]
Ahsan Ali, Riccardo Pinciroli, Feng Yan, and Evgenia Smirni. Batch: Machine learning inference serving on serverless platforms with adaptive batching. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1--15, 2020.
[2]
Amazon. URL https://aws.amazon.com/it/blogs/machine-learning/code-free-machine-learning-automl-with-autogluon-amazon-sagemaker-and-aws-lambda/.
[3]
Apache Spark: Unified engine for large-scale data analytics. URL https://spark.apache.org.
[4]
AWS Lambda. URL https://aws.amazon.com/lambda/.
[5]
Daniel Barcelona-Pons, Marc Sánchez-Artigas, Gerard París, Pierre Sutra, and Pedro García-López. On the FaaS track: Building stateful distributed applications with serverless architectures. In Proceedings of the 20th International Middleware Conference, Middleware '19, page 41--54, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450370097.
[6]
Luiz André Barroso, Jimmy Clidaras, and Urs Hölzle. The datacenter as a computer: An introduction to the design of warehouse-scale machines. Synthesis lectures on computer architecture, 8(3):1--154, 2013.
[7]
Anirban Bhattacharjee, Yogesh Barve, Shweta Khare, Shunxing Bao, Aniruddha Gokhale, and Thomas Damiano. Stratum: A serverless framework for the lifecycle management of machine learning-based data analytics tasks. In 2019 USENIX Conference on Operational Machine Learning (OpML 19), pages 59--61, Santa Clara, CA, May 2019. USENIX Association. ISBN 978-1-939133-00-7. URL https://www.usenix.org/conference/opml19/presentation/bhattacharjee.
[8]
Anirban Bhattacharjee, Ajay Dev Chhokra, Zhuangwei Kang, Hongyang Sun, Aniruddha Gokhale, and Gabor Karsai. BARISTA: efficient and scalable serverless serving system for deep learning prediction services. CoRR, abs/1904.01576, 2019. URL http://arxiv.org/abs/1904.01576.
[9]
Joao Carreira. A case for serverless machine learning. 2018.
[10]
Joao Carreira, Pedro Fonseca, Alexey Tumanov, Andrew Zhang, and Randy Katz. Cirrus: A serverless framework for end-to-end ML workflows. In Proceedings of the ACM Symposium on Cloud Computing, SoCC '19, page 13--24, New York, NY, USA, 2019. Association for Computing Machinery. ISBN 9781450369732.
[11]
Databricks SQL Serverless. URL https://www.databricks.com/blog/announcing-general-availability-databricks-sql-serverless.
[12]
Dataproc Serverless. URL https://cloud.google.com/dataproc-serverless/docs.
[13]
Christina Delimitrou and Christos Kozyrakis. Quasar: Resource-efficient and QoS-aware cluster management. SIGPLAN Not., 49(4):127--144, feb 2014. ISSN 0362-1340.
[14]
Stratos Dimopoulos, Chandra Krintz, and Rich Wolski. Justice: A deadline-aware, fair-share resource allocator for implementing multi-analytics. In 2017 IEEE International Conference on Cluster Computing (CLUSTER), pages 233--244, 2017.
[15]
Jonatan Enes, Roberto R. Expósito, and Juan Touriño. Real-time resource scaling platform for big data workloads on serverless environments. Future Generation Computer Systems, 105:361--379, 2020. ISSN 0167-739X. URL https://www.sciencedirect.com/science/article/pii/S0167739X19310015.
[16]
Lang Feng, Prabhakar Kudva, Dilma Da Silva, and Jiang Hu. Exploring serverless computing for neural network training. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 334--341, 2018.
[17]
Jerome H. Friedman. Greedy function approximation: A gradient boosting machine. Ann. Statist., 29(5):1189--1232, 2001.
[18]
Han Gao, Zhengyu Yang, Janki Bhimani, Teng Wang, Jiayin Wang, Bo Sheng, and Ningfang Mi. AutoPath: Harnessing parallel execution paths for efficient resource allocation in multi-stage big data frameworks. In 2017 26th International Conference on Computer Communication and Networks (ICCCN), pages 1--9, 2017.
[19]
gRPC. URL https://grpc.io.
[20]
Vipul Gupta, Swanand Kadhe, Thomas A. Courtade, Michael W. Mahoney, and Kannan Ramchandran. OverSketched Newton: Fast convex optimization for serverless systems. CoRR, abs/1903.08857, 2019. URL http://arxiv.org/abs/1903.08857.
[21]
Rui Han, Chi Harold Liu, Zan Zong, Lydia Y. Chen, Wending Liu, Siyi Wang, and Jianfeng Zhan. Workload-adaptive configuration tuning for hierarchical cloud schedulers. IEEE Transactions on Parallel and Distributed Systems, 30(12): 2879--2895, 2019.
[22]
Joseph M. Hellerstein, Jose M. Faleiro, Joseph E. Gonzalez, Johann Schleier-Smith, Vikram Sreekanti, Alexey Tumanov, and Chenggang Wu. Serverless computing: One step forward, two steps back. CoRR, abs/1812.03651, 2018. URL http://arxiv.org/abs/1812.03651.
[23]
Wayne Iba and Pat Langley. Induction of one-level decision trees. In Derek Sleeman and Peter Edwards, editors, Machine Learning Proceedings 1992, pages 233--240. Morgan Kaufmann, San Francisco (CA), 1992. ISBN 978-1-55860-247-2. URL https://www.sciencedirect.com/science/article/pii/B9781558602472500358.
[24]
IBM Analytics Engine. URL https://cloud.ibm.com/docs/AnalyticsEngine?topic=AnalyticsEngine-getting-started.
[25]
Jananie Jarachanthan, Li Chen, Fei Xu, and Bo Li. Astrea: Auto-serverless analytics towards cost-efficiency and QoS-awareness. IEEE Transactions on Parallel and Distributed Systems, 33(12):3833--3849, 2022.
[26]
Jiawei Jiang, Shaoduo Gan, Yue Liu, Fanlin Wang, Gustavo Alonso, Ana Klimovic, Ankit Singla, Wentao Wu, and Ce Zhang. Towards demystifying serverless machine learning training. In Proceedings of the 2021 International Conference on Management of Data, SIGMOD '21, page 857--871, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450383431.
[27]
Chao Jin, Zili Zhang, Xingyu Xiang, Songyun Zou, Gang Huang, Xuanzhe Liu, and Xin Jin. Ditto: Efficient serverless analytics with elastic parallelism. In Proceedings of the ACM SIGCOMM 2023 Conference, ACM SIGCOMM '23, page 406--419, New York, NY, USA, 2023. Association for Computing Machinery. ISBN 9798400702365.
[28]
Eric Jonas, Qifan Pu, Shivaram Venkataraman, Ion Stoica, and Benjamin Recht. Occupy the cloud: Distributed computing for the 99%. In Proceedings of the 2017 Symposium on Cloud Computing, SoCC '17, page 445--451, New York, NY, USA, 2017. Association for Computing Machinery. ISBN 9781450350280.
[29]
Eric Jonas, Johann Schleier-Smith, Vikram Sreekanti, Chia-che Tsai, Anurag Khandelwal, Qifan Pu, Vaishaal Shankar, João Carreira, Karl Krauth, Neeraja Jayant Yadwadkar, Joseph E. Gonzalez, Raluca Ada Popa, Ion Stoica, and David A. Patterson. Cloud programming simplified: A Berkeley view on serverless computing. CoRR, abs/1902.03383, 2019. URL http://arxiv.org/abs/1902.03383.
[30]
Sangeetha Abdu Jyothi, Carlo Curino, Ishai Menache, Shravan Matthur Narayanamurthy, Alexey Tumanov, Jonathan Yaniv, Ruslan Mavlyutov, Inigo Goiri, Subru Krishnan, Janardhan Kulkarni, and Sriram Rao. Morpheus: Towards automated SLOs for enterprise clusters. In 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16), pages 117--134, Savannah, GA, November 2016. USENIX Association. ISBN 978-1-931971-33-1. URL https://www.usenix.org/conference/osdi16/technical-sessions/presentation/jyothi.
[31]
Simon Kassing, Ingo Müller, and Gustavo Alonso. Resource allocation in serverless query processing, 2022.
[32]
Youngbin Kim and Jimmy Lin. Serverless data analytics with Flint. In 2018 IEEE 11th International Conference on Cloud Computing (CLOUD), pages 451--455, 2018.
[33]
Ana Klimovic, Yawen Wang, Patrick Stuedi, Animesh Trivedi, Jonas Pfefferle, and Christos Kozyrakis. Pocket: Elastic ephemeral storage for serverless analytics. In 13th USENIX Symposium on Operating Systems Design and Implementation (OSDI 18), pages 427--444, Carlsbad, CA, October 2018. USENIX Association. ISBN 978-1-939133-08-3. URL https://www.usenix.org/conference/osdi18/presentation/klimovic.
[34]
Rashmi Korlakai Vinayak and Ran Gilad-Bachrach. DART: Dropouts meet Multiple Additive Regression Trees. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 489--497, San Diego, California, USA, 09--12 May 2015. PMLR. URL https://proceedings.mlr.press/v38/korlakaivinayak15.html.
[35]
Zijun Li, Yushi Liu, Linsong Guo, Quan Chen, Jiagan Cheng, Wenli Zheng, and Minyi Guo. FaaSFlow: Enable efficient workflow execution for function-as-a-service. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '22, page 782--796, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450392051.
[36]
Ashraf Mahgoub, Karthick Shankar, Subrata Mitra, Ana Klimovic, Somali Chaterji, and Saurabh Bagchi. SONIC: Application-aware data passing for chained serverless applications. In 2021 USENIX Annual Technical Conference (USENIX ATC 21), pages 285--301. USENIX Association, July 2021. ISBN 978-1-939133-23-6. URL https://www.usenix.org/conference/atc21/presentation/mahgoub.
[37]
Hongzi Mao, Malte Schwarzkopf, Shaileshh Bojja Venkatakrishnan, Zili Meng, and Mohammad Alizadeh. Learning scheduling algorithms for data processing clusters. CoRR, abs/1810.01963, 2018. URL http://arxiv.org/abs/1810.01963.
[38]
MinIO Object Storage. URL https://min.io.
[39]
Ingo Müller, Renato Marroquín, and Gustavo Alonso. Lambada: Interactive data analytics on cold data using serverless cloud infrastructure. CoRR, abs/1912.00937, 2019. URL http://arxiv.org/abs/1912.00937.
[40]
Ingo Müller, Renato Marroquín, and Gustavo Alonso. Lambada: Interactive data analytics on cold data using serverless cloud infrastructure. In Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data, SIGMOD '20, page 115--130, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450367356.
[41]
Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank Citation Ranking: Bringing Order to the Web. Technical report, Stanford Digital Library Technologies Project, 1998. URL http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.31.1768.
[42]
Jun Woo Park, Alexey Tumanov, Angela Jiang, Michael A. Kozuch, and Gregory R. Ganger. 3Sigma: Distribution-based cluster scheduling for runtime uncertainty. In Proceedings of the Thirteenth EuroSys Conference, EuroSys '18, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450355841.
[43]
Matthew Perron, Raul Castro Fernandez, David J. DeWitt, and Samuel Madden. Starling: A scalable query engine on cloud function services. CoRR, abs/1911.11727, 2019. URL http://arxiv.org/abs/1911.11727.
[44]
Anish Pimpley, Shuo Li, Anubha Srivastava, Vishal Rohra, Yi Zhu, Soundararajan Srinivasan, Alekh Jindal, Hiren Patel, Shi Qiao, and Rathijit Sen. Optimal resource allocation for serverless queries. arXiv, July 2021. URL https://www.microsoft.com/en-us/research/publication/optimal-resource-allocation-for-serverless-queries/.
[45]
Conor Power, Hiren Patel, Alekh Jindal, Jyoti Leeka, Bob Jenkins, Michael Rys, Ed Triou, Dexin Zhu, Lucky Katahanas, Chakrapani Bhat Talapady, Joshua Rowe, Fan Zhang, Rich Draves, Marc Friedman, Ivan Santa Maria Filho, and Amrish Kumar. The Cosmos big data platform at Microsoft: Over a decade of progress and a decade to look forward. Proc. VLDB Endow., 14(12):3148--3161, jul 2021. ISSN 2150-8097.
[46]
Conor Power, Hiren Patel, Alekh Jindal, Jyoti Leeka, Bob Jenkins, Michael Rys, Ed Triou, Dexin Zhu, Lucky Katahanas, Chakrapani Bhat Talapady, et al. The Cosmos big data platform at Microsoft: Over a decade of progress and a decade to look forward. Proceedings of the VLDB Endowment, 14(12):3148--3161, 2021.
[47]
Protocol Buffers. URL https://protobuf.dev; https://github.com/protocolbuffers/protobuf.
[48]
Qifan Pu, Shivaram Venkataraman, and Ion Stoica. Shuffling, fast and slow: Scalable analytics on serverless infrastructure. In 16th USENIX Symposium on Networked Systems Design and Implementation (NSDI 19), pages 193--206, Boston, MA, February 2019. USENIX Association. ISBN 978-1-931971-49-2. URL https://www.usenix.org/conference/nsdi19/presentation/pu.
[49]
Stuart J. Russell and Peter Norvig. Artificial Intelligence: A Modern Approach. Pearson, 2016.
[50]
Josep Sampé, Gil Vernik, Marc Sánchez-Artigas, and Pedro García-López. Serverless data analytics in the IBM Cloud. In Proceedings of the 19th International Middleware Conference Industry, Middleware '18, page 1--8, New York, NY, USA, 2018. Association for Computing Machinery. ISBN 9781450360166.
[51]
Marc Sánchez-Artigas and Germán T. Eizaguirre. A seer knows best: Optimized object storage shuffling for serverless analytics. In Proceedings of the 23rd ACM/IFIP International Middleware Conference, Middleware '22, page 148--160, New York, NY, USA, 2022. Association for Computing Machinery. ISBN 9781450393409.
[52]
Marc Sánchez-Artigas and Pablo Gimeno Sarroca. Experience paper: Towards enhancing cost efficiency in serverless machine learning training. In Proceedings of the 22nd International Middleware Conference, Middleware '21, page 210--222, New York, NY, USA, 2021. Association for Computing Machinery. ISBN 9781450385343.
[53]
Marc Sánchez-Artigas, Germán T. Eizaguirre, Gil Vernik, Lachlan Stuart, and Pedro García-López. Primula: A practical shuffle/sort operator for serverless computing. In Proceedings of the 21st International Middleware Conference Industrial Track, Middleware '20, page 31--37, New York, NY, USA, 2020. Association for Computing Machinery. ISBN 9781450382014.
[54]
Pablo Gimeno Sarroca and Marc Sánchez-Artigas. On data processing through the lenses of S3 Object Lambda. In IEEE INFOCOM 2023 - IEEE Conference on Computer Communications, pages 1--10, 2023.
[55]
Malte Schwarzkopf, Andy Konwinski, Michael Abd-El-Malek, and John Wilkes. Omega: Flexible, scalable schedulers for large compute clusters. In Proceedings of the 8th ACM European Conference on Computer Systems, EuroSys '13, page 351--364, New York, NY, USA, 2013. Association for Computing Machinery. ISBN 9781450319942.
[56]
Rathijit Sen, Alekh Jindal, Hiren Patel, and Shi Qiao. AutoToken: Predicting peak parallelism for big data analytics at Microsoft. Proc. VLDB Endow., 13(12): 3326--3339, aug 2020. ISSN 2150-8097. URL https://doi.org/10.14778/3415478.3415554.
[57]
Rathijit Sen, Abhishek Roy, and Alekh Jindal. Predictive price-performance optimization for serverless query processing. CoRR, abs/2112.08572, 2021. URL https://arxiv.org/abs/2112.08572.
[58]
Mohammad Shahrad, Rodrigo Fonseca, Inigo Goiri, Gohar Chaudhry, Paul Batum, Jason Cooke, Eduardo Laureano, Colby Tresness, Mark Russinovich, and Ricardo Bianchini. Serverless in the wild: Characterizing and optimizing the serverless workload at a large cloud provider. In 2020 USENIX Annual Technical Conference (USENIX ATC 20), pages 205--218. USENIX Association, July 2020. ISBN 978-1-939133-14-4. URL https://www.usenix.org/conference/atc20/presentation/shahrad.
[59]
Subhajit Sidhanta, Wojciech Golab, and Supratik Mukhopadhyay. OptEx: A deadline-aware cost optimization model for Spark. In 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid), pages 193--202, 2016.
[60]
Vikram Sreekanti, Harikaran Subbaraj, Chenggang Wu, Joseph E. Gonzalez, and Joseph M. Hellerstein. Optimizing prediction serving on low-latency serverless dataflow. CoRR, abs/2007.05832, 2020. URL https://arxiv.org/abs/2007.05832.
[61]
TPC-DS Benchmark. URL https://www.tpc.org/tpcds/.
[62]
TPC-H Benchmark. URL https://www.tpc.org/tpch/.
[63]
Verified Market Research. URL https://www.verifiedmarketresearch.com/product/serverless-architecture-market/.
[64]
Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. In Proceedings of the Tenth European Conference on Computer Systems, EuroSys '15, New York, NY, USA, 2015. Association for Computing Machinery. ISBN 9781450332385.
[65]
Hao Wang, Di Niu, and Baochun Li. Distributed machine learning with a serverless architecture. In IEEE INFOCOM 2019 - IEEE Conference on Computer Communications, pages 1288--1296, 2019.
[66]
Liang Wang, Mengyuan Li, Yinqian Zhang, Thomas Ristenpart, and Michael Swift. Peeking behind the curtains of serverless platforms. In 2018 USENIX Annual Technical Conference (USENIX ATC 18), pages 133--146, Boston, MA, July 2018. USENIX Association. ISBN 978-1-939133-01-4. URL https://www.usenix.org/conference/atc18/presentation/wang-liang.
[67]
Owen O'Malley. Terabyte sort on Apache Hadoop. Yahoo!, May 2008. URL http://www.hpl.hp.com/hosted/sortbenchmark/YahooHadoop.pdf.
[68]
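Yanan Yang, Laiping Zhao, Yiming Li, Huanyu Zhang, Jie Li, Mingyang Zhao, Xingzhen Chen, and Keqiu Li. INFless: A native serverless system for low-latency, high-throughput inference. In Proceedings of the 27th ACM International Conference on Architectural Support for Programming Languages and Operating Systems, ASPLOS '22, pages 768--781, February 2022.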
[69]
Hanfei Yu, Hao Wang, Jian Li, and Seung-Jong Park. Harvesting idle resources in serverless computing via reinforcement learning. CoRR, abs/2108.12717, 2021. URL https://arxiv.org/abs/2108.12717.
[70]
Hong Zhang, Yupeng Tang, Anurag Khandelwal, Jingrong Chen, and Ion Stoica. Caerus: Nimble task scheduling for serverless analytics. In 18th USENIX Symposium on Networked Systems Design and Implementation (NSDI 21), pages 653--669. USENIX Association, April 2021. ISBN 978-1-939133-21-2. URL https://www.usenix.org/conference/nsdi21/presentation/zhang-hong.
[71]
Hong Zhang, Yupeng Tang, Anurag Khandelwal, Jingrong Chen, and Ion Stoica. Caerus: Nimble task scheduling for serverless analytics. In Symposium on Networked Systems Design and Implementation, 2021.

    Published In

    ACM Other conferences
    Middleware '24: Proceedings of the 25th International Middleware Conference
    December 2024
    515 pages
    ISBN:9798400706233
    DOI:10.1145/3652892
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

    In-Cooperation

    • IFIP
    • Usenix

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Author Tags

    1. serverless
    2. resource allocation
    3. data analytics
    4. spark
    5. stage

    Funding Sources

    • EU
    • MICINN
    • AGAUR

    Conference

    Middleware '24
    Middleware '24: 25th International Middleware Conference
    December 2 - 6, 2024
    Hong Kong, Hong Kong

    Acceptance Rates

    Overall Acceptance Rate 203 of 948 submissions, 21%

    Article Metrics

    • Total Citations: 0
    • Total Downloads: 113
    • Downloads (Last 12 months): 113
    • Downloads (Last 6 weeks): 22
    Reflects downloads up to 19 Feb 2025
