Proteus: agile ML elasticity through tiered reliability in dynamic resource markets

Authors:

Alexey Tumanov,

Gregory R. Ganger,

Phillip B. Gibbons

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems

Pages 589 - 604

https://doi.org/10.1145/3064176.3064182

Published: 23 April 2017

Abstract

Many shared computing clusters allow users to utilize excess idle resources at lower cost or priority, with the proviso that some or all may be taken away at any time. However, exploiting such dynamic resource availability, and the often fluctuating markets for these resources, requires agile elasticity and effective acquisition strategies. Proteus aggressively exploits such transient, revocable resources to do machine learning (ML) cheaper and/or faster. Its parameter server framework, AgileML, efficiently adapts to bulk additions and revocations of transient machines through a novel 3-stage active-backup approach, with minimal use of more costly non-transient resources. Its BidBrain component adaptively allocates resources from multiple EC2 spot markets to minimize average cost per unit of work as transient resource availability and cost change over time. Our evaluations show that Proteus reduces cost by 85% relative to non-transient pricing, and by 43% relative to previous approaches, while simultaneously reducing runtimes by up to 37%.
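
To make the "cost per work" objective concrete, the following is a minimal sketch of comparing candidate spot allocations by expected cost per unit of work. It is not BidBrain's actual model, which the abstract does not specify. The `Offer` fields, the revocation model (a single probability and a fraction of work lost), the market names, and all numbers are hypothetical assumptions chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Offer:
    """One candidate spot allocation. All fields are illustrative assumptions."""
    name: str
    price_per_hour: float  # hourly price of the whole allocation, in USD
    work_per_hour: float   # useful work completed per hour if not revoked
    revoke_prob: float     # assumed probability of revocation within the hour
    lost_fraction: float   # assumed share of the hour's work lost on revocation


def expected_cost_per_work(offer: Offer) -> float:
    """Price divided by the work expected to survive possible revocation."""
    expected_work = offer.work_per_hour * (
        1.0 - offer.revoke_prob * offer.lost_fraction
    )
    if expected_work <= 0:
        return float("inf")  # no useful work expected, so never preferred
    return offer.price_per_hour / expected_work


def cheapest(offers):
    """Pick the offer with the lowest expected cost per unit of work."""
    return min(offers, key=expected_cost_per_work)


# Hypothetical markets: "market-a" has the lower nominal cost per work
# (2.0 / 100 versus 3.0 / 120) but a high assumed revocation risk.
offers = [
    Offer("market-a", 2.0, 100.0, 0.5, 0.5),
    Offer("market-b", 3.0, 120.0, 0.0, 0.0),
]
best = cheapest(offers)
```

Under these assumed numbers the nominally cheaper market loses once expected lost work is priced in, which is the kind of trade-off an acquisition strategy has to weigh as availability and prices change.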

Published In

EuroSys '17: Proceedings of the Twelfth European Conference on Computer Systems

April 2017

648 pages

ISBN:9781450349383

DOI:10.1145/3064176

Copyright © 2017 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Sponsors

SIGOPS: ACM Special Interest Group on Operating Systems

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Research-article
Research
Refereed limited

Conference

EuroSys '17

Sponsor:

SIGOPS

EuroSys '17: Twelfth EuroSys Conference 2017

April 23 - 26, 2017

Belgrade, Serbia

Acceptance Rates

Overall Acceptance Rate 241 of 1,308 submissions, 18%
