Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning

Shen, Huanxing; Li, Cong

doi:10.1007/978-3-319-92040-5_8

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10876)

Included in the following conference series:

International Conference on High Performance Computing

Abstract

Modern distributed computing frameworks for cloud computing and high performance computing typically accelerate job performance by dividing a large job into small tasks that execute in parallel. Some tasks, however, may run far behind the others, which jeopardizes the job completion time. In this paper, we present Zeno, a novel system that automatically identifies and diagnoses stragglers for jobs using machine learning methods. First, the system identifies stragglers with an unsupervised clustering method that groups the tasks based on their execution time. It then uses a supervised rule learning algorithm to learn diagnosis rules that infer the stragglers from their resource assignment and usage data. Zeno is evaluated on traces from a Google Borg system and an Alibaba Fuxi system. The results demonstrate that our system is able to generate simple and easy-to-read rules that offer both valuable insights and decent performance in predicting stragglers.
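The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not name the clustering or rule learning algorithms, so the sketch assumes a two-cluster k-means on task execution time for the first stage and a plain one-level decision stump scored by F1 for the second. The attribute names (`cpu_usage`, `assigned_memory`) and all data values are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence


def two_means_boundary(durations: Sequence[float], iters: int = 100) -> float:
    """Assumed stage 1: 1-D k-means with k=2 on execution time.

    Returns the midpoint between the two cluster centres; tasks whose
    duration exceeds it fall into the slow cluster.
    """
    lo, hi = float(min(durations)), float(max(durations))
    for _ in range(iters):
        boundary = (lo + hi) / 2.0
        fast = [d for d in durations if d <= boundary]
        slow = [d for d in durations if d > boundary]
        if not slow:  # all durations identical: no slow cluster exists
            break
        new_lo, new_hi = sum(fast) / len(fast), sum(slow) / len(slow)
        if (new_lo, new_hi) == (lo, hi):  # centres stopped moving
            break
        lo, hi = new_lo, new_hi
    return (lo + hi) / 2.0


@dataclass
class Rule:
    feature: str
    op: str  # ">" or "<="
    threshold: float
    f1: float = 0.0

    def matches(self, task: Dict[str, float]) -> bool:
        value = task[self.feature]
        return value > self.threshold if self.op == ">" else value <= self.threshold


def f1_score(pred: Sequence[bool], truth: Sequence[bool]) -> float:
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def learn_stump(tasks: List[Dict[str, float]], labels: List[bool]) -> Optional[Rule]:
    """Assumed stage 2: pick the single-condition rule with the best F1."""
    best: Optional[Rule] = None
    for feature in tasks[0]:
        values = sorted({t[feature] for t in tasks})
        # Candidate thresholds are midpoints between adjacent observed values.
        for a, b in zip(values, values[1:]):
            for op in (">", "<="):
                rule = Rule(feature, op, (a + b) / 2.0)
                rule.f1 = f1_score([rule.matches(t) for t in tasks], labels)
                if best is None or rule.f1 > best.f1:
                    best = rule
    return best


# Hypothetical job: execution times plus per-task resource attributes.
durations = [10, 11, 12, 10, 11, 40, 42, 12]
tasks = [
    {"cpu_usage": 0.80, "assigned_memory": 0.30},
    {"cpu_usage": 0.85, "assigned_memory": 0.50},
    {"cpu_usage": 0.90, "assigned_memory": 0.40},
    {"cpu_usage": 0.75, "assigned_memory": 0.60},
    {"cpu_usage": 0.95, "assigned_memory": 0.35},
    {"cpu_usage": 0.20, "assigned_memory": 0.45},
    {"cpu_usage": 0.25, "assigned_memory": 0.55},
    {"cpu_usage": 0.70, "assigned_memory": 0.50},
]

boundary = two_means_boundary(durations)
labels = [d > boundary for d in durations]  # True marks a straggler
rule = learn_stump(tasks, labels)
if rule is not None:
    print(f"straggler if {rule.feature} {rule.op} {rule.threshold:.3f} (F1={rule.f1:.2f})")
```

In this toy job the two slow tasks are also the two with low `cpu_usage`, so a single threshold on that attribute separates them. A single-condition rule is used here because it is about the simplest form of the "simple and easy-to-read rules" the abstract aims for; real traces are noisier and would not generally yield a perfect split.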

Notes

1. Zeno was the Greek philosopher who raised the paradox that the quickest runner can never succeed in overtaking a slow-moving tortoise.
2. https://github.com/google/cluster-data.
3. The attribute is named ‘assigned memory usage’ in the trace. However, according to [20], it denotes the amount of memory assigned, not the amount used.
4. Network utilization, unfortunately, is not available in the trace, which may reduce the success rate of straggler diagnosis.
5. https://github.com/alibaba/clusterdata.
6. https://grouplens.org/datasets/.
7. The same conclusion can be reached if the network usage of the tasks (rather than the default Spark task attributes) is monitored and used for diagnosis.
8. In most cases, a two-level decision tree is equivalent to two rules for inferring stragglers, each with two test conditions connected by the ‘and’ operator. Note that in such cases, a two-level decision tree is more complex than the customized decision stump since two rules are involved.
9. We have also tried the well-recognized state-of-the-art classifier, the support vector machine [7], using the two mainstream tools, SVMLight (http://svmlight.joachims.org/) and LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/), with both the linear kernel and the radial basis function kernel. None of the combinations produces satisfactory results, probably because the degree of class imbalance and noise varies across jobs, which may require fine-tuning the soft margin parameter for each job. As a result, we do not report those results in the paper.

References

1. Ananthanarayanan, G., Ghodsi, A., Shenker, S., Stoica, I.: Effective straggler mitigation: attack of the clones. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI 2013), pp. 185–198 (2013)
2. Bailey, T., Jain, A.K.: A note on distance-weighted k-nearest neighbor rules. IEEE Trans. Syst. Man Cybern. 8(4), 311–313 (1978)
3. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)
4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)
5. Bremer, P.T., Mohr, B., Pascucci, V., Schulz, M. (eds.): Proceedings of the 2nd Workshop on Visual Performance Analysis (VPA 2015) (2015)
6. Bremer, P.T., Gimenez, J., Levine, J.A., Schulz, M. (eds.): Proceedings of the 3rd International Workshop on Visual Performance Analysis (VPA 2016) (2016)
7. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)
8. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137–150 (2004)
9. Garraghan, P., Ouyang, X., Yang, R., McKee, D., Xu, J.: Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. IEEE Trans. Serv. Comput. (2017). https://ieeexplore.ieee.org/document/7572191/
10. Gupta, S., Fritz, C., Price, R., Hoover, R., de Kleer, J., Witteveen, C.: ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters. In: Proceedings of the 10th International Conference on Autonomic Computing (ICAC 2013), pp. 159–165 (2013)
11. Iba, W., Langley, P.: Induction of one-level decision trees. In: Proceedings of the 9th International Workshop on Machine Learning (ML 1992), pp. 233–240 (1992)
12. Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K., Goldberg, A.: Quincy: fair scheduling for distributed computing clusters. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP 2009), pp. 261–276 (2009)
13. Li, C., Shen, H., Huang, T.: Learning to diagnose stragglers in distributed computing. In: Proceedings of the 9th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS 2016), pp. 1–6 (2016)
14. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)
15. Ng, A.Y.: Feature selection, \(L_1\) vs. \(L_2\) regularization, and rotational invariance. In: Proceedings of the 21st International Conference on Machine Learning (ICML 2004) (2004)
16. Ouyang, X., Garraghan, P., McKee, D., Townend, P., Xu, J.: Straggler detection in parallel computing systems through dynamic threshold calculation. In: Proceedings of the 30th International Conference on Advanced Information Networking and Applications (AINA 2016), pp. 414–421 (2016)
17. Pelleg, D., Moore, A.W.: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 727–734 (2000)
18. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)
19. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC 2013) (2013)
20. Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Proceedings of the 10th European Conference on Computer Systems (EuroSys 2015) (2015)
21. Yadwadkar, N.J., Hariharan, B., Gonzalez, J.E., Katz, R.: Multi-task learning for straggler avoiding predictive job scheduling. J. Mach. Learn. Res. 17(1), 3692–3728 (2016)
22. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2010) (2010)
23. Zhang, Z., Li, C., Tao, Y., Yang, R., Tang, H., Xu, J.: Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. Proc. VLDB Endow. 7(13), 1393–1404 (2014)

Acknowledgements

We thank Tai Huang and Jia Bao for their valuable comments and suggestions on an early draft of the paper. We thank the four anonymous reviewers for their valuable comments and criticisms. We thank Xing Zhao for checking the English of the paper. A previous description of the machine learning methods for straggler diagnosis appeared as a 6-page extended abstract at a workshop [13].

Author information

Authors and Affiliations

Intel Corporation, Shanghai, People’s Republic of China
Huanxing Shen & Cong Li

Corresponding author

Correspondence to Huanxing Shen.

Editor information

Editors and Affiliations

Tokyo Institute of Technology, Tokyo, Japan
Rio Yokota
University of Edinburgh, Edinburgh, United Kingdom
Michèle Weiland
King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
David Keyes
Technische Universität München, Garching bei München, Germany
Carsten Trinitis

About this paper

Cite this paper

Shen, H., Li, C. (2018). Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_8

DOI: https://doi.org/10.1007/978-3-319-92040-5_8
Published: 29 May 2018
Publisher Name: Springer, Cham
Print ISBN: 978-3-319-92039-9
Online ISBN: 978-3-319-92040-5
eBook Packages: Computer Science, Computer Science (R0)
