
Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning

  • Conference paper
High Performance Computing (ISC High Performance 2018)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 10876)

Abstract

Modern distributed computing frameworks for cloud computing and high performance computing typically accelerate job performance by dividing a large job into small tasks that execute in parallel. Some tasks, however, may run far behind others, which jeopardizes the job completion time. In this paper, we present Zeno, a novel system that automatically identifies and diagnoses stragglers for jobs using machine learning methods. First, the system identifies stragglers with an unsupervised clustering method that groups the tasks based on their execution time. It then uses a supervised rule learning algorithm to learn diagnosis rules that infer the stragglers from their resource assignment and usage data. Zeno is evaluated on traces from a Google Borg system and an Alibaba Fuxi system. The results demonstrate that our system is able to generate simple and easy-to-read rules with both valuable insights and decent performance in predicting stragglers.
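The two stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the clustering or rule learning details, so the sketch assumes plain k-means with a fixed k = 2 on task durations and a single-threshold decision stump selected by F1 score. The feature names (`cpu_util`, `mem_assigned`) and all data values are hypothetical.

```python
from statistics import mean


def two_means_1d(values, iters=100):
    """Lloyd's algorithm in one dimension with k = 2 (assumed, not from the paper)."""
    lo, hi = min(values), max(values)
    for _ in range(iters):
        mid = (lo + hi) / 2  # nearest-centroid boundary in 1-D
        fast = [v for v in values if v <= mid]
        slow = [v for v in values if v > mid]
        if not slow:  # all durations identical: nothing to split
            break
        new_lo, new_hi = mean(fast), mean(slow)
        if (new_lo, new_hi) == (lo, hi):  # centroids stopped moving
            break
        lo, hi = new_lo, new_hi
    return lo, hi


def label_stragglers(durations):
    """Mark a task as a straggler if it falls in the slower cluster."""
    lo, hi = two_means_1d(durations)
    boundary = (lo + hi) / 2
    return [d > boundary for d in durations]


def f1_score(pred, truth):
    tp = sum(p and t for p, t in zip(pred, truth))
    fp = sum(p and not t for p, t in zip(pred, truth))
    fn = sum(t and not p for p, t in zip(pred, truth))
    return 2 * tp / (2 * tp + fp + fn) if tp else 0.0


def learn_stump(rows, labels):
    """Search every feature and threshold for the one-condition rule with best F1."""
    best = None
    for feat in rows[0]:
        vals = sorted({r[feat] for r in rows})
        for a, b in zip(vals, vals[1:]):
            t = (a + b) / 2  # candidate threshold between adjacent values
            for op in (">", "<="):
                pred = [(r[feat] > t) if op == ">" else (r[feat] <= t) for r in rows]
                score = f1_score(pred, labels)
                if best is None or score > best[3]:
                    best = (feat, op, t, score)
    return best


# Hypothetical task durations (seconds) and per-task resource data.
durations = [10, 11, 12, 10, 11, 40, 42]
rows = [
    {"cpu_util": 0.30, "mem_assigned": 0.10},
    {"cpu_util": 0.35, "mem_assigned": 0.40},
    {"cpu_util": 0.28, "mem_assigned": 0.12},
    {"cpu_util": 0.40, "mem_assigned": 0.38},
    {"cpu_util": 0.33, "mem_assigned": 0.11},
    {"cpu_util": 0.90, "mem_assigned": 0.13},
    {"cpu_util": 0.95, "mem_assigned": 0.39},
]

labels = label_stragglers(durations)  # stage 1: unsupervised labeling
rule = learn_stump(rows, labels)      # stage 2: supervised rule learning
print("straggler if %s %s %.2f (F1 = %.2f)" % rule)
```

On this toy data the two slow tasks form their own cluster, and the learned rule is a single readable condition on `cpu_util`. Selecting by F1 rather than accuracy is a choice made here because stragglers are usually a small minority of tasks.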


Notes

  1. Zeno was the Greek philosopher who raised the paradox that the quickest runner can never succeed in overtaking a slow-moving tortoise.

  2. https://github.com/google/cluster-data.

  3. The attribute is named ‘assigned memory usage’ in the trace. However, according to [20], it denotes the amount of memory assigned, not the amount used.

  4. Network utilization, unfortunately, is not available in the trace, which may affect the success rate of straggler diagnosis.

  5. https://github.com/alibaba/clusterdata.

  6. https://grouplens.org/datasets/.

  7. The same conclusion can be reached if the network usage of the tasks (rather than the default Spark task attributes) is monitored and used for diagnosis.

  8. In most cases, a two-level decision tree is equivalent to two rules for inferring stragglers, each with two test conditions connected with the ‘and’ operator. Note that in such cases, a two-level decision tree is more complex than the customized decision stump since two rules are involved.

  9. We have also tried the support vector machine [7], a well-recognized state-of-the-art classifier, using the two mainstream tools SVMLight (http://svmlight.joachims.org/) and LibSVM (https://www.csie.ntu.edu.tw/~cjlin/libsvm/), each with the linear kernel and the radial basis function kernel. None of the combinations produces satisfactory results, probably because the degree of class imbalance and noise varies across jobs, which may require fine-tuning the soft margin parameter for each job. As a result, we do not report these results in the paper.

References

  1. Ananthanarayanan, G., Ghodsi, A., Shenker, S., Stoica, I.: Effective straggler mitigation: attack of the clones. In: Proceedings of the 10th USENIX Conference on Networked Systems Design and Implementation (NSDI 2013), pp. 185–198 (2013)

  2. Bailey, T., Jain, A.K.: A note on distance-weighted k-nearest neighbor rules. IEEE Trans. Syst. Man Cybern. 8(4), 311–313 (1978)

  3. Blumer, A., Ehrenfeucht, A., Haussler, D., Warmuth, M.: Occam’s razor. Inf. Process. Lett. 24(6), 377–380 (1987)

  4. Breiman, L.: Random forests. Mach. Learn. 45(1), 5–32 (2001)

  5. Bremer, P.T., Mohr, B., Pascucci, V., Schulz, M. (eds.): Proceedings of the 2nd Workshop on Visual Performance Analysis (VPA 2015) (2015)

  6. Bremer, P.T., Gimenez, J., Levine, J.A., Schulz, M. (eds.): Proceedings of the 3rd International Workshop on Visual Performance Analysis (VPA 2016) (2016)

  7. Cortes, C., Vapnik, V.: Support-vector networks. Mach. Learn. 20(3), 273–297 (1995)

  8. Dean, J., Ghemawat, S.: MapReduce: simplified data processing on large clusters. In: Proceedings of the 6th Symposium on Operating System Design and Implementation (OSDI 2004), pp. 137–150 (2004)

  9. Garraghan, P., Ouyang, X., Yang, R., McKee, D., Xu, J.: Straggler root-cause and impact analysis for massive-scale virtualized cloud datacenters. IEEE Trans. Serv. Comput. (2017). https://ieeexplore.ieee.org/document/7572191/

  10. Gupta, S., Fritz, C., Price, R., Hoover, R., de Kleer, J., Witteveen, C.: ThroughputScheduler: learning to schedule on heterogeneous Hadoop clusters. In: Proceedings of the 10th International Conference on Autonomic Computing (ICAC 2013), pp. 159–165 (2013)

  11. Iba, W., Langley, P.: Induction of one-level decision trees. In: Proceedings of the 9th International Workshop on Machine Learning (ML 1992), pp. 233–240 (1992)

  12. Isard, M., Prabhakaran, V., Currey, J., Wieder, U., Talwar, K., Goldberg, A.: Quincy: fair scheduling for distributed computing clusters. In: Proceedings of the ACM SIGOPS 22nd Symposium on Operating Systems Principles (SOSP 2009), pp. 261–276 (2009)

  13. Li, C., Shen, H., Huang, T.: Learning to diagnose stragglers in distributed computing. In: Proceedings of the 9th Workshop on Many-Task Computing on Clouds, Grids, and Supercomputers (MTAGS 2016), pp. 1–6 (2016)

  14. Lloyd, S.P.: Least squares quantization in PCM. IEEE Trans. Inf. Theory 28(2), 129–137 (1982)

  15. Ng, A.Y.: Feature selection, \(L_1\) vs. \(L_2\) regularization, and rotational invariance. In: Proceedings of the 21st International Conference on Machine Learning (ICML 2004) (2004)

  16. Ouyang, X., Garraghan, P., McKee, D., Townend, P., Xu, J.: Straggler detection in parallel computing systems through dynamic threshold calculation. In: Proceedings of the 30th International Conference on Advanced Information Networking and Applications (AINA 2016), pp. 414–421 (2016)

  17. Pelleg, D., Moore, A.W.: X-means: extending k-means with efficient estimation of the number of clusters. In: Proceedings of the 17th International Conference on Machine Learning (ICML 2000), pp. 727–734 (2000)

  18. Quinlan, J.R.: Induction of decision trees. Mach. Learn. 1(1), 81–106 (1986)

  19. Vavilapalli, V.K., Murthy, A.C., Douglas, C., Agarwal, S., Konar, M., Evans, R., Graves, T., Lowe, J., Shah, H., Seth, S., Saha, B., Curino, C., O’Malley, O., Radia, S., Reed, B., Baldeschwieler, E.: Apache Hadoop YARN: yet another resource negotiator. In: Proceedings of the 4th Annual Symposium on Cloud Computing (SOCC 2013) (2013)

  20. Verma, A., Pedrosa, L., Korupolu, M., Oppenheimer, D., Tune, E., Wilkes, J.: Large-scale cluster management at Google with Borg. In: Proceedings of the 10th European Conference on Computer Systems (EuroSys 2015) (2015)

  21. Yadwadkar, N.J., Hariharan, B., Gonzalez, J.E., Katz, R.: Multi-task learning for straggler avoiding predictive job scheduling. J. Mach. Learn. Res. 17(1), 3692–3728 (2016)

  22. Zaharia, M., Chowdhury, M., Franklin, M.J., Shenker, S., Stoica, I.: Spark: cluster computing with working sets. In: Proceedings of the 2nd USENIX Workshop on Hot Topics in Cloud Computing (HotCloud 2010) (2010)

  23. Zhang, Z., Li, C., Tao, Y., Yang, R., Tang, H., Xu, J.: Fuxi: a fault-tolerant resource management and job scheduling system at internet scale. Proc. VLDB Endow. 7(13), 1393–1404 (2014)


Acknowledgements

We thank Tai Huang and Jia Bao for their valuable comments and suggestions on an early draft of the paper. We thank the four anonymous reviewers for their valuable comments and criticisms. We thank Xing Zhao for checking the English of the paper. A previous description of the machine learning methods for straggler diagnosis appeared as a 6-page extended abstract at a workshop [13].

Author information


Corresponding author

Correspondence to Huanxing Shen.


Copyright information

© 2018 Springer International Publishing AG, part of Springer Nature

About this paper


Cite this paper

Shen, H., Li, C. (2018). Zeno: A Straggler Diagnosis System for Distributed Computing Using Machine Learning. In: Yokota, R., Weiland, M., Keyes, D., Trinitis, C. (eds) High Performance Computing. ISC High Performance 2018. Lecture Notes in Computer Science, vol 10876. Springer, Cham. https://doi.org/10.1007/978-3-319-92040-5_8


  • DOI: https://doi.org/10.1007/978-3-319-92040-5_8


  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-319-92039-9

  • Online ISBN: 978-3-319-92040-5

  • eBook Packages: Computer Science (R0)
