FTLLS: A fault tolerant, low latency, distributed scheduling approach based on sparrow

Li, Wenzhuo; Lin, Chuang

doi:10.1007/s12083-017-0590-4

Published: 31 July 2017

Volume 11, pages 1129–1140, (2018)

Peer-to-Peer Networking and Applications


Abstract

Big data processing systems are moving towards higher degrees of parallelism and shorter task durations in order to achieve lower response times. Scheduling highly parallel tasks that complete in under a second poses a great challenge to traditional centralized schedulers. To meet this challenge, researchers have turned to distributed scheduling approaches that avoid the throughput limitation of centralized schedulers, among which Sparrow is a leading design. However, little effort has been devoted to the fault tolerance of Sparrow, and there are problems with Sparrow's sample-based techniques; together these give rise to incomplete jobs and large scheduling latency. We therefore present Fault Tolerant, Low Latency Sparrow (FTLLS), which extends Sparrow with an assistant machine to handle worker failures and to make better scheduling decisions. Simulations show that FTLLS detects worker failures more quickly than a naive timeout approach and makes better scheduling decisions than native Sparrow. Results from an implementation show that FTLLS guarantees no incomplete jobs in the presence of worker failures and reduces scheduling latencies by over 1.5× compared to native Sparrow. In addition, the simplicity of the idea adopted by FTLLS makes it applicable to a wide variety of distributed scheduling approaches.
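To make the phrase "sample-based techniques" concrete, the following is a minimal Python sketch of batch sampling as described in the Sparrow literature: for a job with m tasks, a scheduler probes d·m randomly chosen workers and places the tasks on the m least-loaded ones. It is an illustration under stated assumptions, not the Sparrow or FTLLS implementation. The function name `batch_sample`, the `queue_lengths` list, and the `probe_ratio` parameter are hypothetical, and the sketch omits late binding, the assistant machine, and any failure handling.

```python
import random


def batch_sample(queue_lengths, num_tasks, probe_ratio=2, rng=None):
    """Choose workers for one job by batch sampling (illustrative only).

    queue_lengths: hypothetical list where index = worker id and
                   value = number of tasks queued on that worker.
    num_tasks:     number of tasks in the job (m).
    probe_ratio:   probes per task (d); d = 2 is assumed here.
    """
    if num_tasks > len(queue_lengths):
        raise ValueError("more tasks than workers")
    rng = rng or random.Random()
    # Probe d * m distinct workers, capped at the cluster size.
    num_probes = min(len(queue_lengths), probe_ratio * num_tasks)
    probed = rng.sample(range(len(queue_lengths)), num_probes)
    # Keep the m probed workers with the shortest queues.
    probed.sort(key=lambda worker: queue_lengths[worker])
    return probed[:num_tasks]


if __name__ == "__main__":
    loads = [5, 0, 3, 1, 4, 2]
    # 2 * 3 = 6 probes cover every worker, so the result is deterministic.
    print(batch_sample(loads, num_tasks=3))  # [1, 3, 5]
```

Because placement relies only on a random sample of queue lengths, a probed worker that has failed or whose reported load is stale can lead to a poor or lost placement, which is the kind of problem the abstract attributes to sampling.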



Acknowledgements

This is my first journal submission, and I have learned a great deal from it. This research benefited from the insightful guidance of Professor Lin. I am also indebted to Yin Li for helpful comments on several drafts of this paper and for help with simulations of our mechanism. Both were generous with help and suggestions whenever I encountered problems in this research, which kept me improving. I also wish to thank all the teachers and senior colleagues in my research lab for their valuable suggestions.

This work was supported by the National Natural Science Foundation of China (No. 61472199), the National Basic Research Program of China (973 Program) under grant 2010CB328105, and the Tsinghua University Initiative Scientific Research Program (No. 20121087999).

Author information

Authors and Affiliations

Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
Wenzhuo Li & Chuang Lin
Department of Computer Science and Technology, Tsinghua University, 30 Shuangqing Rd, Haidian Qu, Beijing Shi, China
Wenzhuo Li & Chuang Lin

Authors

Wenzhuo Li
Chuang Lin

Corresponding author

Correspondence to Wenzhuo Li.

Additional information

This article is part of the Topical Collection: Special Issue on Big Data Networking

Guest Editors: Xiaofei Liao, Song Guo, Deze Zeng, and Kun Wang


About this article

Cite this article

Li, W., Lin, C. FTLLS: A fault tolerant, low latency, distributed scheduling approach based on sparrow. Peer-to-Peer Netw. Appl. 11, 1129–1140 (2018). https://doi.org/10.1007/s12083-017-0590-4


Received: 27 December 2016
Accepted: 12 July 2017
Published: 31 July 2017
Issue Date: September 2018
DOI: https://doi.org/10.1007/s12083-017-0590-4

