Competition-based failure-aware scheduling for High-Throughput Computing systems on peer-to-peer networks

Pérez-Miguel, Carlos; Mendiburu, Alexander; Miguel-Alonso, Jose

doi:10.1007/s10586-015-0473-5

Published: 28 July 2015

Cluster Computing, Volume 18, pages 1229–1249 (2015)

Carlos Pérez-Miguel¹,
Alexander Mendiburu¹ &
Jose Miguel-Alonso¹

Abstract

In a High-Throughput Computing (HTC) system, system failures and churn impose an important performance limitation. The time consumed by tasks running on a node that suddenly fails (or abandons the system) is a waste of resources. These aborted tasks are usually reinserted into the system for automatic re-execution, causing additional overhead. This problem has been partially addressed via fault-tolerant techniques such as checkpointing and replication, but these solutions introduce overheads of their own. In this work, we present several failure-aware scheduling policies that aim to reduce the waste of resources by matching each submitted task with the best node to run it, taking into consideration the (predicted) duration of the task and the (expected) survival time of the nodes. Experimentation through simulation, in the context of an HTC system built on top of a peer-to-peer network, confirms that our policies, compared to several state-of-the-art alternatives, result in a more effective distribution of workload and, consequently, a higher task throughput.
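The matching idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' policy, which the abstract does not specify; it is a simple best-fit rule chosen for illustration. The `Node` class, the `pick_node` function, and the survival-time values are all hypothetical: a task with a predicted duration is assigned to the node whose expected survival time covers that duration with the least slack, so that long-lived nodes stay free for long tasks.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    """Hypothetical node record; expected_survival is an estimate in seconds."""
    name: str
    expected_survival: float


def pick_node(task_duration: float, nodes: Sequence[Node]) -> Optional[Node]:
    """Best-fit rule (an assumption, not the paper's policy).

    Among nodes expected to outlive the task, return the one with the
    smallest slack between expected survival and predicted duration.
    Return None when no node is expected to survive long enough.
    """
    candidates = [n for n in nodes if n.expected_survival >= task_duration]
    if not candidates:
        return None
    return min(candidates, key=lambda n: n.expected_survival - task_duration)


# Illustrative values only.
nodes = [Node("a", 600.0), Node("b", 3600.0), Node("c", 86400.0)]
chosen = pick_node(1800.0, nodes)
print(chosen.name if chosen else "no suitable node")
```

With these illustrative values, a 30-minute task goes to node `b` rather than the much longer-lived node `c`. A rule of this kind depends entirely on the quality of the duration and survival estimates, which is why the abstract stresses that both are predictions.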

Notes

http://www.sc.ehu.es/ccwbayes/members/cperezmig/fas/fasw

Acknowledgments

This work has been partially supported by the Saiotek and Research Groups 2013-2018 (IT-609-13) programs (Basque Government), TIN2013-41272P (Ministry of Science and Technology), COMBIOMED-RD07/0067/0003 network in computational biomedicine (Carlos III Health Institute) and by the NICaiA Project PIRSES-GA-2009-247619 (European Commission). Mr Pérez-Miguel is supported by a doctoral grant from the Basque Government. Jose Miguel-Alonso and Alexander Mendiburu are members of the European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC).

Author information

Authors and Affiliations

Intelligent Systems Group, Department of Computer Architecture and Technology, School of Computer Science, University of the Basque Country UPV/EHU, Donostia-San Sebastian, Spain
Carlos Pérez-Miguel, Alexander Mendiburu & Jose Miguel-Alonso

Corresponding author

Correspondence to Carlos Pérez-Miguel.

About this article

Cite this article

Pérez-Miguel, C., Mendiburu, A. & Miguel-Alonso, J. Competition-based failure-aware scheduling for High-Throughput Computing systems on peer-to-peer networks. Cluster Comput 18, 1229–1249 (2015). https://doi.org/10.1007/s10586-015-0473-5

Received: 04 February 2015
Revised: 06 July 2015
Accepted: 20 July 2015
Published: 28 July 2015
Issue Date: September 2015
DOI: https://doi.org/10.1007/s10586-015-0473-5
