Competition-based failure-aware scheduling for High-Throughput Computing systems on peer-to-peer networks

Cluster Computing

Abstract

In a High-Throughput Computing (HTC) system, system failures and churn are a significant performance limitation. The time consumed by tasks running on a node that suddenly fails (or leaves the system) is wasted. These aborted tasks are usually reinserted into the system for automatic re-execution, which adds further overhead. The problem has been partially addressed through fault-tolerance techniques such as checkpointing and replication, but these solutions introduce overheads of their own. In this work, we present several failure-aware scheduling policies that aim to reduce this waste by matching each submitted task with the best node to run it, taking into consideration the (predicted) duration of the task and the (expected) survival time of the nodes. Experimentation through simulation, in the context of an HTC system built on top of a peer-to-peer network, confirms that our policies, compared to several state-of-the-art alternatives, distribute the workload more effectively and thereby achieve a higher task throughput.
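The matching idea in the abstract can be illustrated with a minimal sketch. This is not one of the paper's policies, which are not detailed here. It assumes a simple best-fit rule: among the nodes expected to outlive the task, choose the one with the least spare survival time, so that long-lived nodes remain available for long tasks. The names `Node`, `expected_survival` and `pick_node` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Node:
    name: str
    # Hypothetical estimate of how long the node will stay up, in seconds.
    expected_survival: float


def pick_node(predicted_duration: float, nodes: Sequence[Node]) -> Optional[Node]:
    """Best-fit matching (an assumed rule, not the paper's policy).

    Prefer the node with the smallest expected survival time that still
    covers the predicted task duration. If no node is expected to outlive
    the task, fall back to the longest-lived node. Return None when there
    are no nodes at all.
    """
    if not nodes:
        return None
    viable = [n for n in nodes if n.expected_survival >= predicted_duration]
    if viable:
        return min(viable, key=lambda n: n.expected_survival)
    return max(nodes, key=lambda n: n.expected_survival)


# Usage: a 300-second task and three candidate nodes.
nodes = [Node("a", 100.0), Node("b", 500.0), Node("c", 2000.0)]
chosen = pick_node(300.0, nodes)
print(chosen.name if chosen else "no node available")
```

In this sketch, a short task never occupies node `c` while node `b` can host it, which is the kind of waste reduction the abstract describes. A real policy would also have to account for prediction error in both estimates.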

Notes

  1. http://www.sc.ehu.es/ccwbayes/members/cperezmig/fas/fasw

Acknowledgments

This work has been partially supported by the Saiotek and Research Groups 2013-2018 (IT-609-13) programs (Basque Government), TIN2013-41272P (Ministry of Science and Technology), COMBIOMED-RD07/0067/0003 network in computational biomedicine (Carlos III Health Institute) and by the NICaiA Project PIRSES-GA-2009-247619 (European Commission). Mr Pérez-Miguel is supported by a doctoral grant from the Basque Government. Jose Miguel-Alonso and Alexander Mendiburu are members of the European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC).

Author information

Corresponding author

Correspondence to Carlos Pérez-Miguel.

About this article

Cite this article

Pérez-Miguel, C., Mendiburu, A. & Miguel-Alonso, J. Competition-based failure-aware scheduling for High-Throughput Computing systems on peer-to-peer networks. Cluster Comput 18, 1229–1249 (2015). https://doi.org/10.1007/s10586-015-0473-5

  • DOI: https://doi.org/10.1007/s10586-015-0473-5
