Abstract
The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPC-GAP project we aim to coordinate symbolic computations on architectures with 10^6 cores. At that scale, failures are a real issue. Functional languages offer well-known advantages for both parallelism and reliability; for example, stateless computations can be scheduled and replicated freely.
This paper presents a software-level reliability mechanism, namely supervised fault-tolerant workpools, implemented in a Haskell DSL for parallel programming on distributed-memory architectures. The workpool hides task scheduling, failure detection, and task replication from the programmer. To the best of our knowledge, this is a novel construct. We demonstrate how to abstract over supervised workpools by providing fault-tolerant instances of existing algorithmic skeletons. We evaluate the runtime performance of these skeletons both in the presence and absence of faults, and report low supervision overheads.
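To illustrate the idea of a supervisor that hides failure detection and task replication, here is a minimal, purely sequential Haskell sketch. It is not the paper's HdpH API: the names `Task`, `superviseTask`, and `workpool` are hypothetical, and node failure is modelled abstractly as a list of per-attempt outcomes (`Left` = the executing node failed, `Right` = a result arrived) rather than via real distributed monitoring.

```haskell
-- Illustrative model only, not the HdpH supervised workpool API.
-- A task is modelled as the sequence of outcomes its successive
-- execution attempts would produce: Left = node failure, Right = result.
type Task a = [Either String a]

-- Supervise one task: replicate (re-run) it after each detected
-- failure, until an attempt succeeds or the retry budget runs out.
superviseTask :: Int -> Task a -> Either String a
superviseTask _ []      = Left "no attempts recorded"
superviseTask 0 (a : _) = a
superviseTask n (a : rest) =
  case a of
    Right r -> Right r
    Left _  -> superviseTask (n - 1) rest

-- The workpool supervises every task; scheduling, failure detection,
-- and replication are hidden behind this single call.
workpool :: Int -> [Task a] -> Either String [a]
workpool retries = traverse (superviseTask retries)

main :: IO ()
main = do
  let tasks = [ [Right 1]                    -- succeeds first time
              , [Left "node down", Right 2]  -- recovered by replication
              , [Right 3] ] :: [Task Int]
  print (workpool 1 tasks)  -- Right [1,2,3]
```

The `traverse` over `Either` captures the failure semantics compactly: the pool yields all results, or the first task whose retry budget is exhausted aborts the whole computation with its error.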
Copyright information
© 2013 Springer-Verlag Berlin Heidelberg
Cite this paper
Stewart, R., Trinder, P., Maier, P. (2013). Supervised Workpools for Reliable Massively Parallel Computing. In: Loidl, HW., Peña, R. (eds) Trends in Functional Programming. TFP 2012. Lecture Notes in Computer Science, vol 7829. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40447-4_16
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-642-40446-7
Online ISBN: 978-3-642-40447-4