Supervised Workpools for Reliable Massively Parallel Computing

  • Conference paper
Trends in Functional Programming (TFP 2012)

Part of the book series: Lecture Notes in Computer Science (LNTCS, volume 7829)

Abstract

The manycore revolution is steadily increasing the performance and size of massively parallel systems, to the point where system reliability becomes a pressing concern. Therefore, massively parallel compute jobs must be able to tolerate failures. For example, in the HPC-GAP project we aim to coordinate symbolic computations on architectures with 10^6 cores. At that scale, failures are a real issue. Functional languages are well known for their advantages for both parallelism and reliability, e.g. stateless computations can be scheduled and replicated freely.

This paper presents a software-level reliability mechanism, namely supervised fault-tolerant workpools, implemented in a Haskell DSL for parallel programming on distributed-memory architectures. The workpool hides task scheduling, failure detection and task replication from the programmer. To the best of our knowledge, this is a novel construct. We demonstrate how to abstract over supervised workpools by providing fault-tolerant instances of existing algorithmic skeletons. We evaluate the runtime performance of these skeletons both in the presence and in the absence of faults, and report low supervision overheads.
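
To make the construct concrete, below is a minimal sketch of a supervised workpool written in plain Concurrent Haskell. It is an illustration only, not the paper's implementation: it assumes a shared-memory simulation using forkIO and MVars, whereas the paper's workpool is built in the distributed-memory HdpH DSL, and the names supervisedWorkpool, Task and ftParMap are hypothetical rather than part of the HdpH API. Each task delivers its result through an MVar acting as a future, and a supervisor thread replays any task that raises an exception, standing in for re-scheduling a task lost to node failure.

  {-# LANGUAGE ScopedTypeVariables #-}
  module Main where

  import Control.Concurrent      (forkIO)
  import Control.Concurrent.MVar (newEmptyMVar, putMVar, takeMVar)
  import Control.Exception       (SomeException, evaluate, try)
  import Control.Monad           (forM)

  -- A task is an IO action; failure surfaces as an exception.
  type Task a = IO a

  -- Run every task under its own supervisor thread. Each result is written
  -- to an MVar playing the role of a future; a task that throws is simply
  -- re-executed. A real workpool would bound retries and reschedule the
  -- task on a different node instead of replaying it in the same thread.
  supervisedWorkpool :: [Task a] -> IO [a]
  supervisedWorkpool tasks = do
    slots <- forM tasks $ \task -> do
      slot <- newEmptyMVar
      _    <- forkIO (supervise task slot)
      return slot
    mapM takeMVar slots                    -- collect results in submission order
    where
      supervise task slot = do
        r <- try task
        case r of
          Right v                   -> putMVar slot v
          Left (_ :: SomeException) -> supervise task slot  -- replay the failed task

  -- A fault-tolerant parallel-map skeleton layered on the workpool:
  -- 'evaluate' forces each application (to weak head normal form) inside IO,
  -- so that any exception is raised inside the supervised task.
  ftParMap :: (a -> b) -> [a] -> IO [b]
  ftParMap f = supervisedWorkpool . map (evaluate . f)

  main :: IO ()
  main = ftParMap (\n -> n * n) [1 .. 10 :: Int] >>= print

The ftParMap definition hints at how fault-tolerant skeleton instances can be layered on top of the workpool: the skeleton merely submits its tasks to a supervised pool instead of spawning them unsupervised, and the supervisor, not the skeleton, decides when a task must be replayed.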

Copyright information

© 2013 Springer-Verlag Berlin Heidelberg

About this paper

Cite this paper

Stewart, R., Trinder, P., Maier, P. (2013). Supervised Workpools for Reliable Massively Parallel Computing. In: Loidl, H.-W., Peña, R. (eds.) Trends in Functional Programming. TFP 2012. Lecture Notes in Computer Science, vol 7829. Springer, Berlin, Heidelberg. https://doi.org/10.1007/978-3-642-40447-4_16

  • DOI: https://doi.org/10.1007/978-3-642-40447-4_16

  • Publisher Name: Springer, Berlin, Heidelberg

  • Print ISBN: 978-3-642-40446-7

  • Online ISBN: 978-3-642-40447-4
