research-article
Open Access

Failure Recovery in Resilient X10

Published: 02 July 2019

Abstract

Cloud computing has made the resources needed to execute large-scale in-memory distributed computations widely available. Specialized programming models, e.g., MapReduce, have emerged to offer transparent fault tolerance and fault recovery for specific computational patterns, but they sacrifice generality. In contrast, the Resilient X10 programming language adds failure containment and failure awareness to a general purpose, distributed programming language. A Resilient X10 application spans a number of places. Its formal semantics precisely specify how it continues executing after a place failure. Thanks to failure awareness, the X10 programmer can in principle build redundancy into an application to recover from failures. In practice, however, correctness is elusive, as redundancy and recovery are often complex programming tasks.

This article further develops Resilient X10 to shift the focus from failure awareness to failure recovery, from both a theoretical and a practical standpoint. We rigorously define the distinction between recoverable and catastrophic failures. We revisit the happens-before invariance principle and its implementation. We shift most of the burden of redundancy and recovery from the programmer to the runtime system and standard library. We make it easy to protect critical data from failure using resilient stores and harness elasticity—dynamic place creation—to persist not just the data but also its spatial distribution.

We demonstrate the flexibility and practical usefulness of Resilient X10 by building several representative high-performance in-memory parallel application kernels and frameworks. These codes are 10× to 25× larger than previous Resilient X10 benchmarks. For each application kernel, the average runtime overhead of resiliency is less than 7%. By comparing application kernels written in the Resilient X10 and Spark programming models, we demonstrate that Resilient X10’s more general programming model can enable significantly better application performance for resilient in-memory distributed computations.

            • Published in

              ACM Transactions on Programming Languages and Systems, Volume 41, Issue 3
              September 2019
              278 pages
              ISSN: 0164-0925
              EISSN: 1558-4593
              DOI: 10.1145/3343145

              Copyright © 2019 ACM

              Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

              Publisher

              Association for Computing Machinery

              New York, NY, United States

              Publication History

              • Published: 2 July 2019
              • Accepted: 1 April 2019
              • Revised: 1 December 2018
              • Received: 1 August 2017

              Qualifiers

              • research-article
              • Research
              • Refereed
