Open access

Failure Recovery in Resilient X10

Published: 02 July 2019

Abstract

Cloud computing has made the resources needed to execute large-scale in-memory distributed computations widely available. Specialized programming models, e.g., MapReduce, have emerged to offer transparent fault tolerance and fault recovery for specific computational patterns, but they sacrifice generality. In contrast, the Resilient X10 programming language adds failure containment and failure awareness to a general-purpose, distributed programming language. A Resilient X10 application spans a number of places. Its formal semantics precisely specify how it continues executing after a place failure. Thanks to failure awareness, the X10 programmer can in principle build redundancy into an application to recover from failures. In practice, however, correctness is elusive, as redundancy and recovery are often complex programming tasks.
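To make the failure-awareness model concrete, the following minimal sketch (an illustration, not code from the article; the per-place work is an invented placeholder) shows the usual X10 idiom: a finish/at/async fan-out over all places, where the death of a remote place surfaces as an exception at the waiting finish.

    // Sketch of failure awareness in Resilient X10 (illustrative only).
    // finish, async, at, Place, MultipleExceptions, and DeadPlaceException
    // are standard X10; the work done at each place is a placeholder.
    public class FailureAwarenessSketch {
        public static def main(args:Rail[String]) {
            try {
                finish for (p in Place.places()) at (p) async {
                    Console.OUT.println("working at " + here); // placeholder work
                }
            } catch (es:MultipleExceptions) {
                // Failures of places hosting child activities are reported to the
                // enclosing finish, typically wrapped as DeadPlaceExceptions.
                for (e in es.exceptions) {
                    if (e instanceof DeadPlaceException) {
                        Console.OUT.println("lost " + (e as DeadPlaceException).place);
                    }
                }
            }
        }
    }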
This article further develops Resilient X10 to shift the focus from failure awareness to failure recovery, from both a theoretical and a practical standpoint. We rigorously define the distinction between recoverable and catastrophic failures. We revisit the happens-before invariance principle and its implementation. We shift most of the burden of redundancy and recovery from the programmer to the runtime system and standard library. We make it easy to protect critical data from failure using resilient stores and harness elasticity—dynamic place creation—to persist not just the data but also its spatial distribution.
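As a further illustration of the checkpoint-and-restore pattern that resilient stores support, the following self-contained sketch uses a plain HashMap behind a GlobalRef at place 0 as a stand-in for a resilient store (the stores described above additionally survive place failures); the per-place state and the choice of place 1 as the failed place are invented for the example.

    import x10.util.HashMap;

    // Checkpoint/restore sketch (illustrative only; the map below merely
    // stands in for a resilient store and does not itself survive failures).
    public class CheckpointSketch {
        public static def main(args:Rail[String]) {
            val store = GlobalRef[HashMap[Long,Double]](new HashMap[Long,Double]());
            // Checkpoint: every place saves its (invented) local state under its id.
            finish for (p in Place.places()) at (p) async {
                val id = here.id;
                val state = id as Double; // placeholder for real per-place data
                at (store.home) atomic { store().put(id, state); }
            }
            // Restore: a surviving or newly created place reloads the state of the
            // failed place, preserving the original distribution of the data.
            val victim:Long = 1; // pretend place 1 died
            val recovered = at (store.home) store().getOrElse(victim, 0.0);
            Console.OUT.println("recovered state of place " + victim + ": " + recovered);
        }
    }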
We demonstrate the flexibility and practical usefulness of Resilient X10 by building several representative high-performance in-memory parallel application kernels and frameworks. These codes are 10× to 25× larger than previous Resilient X10 benchmarks. For each application kernel, the average runtime overhead of resiliency is less than 7%. By comparing application kernels written in the Resilient X10 and Spark programming models, we demonstrate that Resilient X10’s more general programming model can enable significantly better application performance for resilient in-memory distributed computations.




Published In

ACM Transactions on Programming Languages and Systems, Volume 41, Issue 3
September 2019
278 pages
ISSN: 0164-0925
EISSN: 1558-4593
DOI: 10.1145/3343145
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than the author(s) must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 02 July 2019
Accepted: 01 April 2019
Revised: 01 December 2018
Received: 01 August 2017
Published in TOPLAS Volume 41, Issue 3


Author Tags

  1. APGAS
  2. X10

Qualifiers

  • Research-article
  • Research
  • Refereed

Funding Sources

  • U.S. Air Force Office of Scientific Research
  • U.S. Department of Energy, Office of Science, Advanced Scientific Computing Research

Article Metrics

  • Downloads (Last 12 months): 159
  • Downloads (Last 6 weeks): 30
Reflects downloads up to 15 Feb 2025

Cited By

  • (2024) Exploiting inherent elasticity of serverless in algorithms with unbalanced and irregular workloads. Journal of Parallel and Distributed Computing 190:C. DOI: 10.1016/j.jpdc.2024.104891. Online publication date: 1-Aug-2024.
  • (2023) Elastic deep learning through resilient collective operations. Proceedings of the SC '23 Workshops of the International Conference on High Performance Computing, Network, Storage, and Analysis, 44-50. DOI: 10.1145/3624062.3626080. Online publication date: 12-Nov-2023.
  • (2023) Reliable Actors with Retry Orchestration. Proceedings of the ACM on Programming Languages 7:PLDI, 1293-1316. DOI: 10.1145/3591273. Online publication date: 6-Jun-2023.
  • (2023) RD-FCA. Journal of Parallel and Distributed Computing 179:C. DOI: 10.1016/j.jpdc.2023.04.011. Online publication date: 1-Sep-2023.
  • (2023) Coordination-aware assurance for end-to-end machine learning systems: the R3E approach. AI Assurance, 339-367. DOI: 10.1016/B978-0-32-391919-7.00024-X. Online publication date: 2023.
  • (2022) Task-Level Resilience: Checkpointing vs. Supervision. International Journal of Networking and Computing 12:1, 47-72. DOI: 10.15803/ijnc.12.1_47. Online publication date: 2022.
  • (2022) Supercharging the APGAS Programming Model with Relocatable Distributed Collections. Scientific Programming 2022. DOI: 10.1155/2022/5092422. Online publication date: 1-Jan-2022.
  • (2021) Checkpointing vs. Supervision Resilience Approaches for Dynamic Independent Tasks. 2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 556-565. DOI: 10.1109/IPDPSW52791.2021.00089. Online publication date: Jun-2021.
  • (2020) Serverless Elastic Exploration of Unbalanced Algorithms. 2020 IEEE 13th International Conference on Cloud Computing (CLOUD), 149-157. DOI: 10.1109/CLOUD49709.2020.00033. Online publication date: Oct-2020.
  • (undefined) Exploiting Inherent Elasticity of Serverless in Irregular Algorithms. SSRN Electronic Journal. DOI: 10.2139/ssrn.4165424.
