skip to main content
10.1145/3078597.3078609acmconferencesArticle/Chapter ViewAbstractPublication PageshpdcConference Proceedingsconference-collections
research-article

LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures

Authors Info & Claims
Published:26 June 2017Publication History

ABSTRACT

Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault-tolerance techniques such as checkpoint/restart (C/R) protect HPC applications against hardware faults. These techniques, however, have non negligible overheads particularly when the fault rate exposed by the hardware is high: it is estimated that in future HPC systems, up to 60% of the computational cycles/power will be used for fault tolerance.

To mitigate the overall overhead of fault-tolerance techniques, we propose LetGo, an approach that attempts to continue the execution of a HPC application when crashes would otherwise occur. Our hypothesis is that a class of HPC applications have good enough intrinsic fault tolerance so that its possible to re-purpose the default mechanism that terminates an application once a crash-causing error is signalled, and instead attempt to repair the corrupted application state, and continue the application execution. This paper explores this hypothesis, and quantifies the impact of using this observation in the context of checkpoint/restart (C/R) mechanisms.

Our fault-injection experiments using a suite of five HPC applications show that, on average, LetGo is able to elide 62% of the crashes encountered by applications, of which 80% result in correct output, while incurring a negligible performance overhead. As a result, when LetGo is used in conjunction with a C/R scheme, it enables significantly higher efficiency thereby leading to faster time to solution.

References

  1. Apache Storm. In http:// storm.apache.org/ .Google ScholarGoogle Scholar
  2. D. Abadi, Y. Ahmad, M. Balazinska, and et al. 2005. The Design of the Borealis Stream Processing Engine. In CIDR. 277--289.Google ScholarGoogle Scholar
  3. Y. Ahmad and U. Cetintemel. 2004. Network-aware Query Processing for Streambased Applications. In VLDB. 456--467. Google ScholarGoogle ScholarDigital LibraryDigital Library
  4. L. Aniello, R. Baldoni, and L. Querzoni. 2013. Adaptive Online Scheduling in Storm. In DEBS. 207--218. Google ScholarGoogle ScholarDigital LibraryDigital Library
  5. T. P. Chen, H. Haussecker, A. Bovyrin, and et al. 2005. Computer VisionWorkload Analysis: Case Study of Video Surveillance Systems. Intel Technology Journal 9, 2 (2005).Google ScholarGoogle Scholar
  6. D. Dewitt and J. Gray. 1992. Parallel Database Systems: The Future of High Performance Database Systems. Commun. ACM 35, 6 (1992), 85--98. Google ScholarGoogle ScholarDigital LibraryDigital Library
  7. Jianbing Ding, Tom Z. J. Fu, Richard T. B. Ma, Marianne Winslett, Yin Yang, Zhenjie Zhang, and Hongyang Chao. 2015. Optimal Operator State Migration for Elastic Data Stream Processing. CoRR abs/1501.03619 (2015).Google ScholarGoogle Scholar
  8. M. Elseidy, A. Elguindy, A. Vitorovic, and C. Koch. 2014. Scalable and Adaptive Online Joins. VLDB 7, 6 (2014), 441--452. Google ScholarGoogle ScholarDigital LibraryDigital Library
  9. Tom Z. J. Fu, Jianbing Ding, Richard T. B. Ma, Marianne Winslett, Yin Yang, and Zhenjie Zhang. 2015. DRS: dynamic resource scheduling for real-time analytics over fast streams. In Proceedings of the IEEE 35th International Conference on Distributed Computing Systems (ICDCS). 411--420.Google ScholarGoogle ScholarCross RefCross Ref
  10. Tom Z. J. Fu, Jianbing Ding, Richard T. B. Ma, Marianne Winslett, Yin Yang, Zhenjie Zhang, Yong Pei, and Bingbing Ni. 2015. LiveTraj: Real-Time Trajectory Tracking over Live Video Streams. In Proc. of ACM Multimedia, Demo. 777--780. Google ScholarGoogle ScholarDigital LibraryDigital Library
  11. B. Gedik. 2014. Partitioning Functions for Stateful Data Parallelism in Stream Processing. VLDBJ 23, 4 (2014), 517--539. Google ScholarGoogle ScholarDigital LibraryDigital Library
  12. B. Gedik, S. Schneider, M. Hirzel, and K. Wu. 2014. Elastic Scaling for Data Stream Processing. IEEE Trans. Parallel Distrib. Syst. 25, 6 (2014), 1447--1463. Google ScholarGoogle ScholarDigital LibraryDigital Library
  13. David Karger, Eric Lehman, Tom Leighton, Rina Panigrahy, Matthew Levine, and Daniel Lewin. 1997. Consistent hashing and random trees: Distributed caching protocols for relieving hot spots on the World Wide Web. In STOC. 654--663. Google ScholarGoogle ScholarDigital LibraryDigital Library
  14. Narendra Karmarkar and Richard M Karp. 1982. An efficient approximation scheme for the one-dimensional bin-packing problem. In Foundations of Computer Science. 312--320. Google ScholarGoogle ScholarDigital LibraryDigital Library
  15. R. Khandekar, K. Hildrum, S. Parekh, D. Rajan, J. Wolf, K. Wu, H. Andrade, and B. Gedik. 2009. COLA: Optimizing Stream Processing Applications via Graph Partitioning. In Middleware. 308--327. Google ScholarGoogle ScholarDigital LibraryDigital Library
  16. S. Kulkarni, N. Bhagat, M. Fu, and et al. 2015. Twitter Heron: Stream Processing at Scale. In SIGMOD. 239--250. Google ScholarGoogle ScholarDigital LibraryDigital Library
  17. Mahendra Kutare, Greg Eisenhauer, Chengwei Wang, Karsten Schwan, Vanish Talwar, and Matthew. Wolf. 2010. Monalytics: Online Monitoring and Analytics for Managing Large Scale Data Centers. In ICAC. 141--150. Google ScholarGoogle ScholarDigital LibraryDigital Library
  18. Y. Kwon, M. Balazinska, B. Howe, and J. Rolia. 2012. Skewtune: Mitigating Skew in Mapreduce Applications. In SIGMOD. 25--36. Google ScholarGoogle ScholarDigital LibraryDigital Library
  19. Q. Lin, B. C. Ooi, Z. Wang, and C. Yu. 2015. Scalable Distributed Stream Join Processing. In SIGMOD. 811--825. Google ScholarGoogle ScholarDigital LibraryDigital Library
  20. M. Nasir, G. Morales, D. Garciasoriano, N. Kourtellis, and M. Serafini. 2015. The Power of Both Choices: Practical Load Balancing for Distributed Stream Processing Engines. In ICDE. 137--148.Google ScholarGoogle Scholar
  21. Muhammad Anis Uddin Nasir, Gianmarco De Francisci Morales, Nicolas Kourtellis, and Marco Serafini. 2016. When two choices are not enough: Balancing at scale in distributed stream processing. In ICDE. 589--600.Google ScholarGoogle Scholar
  22. M. Shah, J. Hellerstein, S. Chandrasekaran, and M. Franklin. 2003. Flux: An Adaptive Partitioning Operator for Continuous Query Systems. In ICDE. 25--36.Google ScholarGoogle Scholar
  23. A. Toshniwal, S. Taneja, A. Shukla, K. Ramasamy, J.M. Patel, S. Kulkarni, J. Jackson, K. Gade, M. Fu, J. Donham, and et al. 2014. Storm@ twitter. In SIGMOD. 147--156. Google ScholarGoogle ScholarDigital LibraryDigital Library
  24. B. Ufler, N. Augsten, A. Reiser, and A. Kemper. 2012. Load Balancing in MapReduce Based on Scalable Cardinality Estimates. In ICDE. 522--533. Google ScholarGoogle ScholarDigital LibraryDigital Library
  25. C. Walton, A. Dale, and R. Jenevein. 1991. A Taxonomy and Performance Model of Data Skew Effects in Parallel Joins. In VLDB. 537--548. Google ScholarGoogle ScholarDigital LibraryDigital Library
  26. J. Wolf, N. Bansal, K. Hildrum, S. Parekh, D. Rajan, R. Wagle, K. Wu, and L. Fleischer. 2008. SODA: An Optimizing Scheduler for Large-scale Stream-based Distributed Computer Systems. In Middleware. 306--325. Google ScholarGoogle ScholarDigital LibraryDigital Library
  27. Y. Wu and K. Tan. 2015. ChronoStream: Elastic Stateful Stream Computation in the Cloud. In ICDE. 723--734.Google ScholarGoogle Scholar
  28. Y. Xing, J. Hwang, U. Cetintemel, and S. Zdonik. 2006. Providing Resiliency to Load Variations in Distributed Stream Processing. In VLDB. 775--786. Google ScholarGoogle ScholarDigital LibraryDigital Library
  29. Y. Xing, S. Zdonik, and J. Hwang. 2005. Dynamic Load Distribution in the Borealis Stream Processor. In ICDE. 791--802. Google ScholarGoogle ScholarDigital LibraryDigital Library
  30. Y. Xu, P. Kostamaa, X. Zhou, and L. Chen. 2008. Handling Data Skew in Parallel Joins in Shared-nothing Systems. In SIGMOD. 1043--1052. Google ScholarGoogle ScholarDigital LibraryDigital Library
  31. M. Zaharia, T. Das, H. Li, T. Hunter, S. Shenker, and I. Stoica. 2013. Discretized Streams: Fault-tolerant Streaming Computation at Scale. In SOSP. 423--438. Google ScholarGoogle ScholarDigital LibraryDigital Library
  32. Y. Zhou, B. Ooi, and K. Tan. 2005. Dynamic Load Management for Distributed Continuous Query Systems. In ICDE. 322--323. Google ScholarGoogle ScholarDigital LibraryDigital Library

Index Terms

  1. LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures

                    Recommendations

                    Comments

                    Login options

                    Check if you have access through your login credentials or your institution to get full access on this article.

                    Sign in
                    • Published in

                      cover image ACM Conferences
                      HPDC '17: Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing
                      June 2017
                      254 pages
                      ISBN:9781450346993
                      DOI:10.1145/3078597

                      Copyright © 2017 ACM

                      © 2017 Association for Computing Machinery. ACM acknowledges that this contribution was authored or co-authored by an employee, contractor or affiliate of the United States government. As such, the United States Government retains a nonexclusive, royalty-free right to publish or reproduce this article, or to allow others to do so, for Government purposes only.

                      Publisher

                      Association for Computing Machinery

                      New York, NY, United States

                      Publication History

                      • Published: 26 June 2017

                      Permissions

                      Request permissions about this article.

                      Request Permissions

                      Check for updates

                      Qualifiers

                      • research-article

                      Acceptance Rates

                      HPDC '17 Paper Acceptance Rate19of100submissions,19%Overall Acceptance Rate166of966submissions,17%

                      Upcoming Conference

                    PDF Format

                    View or Download as a PDF file.

                    PDF

                    eReader

                    View online with eReader.

                    eReader