ABSTRACT
Requirements for reliability, low power consumption, and performance place complex and conflicting demands on the design of high-performance computing (HPC) systems. Fault-tolerance techniques such as checkpoint/restart (C/R) protect HPC applications against hardware faults. These techniques, however, have non-negligible overheads, particularly when the fault rate exposed by the hardware is high: it is estimated that in future HPC systems, up to 60% of the computational cycles/power will be used for fault tolerance.
To mitigate the overall overhead of fault-tolerance techniques, we propose LetGo, an approach that attempts to continue the execution of an HPC application when crashes would otherwise occur. Our hypothesis is that a class of HPC applications has good enough intrinsic fault tolerance that it is possible to re-purpose the default mechanism that terminates an application once a crash-causing error is signalled, and instead attempt to repair the corrupted application state and continue the application execution. This paper explores this hypothesis and quantifies the impact of using this observation in the context of checkpoint/restart (C/R) mechanisms.
Our fault-injection experiments using a suite of five HPC applications show that, on average, LetGo is able to elide 62% of the crashes encountered by applications, of which 80% result in correct output, while incurring a negligible performance overhead. As a result, when LetGo is used in conjunction with a C/R scheme, it enables significantly higher efficiency, thereby leading to faster time to solution.
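The two percentages reported above compose, and a short calculation makes the resulting shares explicit. The sketch below assumes that the 80% figure applies to the elided crashes (not to all crashes); the function name `crash_outcomes` and the label `continued_other` are hypothetical, and the abstract does not break down what happens in the elided cases that do not produce correct output.

```python
from fractions import Fraction


def crash_outcomes(elided_pct, correct_of_elided_pct):
    """Split all would-be crashes into three outcome shares, in percent.

    Assumption: correct_of_elided_pct is a share of the elided crashes,
    not of all crashes. Fractions keep the arithmetic exact.
    """
    elided = Fraction(elided_pct)
    correct = elided * Fraction(correct_of_elided_pct) / 100
    return {
        "continued_correct": correct,        # elided and output is correct
        "continued_other": elided - correct,  # elided, output not reported correct
        "still_crashes": 100 - elided,        # not elided
    }


# Averages quoted in the abstract: 62% elided, 80% of those correct.
shares = crash_outcomes(62, 80)
for name, share in shares.items():
    print(f"{name}: {float(share):.1f}%")
```

Under that reading, roughly half of all would-be crashes (49.6%) continue to a correct result, 12.4% continue without a correct result being reported, and 38% still terminate the application.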
LetGo: A Lightweight Continuous Framework for HPC Applications Under Failures