Abstract
A distributed RAID in a serverless cluster computer is used to save checkpoint files periodically, as directed by the checkpointing algorithm, for rollback recovery. The striped checkpointing algorithm proposed in this paper combines the merits of coordinated and staggered checkpointing. Coordination enables parallel I/O on distributed disks, and staggering avoids network bottlenecks in distributed disk I/O operations. For a fixed cluster size, we reveal the tradeoffs between these two speedup techniques. The striped checkpointing approach allows dynamic reconfiguration to minimize checkpointing overhead among concurrent software processes. We demonstrate how to reduce the overhead by striping and staggering dynamically. For communication-intensive computational programs, the new scheme can significantly reduce the checkpointing overhead. Linpack HPC Benchmark results demonstrate the benefits of trading off stripe parallelism against distributed staggering. These results are useful for designing efficient checkpointing algorithms for fast rollback recovery from any single node failure in a cluster computer.
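The tradeoff between striping and staggering for a fixed cluster size can be pictured with a small scheduling sketch. This is an illustration under stated assumptions, not the algorithm evaluated in the paper: the function name `striped_schedule` and the parameter `stripe_width` are hypothetical, and the sketch assumes that nodes in one stripe group write their checkpoints in parallel while successive groups are staggered into later time slots.

```python
def striped_schedule(num_nodes, stripe_width):
    """Hypothetical helper: split nodes into stripe groups.

    Assumption (not from the paper): nodes within one group checkpoint
    in parallel (striping), and each group occupies its own time slot
    (staggering). A wider stripe means fewer staggered slots.
    """
    if not 1 <= stripe_width <= num_nodes:
        raise ValueError("stripe_width must be between 1 and num_nodes")
    # Consecutive node IDs form one group; the last group may be smaller.
    return [
        list(range(start, min(start + stripe_width, num_nodes)))
        for start in range(0, num_nodes, stripe_width)
    ]


if __name__ == "__main__":
    # Fixed cluster of 12 nodes: vary the stripe width to trade
    # parallel I/O within a slot against the number of staggered slots.
    for width in (1, 4, 12):
        schedule = striped_schedule(12, width)
        print(f"stripe_width={width}: {len(schedule)} staggered slot(s)")
```

With the cluster size held fixed, a stripe width of 1 corresponds to fully staggered checkpointing, and a stripe width equal to the cluster size corresponds to fully coordinated checkpointing; intermediate widths mix the two.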
Copyright information
© 2003 Springer-Verlag Berlin Heidelberg
About this paper
Cite this paper
Chang, Y.S., Cho, S.Y., Kim, B.Y. (2003). Performance Evaluation of the Striped Checkpointing Algorithm on the Distributed RAID for Cluster Computer. In: Sloot, P.M.A., Abramson, D., Bogdanov, A.V., Gorbachev, Y.E., Dongarra, J.J., Zomaya, A.Y. (eds) Computational Science — ICCS 2003. ICCS 2003. Lecture Notes in Computer Science, vol 2658. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-44862-4_103
DOI: https://doi.org/10.1007/3-540-44862-4_103
Publisher Name: Springer, Berlin, Heidelberg
Print ISBN: 978-3-540-40195-7
Online ISBN: 978-3-540-44862-4
eBook Packages: Springer Book Archive