Abstract
People increasingly use multiple devices to meet growing computing demands. With the adoption of multi-card computing, fault tolerance, load balancing, and resource sharing have become pressing issues, and a checkpoint/restart (CR) mechanism is critical in a preemptive system. This paper proposes a checkpoint/restart framework that includes an automatic compiler (CRAC) to achieve a feasible checkpoint/restart system, especially for GPU applications written in OpenCL and running on heterogeneous devices. Given the positions of the checkpoints and restarts in the source code, CRAC inserts primitives into the program and invokes the runtime support modules to produce the final results. A comprehensive example and experiments demonstrate the feasibility and effectiveness of the proposed framework.
Supported by the Natural Science Foundation of China (No. 61572022) and the Ningbo eHealth Project (No. 2016C11024).
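The abstract describes CRAC as taking user-supplied checkpoint/restart positions and inserting primitives into the program at those points. As a rough illustration of that idea only, the following sketch performs a line-based insertion on a toy host program. The primitive names (cr_checkpoint, cr_restart), the position format, and the helper insert_primitives are all assumptions made for this example; the abstract does not specify CRAC's actual interface, and a real compiler would work on a parsed program rather than raw lines.

```python
# Hypothetical sketch: the primitive names and the position format below are
# assumptions for illustration, not CRAC's actual interface.
CHECKPOINT_CALL = "cr_checkpoint();"  # assumed primitive name
RESTART_CALL = "cr_restart();"        # assumed primitive name


def insert_primitives(source, positions):
    """Return `source` with a primitive inserted after each marked line.

    `positions` maps a 1-based line number to "checkpoint" or "restart".
    """
    calls = {"checkpoint": CHECKPOINT_CALL, "restart": RESTART_CALL}
    out = []
    for number, line in enumerate(source.splitlines(), start=1):
        out.append(line)
        kind = positions.get(number)
        if kind is not None:
            # Reuse the indentation of the marked line for the inserted call.
            indent = line[: len(line) - len(line.lstrip())]
            out.append(indent + calls[kind])
    return "\n".join(out)


# Toy host program; the function names inside it are placeholders.
host_code = """\
int main(void) {
    setup_device();
    run_kernel_phase_one();
    run_kernel_phase_two();
    return 0;
}"""

# Mark a restart point after line 2 and a checkpoint after line 3.
instrumented = insert_primitives(host_code, {2: "restart", 3: "checkpoint"})
print(instrumented)
```

In this sketch the inserted calls would be resolved by a separate runtime library at link time, which mirrors the split the abstract draws between the compiler and the runtime support modules.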
Copyright information
© 2020 Springer Nature Singapore Pte Ltd.
About this paper
Cite this paper
Chen, G., Zhang, J., Zhu, Z., Zhu, C., Jiang, H., Pang, C. (2020). CRAC: An Automatic Assistant Compiler of Checkpoint/Restart for OpenCL Program. In: He, J., et al. Data Science. ICDS 2019. Communications in Computer and Information Science, vol 1179. Springer, Singapore. https://doi.org/10.1007/978-981-15-2810-1_54
DOI: https://doi.org/10.1007/978-981-15-2810-1_54
Publisher Name: Springer, Singapore
Print ISBN: 978-981-15-2809-5
Online ISBN: 978-981-15-2810-1
eBook Packages: Computer Science, Computer Science (R0)