Abstract
As General-Purpose Graphics Processing Units (GPGPUs) become widely used in High-Performance Computing (HPC) applications, their vulnerability to soft errors has become a critical concern. In this paper, we propose an efficient instruction duplication mechanism that duplicates only the instructions vulnerable to Silent Data Corruption (SDC), reducing the overhead of reliability protection. We first observe that the SDC proneness of an individual instruction is related to its instruction type, its fault propagation, and whether it affects shared memory. Leveraging these factors, we use machine learning to identify the SDC-vulnerable instructions of GPU applications and protect them efficiently. Experimental results show that our method achieves 90.45% SDC coverage while duplicating only 37.8% of static instructions, a significant improvement in both performance and SDC detection capability over the state-of-the-art duplication technique for GPUs.
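The two metrics reported above, SDC coverage and the fraction of static instructions duplicated, can be illustrated with a minimal sketch. It is not the paper's implementation: the `Instr` record, the `predict_vulnerable` rule (a hand-written stand-in for a trained classifier), its weights and threshold, and every number in `PROGRAM` are hypothetical and chosen only to make the arithmetic visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    pc: int              # static instruction index (hypothetical)
    kind: str            # instruction type, e.g. "arith", "load", "store", "control"
    fanout: int          # number of dependent instructions, a crude fault-propagation proxy
    writes_shared: bool  # whether the result reaches shared memory
    sdc_count: int       # SDCs attributed to this instruction in an invented fault-injection run


def predict_vulnerable(instr: Instr) -> bool:
    """Stand-in for a trained classifier; the weights and threshold are invented."""
    score = 0
    if instr.kind in ("arith", "store"):
        score += 4
    score += min(instr.fanout, 5)
    if instr.writes_shared:
        score += 3
    return score >= 6


def evaluate(program):
    """Return (SDC coverage, fraction of static instructions duplicated)."""
    protected = [i for i in program if predict_vulnerable(i)]
    total_sdc = sum(i.sdc_count for i in program)
    covered_sdc = sum(i.sdc_count for i in protected)
    coverage = covered_sdc / total_sdc if total_sdc else 0.0
    dup_ratio = len(protected) / len(program) if program else 0.0
    return coverage, dup_ratio


# Invented five-instruction kernel; none of these values come from the paper.
PROGRAM = [
    Instr(0, "load", 1, False, 2),
    Instr(1, "arith", 4, False, 30),
    Instr(2, "arith", 1, False, 3),
    Instr(3, "store", 0, True, 25),
    Instr(4, "control", 2, False, 5),
]

if __name__ == "__main__":
    coverage, dup_ratio = evaluate(PROGRAM)
    print(f"SDC coverage: {coverage:.2%}, static instructions duplicated: {dup_ratio:.2%}")
```

In this toy setting, coverage is the share of observed SDCs that originate from the protected instructions, so a selector does well when a small protected set accounts for most of the SDCs.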
This work is supported by the National Natural Science Foundation of China (NSFC) (Grants No. 61772228 and No. U19A2061), the National Key Research and Development Program of China (Grant No. 2017YFC1502306), and the Interdisciplinary Research Funding Program for Doctoral Students of Jilin University (Grants No. 101832020DJX063 and No. 101832020DJX007).
© 2021 Springer Nature Switzerland AG
About this paper
Cite this paper
Wei, X., Jiang, N., Wang, X., Yue, H. (2021). Detecting SDCs in GPGPUs Through an Efficient Instruction Duplication Mechanism. In: Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, SY. (eds) Knowledge Science, Engineering and Management. KSEM 2021. Lecture Notes in Computer Science, vol 12817. Springer, Cham. https://doi.org/10.1007/978-3-030-82153-1_47
DOI: https://doi.org/10.1007/978-3-030-82153-1_47
Publisher Name: Springer, Cham
Print ISBN: 978-3-030-82152-4
Online ISBN: 978-3-030-82153-1
eBook Packages: Computer Science (R0)