Abstract:
Hardware reliability has been a major concern for nano-scale computing systems. Different hardware design choices, application workloads, and software management schemes can jointly affect the system's resilience. In this paper, we first develop a hardware evaluation platform based on an embedded/mobile development board and a standard Linux kernel. We demonstrate the use of our platform to evaluate the system's power and radiation-induced soft error rate in the presence of system power management schemes, under different application workloads and various hardware design configurations. We also propose system/cloud-based virtual sensing to capture varying ambient conditions for reliability evaluation. New reliability management policies are proposed and implemented in the Linux kernel to exploit the flexibility in different existing power management schemes. We demonstrate that our policies can achieve the system reliability target under varying application workloads and ambient conditions. Experiments show that our policies are efficient, incurring less than 3% additional power overhead compared to the optimal schemes characterized offline.
Published in: 2015 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES)
Date of Conference: 04-09 October 2015
Date Added to IEEE Xplore: 12 November 2015
Electronic ISBN: 978-1-4673-8320-2