Abstract
As technology scales, transistors become smaller, and aggressive power optimization techniques are combined with high operating frequencies and performance-enhancing microarchitectural techniques to achieve ever higher performance and power efficiency. Unfortunately, these developments make modern systems more vulnerable to soft errors, which are becoming a critical issue in both the hardware and software domains. Motivated by this observation, in this work, we propose, implement, and evaluate two error propagation metrics in order to characterize error propagation at both the software and hardware levels. The first metric measures error propagation on program data structures, whereas the second measures the fraction of corrupted locations in the cache memory structure over a given period of time. We evaluate our proposed metrics through an empirical study of two application programs using both single-threaded and multi-threaded executions, varying experimental parameters such as thread count, error rate, location of errors, and architectural parameters. Our extensive experimental analysis reveals that error propagation over program data structures is highly dependent on application behavior. Further, depending on the cache parameters used, propagation of errors in the cache can exhibit different patterns. This paper also discusses how the observed error propagation trends in program data structures and data caches are correlated with each other, focusing in particular on the differences in error propagation speeds between application data structures and data caches.
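As a rough illustration of the second metric, the sketch below averages the fraction of corrupted cache lines over a window of sampling points. The function name `corrupted_fraction`, the snapshot representation (a set of corrupted line indices per sample), and the simple averaging are all assumptions made for illustration; the abstract does not give the metric's exact definition, so this should not be read as the authors' formulation.

```python
from typing import Iterable, Set


def corrupted_fraction(snapshots: Iterable[Set[int]], num_lines: int) -> float:
    """Average fraction of corrupted cache lines over a window of snapshots.

    Each snapshot is the set of cache-line indices holding corrupted data at
    one sampling point. Averaging over samples is an assumed reading of the
    metric, not the paper's exact definition.
    """
    if num_lines <= 0:
        raise ValueError("num_lines must be positive")
    # Fraction of the cache that is corrupted at each sampling point.
    fractions = [len(s) / num_lines for s in snapshots]
    if not fractions:
        return 0.0
    return sum(fractions) / len(fractions)


# Hypothetical 8-line cache sampled at four points in time.
window = [set(), {3}, {3, 5}, {5}]
print(corrupted_fraction(window, num_lines=8))
```

In this toy window, a corrupted line appears, the corruption spreads to a second line, and the first is then evicted or overwritten, so the metric reflects both how far errors spread and how long they stay resident.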
Data Availability
The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.
Notes
Clearly, OS data structures can also be corrupted; in this work, however, we focus exclusively on application data structures.
This is the case, for example, in embedded and mobile systems.
Acknowledgements
This research was supported by The Scientific and Technological Research Council of Turkey (TUBITAK) with a research grant (Project Number: 118E715).
Cite this article
Ozturk, Z., Topcuoglu, H.R. & Kandemir, M.T. Studying error propagation on application data structure and hardware. J Supercomput 78, 18691–18724 (2022). https://doi.org/10.1007/s11227-022-04625-x
DOI: https://doi.org/10.1007/s11227-022-04625-x