Studying error propagation on application data structure and hardware

The Journal of Supercomputing

Abstract

As technology scales, transistors become smaller, and aggressive power optimization techniques, combined with high operating frequencies and performance-enhancing microarchitectural techniques, are employed to achieve ever higher performance and power efficiency. Unfortunately, these developments make modern systems more vulnerable to soft errors, which are becoming a critical issue in both the hardware and software domains. Motivated by this observation, we propose, implement, and evaluate two error propagation metrics that characterize error propagation at the software and hardware levels. The first metric measures error propagation on program data structures, whereas the second measures the fraction of corrupted locations in the cache memory structure over a given period of time. We evaluate the proposed metrics through an empirical study of two application programs, using both single-threaded and multi-threaded executions and varying experimental parameters such as thread count, error rate, error location, and architectural parameters. Our extensive experimental analysis reveals that error propagation over program data structures is highly dependent on application behavior. Further, depending on the cache parameters used, the propagation of errors in the cache can exhibit different patterns. This paper also discusses how the observed error propagation trends in program data structures and data caches correlate with each other, focusing in particular on the differences in error propagation speed between application data structures and data caches.
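
As a rough illustration of the two kinds of metric described in the abstract, the sketch below computes (a) the fraction of elements in a data structure that differ from a fault-free run and (b) the fraction of corrupted cache locations at each time step. The precise definitions used in the paper are not given in this excerpt, so the function names, the snapshot representation, and the per-step sets of corrupted line indices are all assumptions made for illustration only.

```python
from typing import Iterable, List, Sequence, Set


def corrupted_fraction(golden: Sequence[float], faulty: Sequence[float]) -> float:
    """Fraction of elements that differ between a fault-free ("golden")
    snapshot of a data structure and a snapshot from a faulty run.
    Hypothetical helper; not the paper's exact metric."""
    if len(golden) != len(faulty):
        raise ValueError("golden and faulty snapshots must have the same length")
    if not golden:
        return 0.0
    differing = sum(1 for g, f in zip(golden, faulty) if g != f)
    return differing / len(golden)


def cache_corruption_over_time(
    corrupted_per_step: Iterable[Set[int]], total_lines: int
) -> List[float]:
    """For each time step, the fraction of cache locations that hold
    corrupted data. Each step is given as a set of corrupted line indices
    (an assumed representation)."""
    if total_lines <= 0:
        raise ValueError("total_lines must be positive")
    return [len(lines) / total_lines for lines in corrupted_per_step]


# Two of the four elements differ from the fault-free values.
golden = [1.0, 2.0, 3.0, 4.0]
faulty = [1.0, 2.5, 3.0, 4.5]
print(corrupted_fraction(golden, faulty))

# An 8-line cache observed over four time steps.
steps = [set(), {3}, {3, 7}, {7}]
print(cache_corruption_over_time(steps, total_lines=8))
```

Plotting both series against time for the same run would be one way to compare how quickly errors spread in application data structures versus the data cache, which is the comparison the abstract highlights.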

Data Availability

The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Notes

  1. OS data structures can clearly be corrupted as well; in this work, however, we focus exclusively on application data structures.

  2. This is the case, for example, in embedded and mobile systems.

Acknowledgements

This research was supported by The Scientific and Technological Research Council of Turkey (TUBITAK) with a research grant (Project Number: 118E715).

Author information

Corresponding author

Correspondence to Haluk Rahmi Topcuoglu.

Cite this article

Ozturk, Z., Topcuoglu, H.R. & Kandemir, M.T. Studying error propagation on application data structure and hardware. J Supercomput 78, 18691–18724 (2022). https://doi.org/10.1007/s11227-022-04625-x

  • DOI: https://doi.org/10.1007/s11227-022-04625-x
